

2.4 Reliable Network Communication

 

Although we envision the use of switched networks, packet losses are possible. They can occur in switches or even within the hit-server's Ethernet chips, even though the hit-server always has enough receive buffers available. The point of congestion is not main memory itself but the memory/PCI bus. As described in more detail in Section 3.1, the maximum memory-bus bandwidth for DMA is approximately 600 Mbps. As soon as the sum of all cards' incoming and outgoing Ethernet traffic exceeds this value [1], the receiver FIFOs (4 KB each) in the Ethernet chips can run over. (The problem is real: the current hardware uses 7 full-duplex Ethernet cards, enabling peaks of 1400 Mbps.)
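The congestion condition above can be checked with a back-of-the-envelope calculation. The sketch below uses the figures given in the text (7 full-duplex cards, a 1400 Mbps aggregate peak, roughly 600 Mbps of DMA bandwidth); the 100 Mbps per-direction link rate is inferred from those figures, not stated explicitly:

```python
# Figures from the text; the per-link rate is inferred from
# 1400 Mbps / (7 cards * 2 directions).
CARDS = 7
LINK_MBPS = 100                      # Ethernet rate, per direction
DMA_MBPS = 600                       # usable memory-bus bandwidth for DMA

peak_mbps = CARDS * LINK_MBPS * 2    # incoming + outgoing traffic
print(peak_mbps)                     # -> 1400
print(peak_mbps > DMA_MBPS)          # -> True: the 4 KB FIFOs can run over
```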

The first obvious choice for a reliable transport protocol is TCP. Although some of its features are not necessary for our application (e.g., checksums, handling duplicates and out-of-order packets), adaptive flow control and retransmission of lost packets are required. It is well known that TCP is often costly in terms of processor cycles [18] and that a VMTP-like [4] protocol is better suited for transactions.

An even more important problem with TCP is that its congestion-avoidance policies (which are based primarily on end-to-end flow control) are tailored to current WANs and are not effective for our envisioned scenarios: on a highly loaded LAN, we experience dramatically changing loss rates, e.g., 0% loss for 5 ms, then 40% for 2 ms, etc. TCP would very quickly reduce its window size to a single packet. This would result in poor bandwidth utilization and would not avoid packet losses, since, in peak situations, the loss rate depends more on the synchronization between the clients than on the sender's transmission rate. Given that under peak load the round-trip time exceeds 1 ms (the client's and server's hardware FIFOs alone account for a 0.9 ms delay), it is very difficult to devise a flow-control protocol that can track such rapid fluctuations efficiently.

Instead, we use a late-retransmission protocol. Basically, the sender transmits the whole object in a burst, as fast as the network hardware permits. Afterwards, the receiver tells the sender which parts of the object it has received; finally, the sender retransmits the missing, i.e., lost, parts (if any). This procedure is repeated until all data has been transferred.
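The round structure above can be sketched as a toy simulation (an illustrative sketch under an assumed independent per-packet loss rate, not the paper's implementation):

```python
import random

def late_retransmission(num_packets, loss_rate, seed=42):
    """Toy simulation of the late-retransmission scheme: each round,
    the sender bursts every still-missing packet; the receiver then
    reports what arrived, and the next round re-sends only the lost
    packets, until the whole object has been transferred."""
    rng = random.Random(seed)
    missing = set(range(num_packets))
    rounds = packets_sent = 0
    while missing:
        rounds += 1
        for p in list(missing):            # burst all outstanding packets
            packets_sent += 1
            if rng.random() >= loss_rate:  # packet survives congestion
                missing.discard(p)         # receiver's report marks it done
    return rounds, packets_sent

rounds, sent = late_retransmission(1000, loss_rate=0.5)
print(rounds, sent)   # total traffic approaches 2x the object size
```

Note that the sender never throttles; reliability comes entirely from the per-round receiver reports.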

This protocol does not work for arbitrary topologies. However, it behaves nicely on a star topology, as in our scenario, where nearly all communication either goes to or comes from the hit-server. With a switched network, the protocol ensures that the hit-server receives data at its maximum rate and that all clients get close to optimal bandwidth. At first glance, this might be counterintuitive, since the protocol seems to invite congestion rather than avoid it. However, assume that two clients simultaneously send a 1 MByte object each to the same hit-server card. Due to congestion, half of the data sent is always lost: on the first round, each client sends the entire object; on the second round, each client sends the lost 1/2 M; on the third, the 1/4 M lost in the second round; etc. In total, each client sends 2 MByte at full speed to effectively transfer 1 MByte, i.e., it gets 50% of the available bandwidth. In the same time, the hit-server receives 2x1 MByte at the highest possible rate. The point is that the only packets lost are those that could not have been transmitted under ideal flow control anyway, and that the ``unnecessary'' transmissions consume only resources that would otherwise be unused [2].
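The 2 MByte total in the two-client example is just a geometric series: with a constant 50% loss rate, each round halves the amount still to send, so each client transmits 1 + 1/2 + 1/4 + ... = 2 MByte per 1 MByte object. A minimal check of this arithmetic:

```python
# Iterate the rounds of the two-client example: each round bursts
# everything still missing, and half of it is lost to congestion.
object_mbyte = 1.0
remaining = object_mbyte
total_sent = 0.0
while remaining > 1e-9:      # run rounds until (almost) nothing is missing
    total_sent += remaining  # burst everything that is still missing
    remaining /= 2           # half of the burst is lost
print(round(total_sent, 3))  # -> 2.0
```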

Since this paper concentrates on the hit-server design, we will neither go into the details of the protocol, nor prove its properties, nor discuss further ``good'' topologies here. For the context of this paper, it is relevant that the protocol is reliable, performs well under peak load, is cheap for low loss rates, is robust against random losses and highly fluctuating loss rates, and can be implemented ``asymmetrically'' such that it requires more processor cycles on the client side and fewer on the hit-server side.


[1] In reality, the situation is complicated by DMA bursts, bus-arbitration policies, and the existence of multi-level PCI buses. However, all of this is hardware, and most of its parameters and algorithms cannot be influenced by software. A detailed description is beyond the scope of this paper.

[2] There are some pathological situations. If, e.g., 100 clients simultaneously start sending an object of 100 packets, each round effectively transfers only 1 packet per client. We would then need 1 ack per transferred packet, similar to 1-packet windows in TCP. (Nicely, we would need only 0.1 acks per transferred packet if 1000-packet objects were sent.) Therefore, as soon as a client notices that the ratio of acks to effectively transferred packets becomes too high, it takes random rests while transmitting the packets. Since all active clients act in a similar way, the congestion and the loss rate decrease, so the ack ratios improve. For short objects, additional transmission rounds are needed: the packets are retransmitted a second or third time without waiting for an ack.
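The back-off rule in footnote 2 might be sketched as follows; the ratio threshold and the rest length are made-up illustration values, not parameters taken from the paper:

```python
import random

def rest_before_next_packet(acks, delivered, max_ratio=0.5, rng=None):
    """Sketch of footnote 2's back-off rule: once a client observes
    too many acks per effectively transferred packet, it takes a
    random rest before continuing to transmit.  Threshold and rest
    length are hypothetical illustration values."""
    rng = rng or random.Random(0)
    ratio = acks / max(delivered, 1)    # acks per transferred packet
    if ratio > max_ratio:
        return rng.uniform(0.0, 0.001)  # rest up to 1 ms (arbitrary)
    return 0.0                          # ratio is fine: keep bursting

print(rest_before_next_packet(1, 100))      # 0.01 acks/packet -> 0.0
print(rest_before_next_packet(1, 1) > 0.0)  # 1 ack/packet -> True
```

Because every overloaded client rests independently at random, the aggregate offered load drops without any coordination among clients.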



Vsevolod Panteleenko
Tue Apr 28 11:56:10 EDT 1998