This is the third article of our series on TCP, covering all that you need to know to troubleshoot performance problems impacting business-critical applications. After considering how TCP opens and closes connections, we will now examine problems that can happen to a connection in progress, specifically network packet loss.
What causes network packet loss?
The two most common causes of network packet loss are:
- Layer 2 (L2) errors
- Network congestion
If a frame is corrupted in transit between two points on a connection due to cabling issues, duplex mismatches, or other physical-layer events, the receiver will detect the corruption when it checks the frame check sequence and will drop the frame. In most cases, an error counter will be incremented on the interface, which helps when locating where the loss occurred.
Traffic congestion can cause input/output discards on interface links, especially where traffic steps down between link speeds (10Gbps to 1Gbps for example). On these connections, the egress link may not be able to keep up with the amount of ingress traffic. Once the egress buffer fills, additional packets are dropped. These drops are typically labelled as “discards” on interfaces. The sender of the traffic will detect that the loss occurred and retransmit.
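To get a feel for how quickly a speed mismatch turns into discards, the arithmetic can be sketched as below. This is only an illustration: the buffer size is a hypothetical value, not a figure from any particular switch, and real devices share buffers across ports in ways this ignores.

```python
def time_to_fill_buffer(ingress_bps, egress_bps, buffer_bytes):
    """Seconds until an egress buffer overflows under sustained ingress load.

    Returns None when the egress link can keep up (no backlog builds).
    """
    backlog_rate_bps = ingress_bps - egress_bps
    if backlog_rate_bps <= 0:
        return None
    return (buffer_bytes * 8) / backlog_rate_bps


# Hypothetical values: 10 Gbps in, 1 Gbps out, 12 MB of egress buffer.
seconds = time_to_fill_buffer(10e9, 1e9, 12_000_000)
print(f"Buffer overflows after about {seconds * 1000:.1f} ms of sustained burst")
```

Under these assumed numbers, a burst lasting only a few milliseconds is enough to start dropping packets, which is why discards can appear on links whose average utilization looks modest.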
As we have seen in this series, TCP is a connection-oriented protocol. Part of the function of establishing a connection is creating the mechanism to track data that has been sent and acknowledge what is received. This way, TCP can detect if a packet goes missing and resend it accordingly, ensuring reliable transmission of data.
Network packet loss? Are we still coping with that today?
Yes. Even though network links have matured to 10Gbps and beyond, packet loss is still an underlying network event that impacts applications today. To troubleshoot these issues, we first need to understand how packets are dropped, how we can detect these events, and how we can resolve them.
Each byte of data sent in a TCP connection has an associated sequence number. The sequence number field of the TCP header carries the number of the first byte of data in that segment.
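Because sequence numbers count bytes rather than packets, the value in each segment advances by the length of the payload that came before it. A minimal sketch of that bookkeeping follows; the starting sequence number and payload sizes are made-up values for illustration.

```python
def segment_sequence_numbers(initial_seq, payload_lengths):
    """Return (seq, length, expected_ack) for each segment in order.

    seq is the number of the first payload byte in the segment;
    expected_ack is the next byte the receiver will ask for after it.
    Sequence numbers are 32-bit values and wrap around.
    """
    segments = []
    seq = initial_seq
    for length in payload_lengths:
        next_seq = (seq + length) % 2**32
        segments.append((seq, length, next_seq))
        seq = next_seq
    return segments


# Hypothetical stream: three segments starting at sequence number 1000.
for seq, length, ack in segment_sequence_numbers(1000, [1460, 1460, 500]):
    print(f"seq={seq} len={length} -> receiver acknowledges {ack}")
```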
When the receiving socket detects an incoming segment of data, it uses the acknowledgement number in the TCP header to indicate receipt. After sending a segment of data, the sender will start a retransmission timer whose length varies with the measured round-trip time of the connection. If it does not receive an acknowledgment before the timer expires, the sender will assume the segment has been lost and will retransmit it.
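The retransmission timer is variable because it follows the round-trip time observed on the connection. The sketch below follows the calculation described in RFC 6298 (smoothed RTT plus four times the RTT variation, doubled on each timeout). The class name and the `min_rto` parameter are illustrative choices: the RFC specifies a 1 second floor, and the 0.2 second floor used in the example is an assumption made so that the effect of the samples is visible.

```python
class RtoEstimator:
    """Retransmission timeout estimator following the RFC 6298 formulas."""

    ALPHA = 1 / 8   # weight of a new sample in the smoothed RTT
    BETA = 1 / 4    # weight of a new sample in the RTT variation
    K = 4           # multiplier applied to the RTT variation
    MAX_RTO = 60.0  # upper bound in seconds

    def __init__(self, min_rto=1.0, clock_granularity=0.001):
        self.min_rto = min_rto
        self.granularity = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any RTT has been measured

    def on_rtt_sample(self, rtt):
        """Update the estimate with a measured round-trip time in seconds."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        candidate = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(self.MAX_RTO, max(self.min_rto, candidate))
        return self.rto

    def on_timeout(self):
        """Back off: double the timer each time it expires."""
        self.rto = min(self.MAX_RTO, self.rto * 2)
        return self.rto


estimator = RtoEstimator(min_rto=0.2)  # assumed floor, lower than the RFC's 1 s
for sample in (0.100, 0.120, 0.300):
    print(f"RTT sample {sample:.3f}s -> RTO {estimator.on_rtt_sample(sample):.4f}s")
```

The doubling in `on_timeout` is why a connection that relies on timer expiry alone recovers slowly from repeated loss.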
The TCP retransmission mechanism ensures that data is reliably sent from end to end. If retransmissions are detected in a TCP connection, it is logical to assume that packet loss has occurred on the network somewhere between client and server.
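One simplified way to spot retransmissions in a capture is to flag any data segment that starts below the highest byte already seen in that direction. The sketch below assumes a list of `(seq, length)` tuples for one direction of one connection, ignores sequence number wraparound, and cannot tell a retransmission from a packet that merely arrived out of order, so treat it as an approximation of what a packet analyzer does rather than a reproduction of it.

```python
def find_retransmissions(segments):
    """Return indices of data segments that carry already-sent bytes.

    segments: list of (seq, length) tuples in capture order, taken from
    one direction of a single TCP connection.
    """
    highest_end = None  # sequence number just past the highest byte seen
    flagged = []
    for index, (seq, length) in enumerate(segments):
        if length == 0:
            continue  # pure ACKs carry no data to retransmit
        end = seq + length
        if highest_end is not None and seq < highest_end:
            flagged.append(index)
        if highest_end is None or end > highest_end:
            highest_end = end
    return flagged


# Hypothetical trace: the segment starting at 2460 is sent twice.
trace = [(1000, 1460), (2460, 1460), (3920, 1460), (2460, 1460), (5380, 1460)]
retransmitted = find_retransmissions(trace)
rate = len(retransmitted) / len(trace)
print(f"Retransmitted segments at indices {retransmitted}, rate {rate:.0%}")
```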
TCP Duplicate / Selective Acknowledgments
Most packet analyzers will indicate a duplicate acknowledgment condition when two ACK packets are detected with the same ACK numbers.
How do these happen?
Sending TCP sockets usually transmit data in a series. Rather than sending one segment of data at a time and waiting for an acknowledgement, transmitting stations will send several packets in succession. If one of these packets in the stream goes missing, the receiving socket can indicate which packet was lost using selective acknowledgments.
These allow the receiver to continue to acknowledge incoming data while informing the sender of the missing packet(s) in the stream.
In these acknowledgements, the ACK number in the TCP header stays at the first missing byte, which tells the sender which packet was lost. At the same time, the receiver can use the SACK option in the TCP header to show which packets have been successfully received after the point of loss.
Support for SACK is advertised by each station at the beginning of the TCP connection, in the SYN packets of the handshake. Most network analyzers will flag these packets as duplicate acknowledgements because the ACK number will stay the same until the missing packet is retransmitted, filling the gap in the sequence.
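The receiver's side of this exchange can be sketched as follows. Given the next byte it expects and the segments that have arrived, the function returns the cumulative ACK number plus the blocks it would report through SACK. This is a simplified model: real stacks limit the number of SACK blocks per packet and order them by recency, which is not reproduced here, and the function name is a hypothetical one chosen for the example.

```python
def build_ack(expected_seq, received):
    """Return (ack_number, sack_blocks) for the segments received so far.

    expected_seq: the first byte the receiver has not yet acknowledged.
    received: list of (seq, length) tuples, in any order.
    Each SACK block is (left_edge, right_edge), where right_edge is the
    sequence number just past the last byte of the block.
    """
    # Merge the received segments into contiguous byte ranges.
    merged = []
    for start, end in sorted((seq, seq + length) for seq, length in received):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    ack = expected_seq
    sack_blocks = []
    for start, end in merged:
        if start <= ack:
            ack = max(ack, end)               # in-order data advances the ACK
        else:
            sack_blocks.append((start, end))  # data beyond a gap is SACKed
    return ack, sack_blocks


# Hypothetical stream: the segment starting at 2460 never arrived.
arrived = [(1000, 1460), (3920, 1460), (5380, 1460)]
print(build_ack(1000, arrived))

# After the retransmission fills the gap, the ACK number jumps forward.
print(build_ack(1000, arrived + [(2460, 1460)]))
```

Every ACK sent while the gap exists carries the same ACK number, which is exactly why analyzers mark them as duplicates.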
Typically, duplicate acknowledgements mean that one or more packets have been lost in the stream and the connection is attempting to recover. They are a common symptom of packet loss. In most cases, once the sender receives three duplicate acknowledgments, it will immediately retransmit the missing packet instead of waiting for a timer to expire. These are called fast retransmissions.
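The sender's decision can be sketched as a simple counter over incoming ACK numbers. The function below is a minimal illustration under stated assumptions: it looks only at ACK numbers, and it leaves out the congestion window adjustments that accompany a fast retransmission in a real stack.

```python
DUP_ACK_THRESHOLD = 3  # duplicates needed to trigger a fast retransmission


def fast_retransmit_points(ack_numbers):
    """Return the sequence numbers a sender would fast-retransmit.

    ack_numbers: ACK numbers in the order the sender receives them.
    """
    last_ack = None
    dup_count = 0
    retransmitted = []
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                # The ACK number names the first missing byte.
                retransmitted.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return retransmitted


# Hypothetical ACK stream: one original ACK for 2460, then four duplicates.
acks = [2460, 2460, 2460, 2460, 2460, 6840]
print(fast_retransmit_points(acks))
```

Note that the first ACK for a given number is not a duplicate, so the retransmission fires on the fourth ACK carrying that number.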
Connections with more latency between client and server will typically have more duplicate acknowledgement packets when a segment is lost. In high latency connections, it is possible to observe several hundred duplicate acknowledgements for a single lost packet.
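A rough estimate shows why latency matters here. Each segment that arrives after the lost one triggers another duplicate acknowledgement, and about one round trip's worth of segments is in flight before the retransmission can arrive. The sketch below assumes the sender keeps transmitting at a steady rate and that the receiver acknowledges every out-of-order segment. The throughput, round-trip time, and segment size values are illustrative.

```python
def approx_dup_acks(throughput_bps, rtt_seconds, mss_bytes=1460):
    """Estimate duplicate ACKs generated by one lost segment.

    Approximates the count as the number of full segments that fit in
    one round trip at the given throughput.
    """
    bytes_in_flight = throughput_bps * rtt_seconds / 8
    return int(bytes_in_flight // mss_bytes)


# Hypothetical 100 Mbps transfer over a LAN and over a long-haul link.
for label, rtt in (("1 ms RTT", 0.001), ("80 ms RTT", 0.080)):
    print(f"{label}: roughly {approx_dup_acks(100e6, rtt)} duplicate ACKs")
```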
If TCP retransmissions and duplicate acknowledgments are detected on a connection, don’t assume that the sky is falling and performance has come to a screeching halt. Depending on the network between endpoints, a small number of them may be normal.
For example, if a service provider is connecting end users to applications in a data center, or if the application is hosted in a cloud environment, there are several connections that are beyond the control and visibility of the network team. End users may perceive performance as normal, but a small number of retransmissions may exist.
However, when troubleshooting an application performance problem with incrementing retransmissions for the very users who are complaining, the underlying culprit is likely packet loss. Or at least, packet loss will be a significant part of the puzzle.
Lost packets require retransmissions, and every retransmission takes time, which slows applications down. Depending on how many packets are lost and how fast the endpoints can recover them, the impact on application performance can be significant.
In these cases, walk the path between client and server, analyzing link-level errors on all infrastructure devices you control. You may discover the faulty cable, the incrementing Frame Check Sequence (FCS) error counter, or the discard counter that points to where the packet loss is occurring.
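When walking the path, it helps to compare two snapshots of the interface counters rather than reading absolute values, because only counters that are still incrementing point to a current problem. The sketch below assumes the counters have already been collected into dictionaries. The interface names and counter names (`fcs_errors`, `input_discards`, `output_discards`) are hypothetical and will differ by vendor and collection method.

```python
def find_suspect_interfaces(before, after):
    """Return interfaces whose error or discard counters increased.

    before, after: dicts mapping interface name -> {counter name: value},
    taken at two points in time.
    """
    suspects = {}
    for name, counters_after in after.items():
        counters_before = before.get(name, {})
        growing = {}
        for counter, value in counters_after.items():
            delta = value - counters_before.get(counter, 0)
            if delta > 0:
                growing[counter] = delta
        if growing:
            suspects[name] = growing
    return suspects


# Hypothetical snapshots taken a few minutes apart.
snapshot_1 = {
    "core-sw1:Te1/1": {"fcs_errors": 0, "input_discards": 0, "output_discards": 120},
    "access-sw3:Gi0/7": {"fcs_errors": 15, "input_discards": 0, "output_discards": 0},
}
snapshot_2 = {
    "core-sw1:Te1/1": {"fcs_errors": 0, "input_discards": 0, "output_discards": 120},
    "access-sw3:Gi0/7": {"fcs_errors": 48, "input_discards": 0, "output_discards": 0},
}
print(find_suspect_interfaces(snapshot_1, snapshot_2))
```

In this made-up data, the core interface has old discards that are no longer growing, while the access port is still accumulating FCS errors and is the better place to start.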
SkyLIGHT PVX Makes This Easy
SkyLIGHT PVX is designed to detect and count retransmissions and duplicate acknowledgments. It can help us to home in on which connections are suffering packet loss and identify whether this is significantly impacting the application or whether these events are occurring during normal performance.
One of the keys in diagnosing packet loss is understanding where (which systems are suffering from packet loss), when (continuously or intermittently), and under which conditions (only for certain services or all of them).
The overview and drill-down capabilities can help you figure this out. Here are a few examples of screens that can help you find out quickly:
- Matrix view of retransmission rates across the network (from segment to segment)
- Top conversations with high retransmission rates (PV)
- Evolution of the retransmission rates in a TCP conversation (PV)
- Evolution of the rate of DupACKs per session in a specific conversation (PV)
In the next article, we will examine the function of TCP Windows and how they impact application performance.