
Performance Tradeoffs of TCP Selective Acknowledgments

Does the SACK optimization also present a DoS opportunity?


Level: Intermediate

Patrick McManus (mcmanus@ducksong.com), Independent Consultant

17 Sep 2007

Selective acknowledgments (SACKs) are an optional feature of TCP that is necessary to effectively use all of the available bandwidth of some networks. While SACKs are good for throughput, processing these options has proven to be CPU intensive for the TCP sender. This weakness can be exploited by a malicious peer even under commodity network conditions. This article contributes experimental measurements that characterize the extent of the problem within the Linux TCP stack, where SACK is enabled by default on most distributions.

Background

Mixed in with the usual Linux development chatter heard on the Internet over the last few months has been a significant discussion of Linux's TCP SACK (Selective Acknowledgment) implementation. These comments were generally concerned with the performance of the TCP stack when processing certain SACK events, and some hinted at the presence of a security exposure.

I was intrigued by the discussion, but it seemed to lack hard data. What specific conditions were they talking about? Is this a minor performance nit, or is it an outright server DoS opportunity?

David Miller: "This is a problem that basically every single TCP stack out there right now is vulnerable to, lots of CPU processing for invalid or maliciously created SACK blocks."
Ilpo Jarvinen: "However, as long as there is depency[sic] to skb's fack_count in sacktag, there seems to be some room for cpu processing attacks even with RB-tree case because walk becomes necessary on the slow path."
NC State Univ: "We show the efficiency of SACK processing in this experiment. As we have seen a lot of cases that TCP didn't work well with large window drop especially with large buffer."
CHC IT: "And finally a warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to located [sic] the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet."

This article will look at the SACK implementation and its performance under non-ideal conditions, starting with Linux 2.6.22, which is currently used as a common distribution kernel on Ubuntu 7.10. That kernel is several months old now, and since then the developers haven't just been talking about the issue - they have been writing code for it. The current development kernel, 2.6.25, contains a series of patches from Ilpo Jarvinen that deal with SACK performance. The article finishes by looking at how that code may change things, and takes a quick look at further changes still under consideration down the road.





A Quick SACK Primer

SACK is defined by RFCs 2018, 2883, and 3517 (See Resources). Plain TCP (i.e. non-SACK) acknowledgments are strictly cumulative - an acknowledgment of N means that byte N and all previous bytes have been received. The problem SACK is meant to address is the "all or nothing" nature of the plain cumulative acknowledgment.

For instance, if packet #2 in the sequence 0 to 9 were to be the only loss during a transfer then the receiver could only issue a plain ACK for #1 because that is the highest packet it has received without a gap. However, a SACK receiver could instead issue an ACK for #1 plus a SACK option for packets 3 through 9. This extra information helps the sender determine that the losses are fairly minimal and it only needs to retransmit a little bit of data. Without this extra information it would need to retransmit much more data and slow down its sending rate to accommodate what looks like a high loss network.
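To make the example concrete, here is a small Python sketch (my own illustration, not code from any TCP stack) that computes what a plain cumulative ACK can report versus what SACK blocks add when packet #2 is the only loss:

```python
# Toy model: compare a plain cumulative ACK with an ACK carrying SACK
# blocks when packet #2 is the only loss in the sequence 0..9.

def cumulative_ack(received):
    """Highest packet N such that 0..N all arrived contiguously."""
    n = -1
    while n + 1 in received:
        n += 1
    return n

def sack_blocks(received, cum):
    """Contiguous runs of received packets above the cumulative ACK."""
    blocks = []
    for p in sorted(p for p in received if p > cum + 1):
        if blocks and blocks[-1][1] == p - 1:
            blocks[-1][1] = p            # extend the current run
        else:
            blocks.append([p, p])        # start a new run
    return [tuple(b) for b in blocks]

received = set(range(10)) - {2}          # packet #2 was lost
cum = cumulative_ack(received)
print(cum)                               # 1: all a plain ACK can say
print(sack_blocks(received, cum))        # [(3, 9)]: the extra SACK info
```

With only the cumulative ACK of 1, the sender must assume everything after packet #1 may be lost; the SACK block (3, 9) tells it that only packet #2 actually needs retransmission.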

SACK is especially important to effectively utilize all available bandwidth on high delay connections. Because of the high delay there are often large numbers of packets "in flight" waiting for acknowledgment at any given time. In Linux these packets sit in the retransmission queue until they have been acknowledged and then are no longer needed. The packets are in sequence number order but are not indexed in any way. When a received SACK option needs to be processed, the TCP stack must find the packet(s) in the retransmission queue that the SACK applies to. The larger the retransmission queue, the harder it is to find the data of interest.

Up to four SACK options may be present in each packet.
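The cost of that unindexed queue can be sketched in a few lines of Python. The packet layout and names are illustrative assumptions; the point is simply that an ordered but unindexed list forces a head-to-tail walk:

```python
# Sketch of the data-structure problem: the retransmission queue is an
# ordered, unindexed list, so locating the packet a SACK block refers to
# is a linear walk from the head.

class Pkt:
    def __init__(self, seq, length):
        self.seq, self.end = seq, seq + length

def find_sacked(queue, sack_start):
    """Walk the queue head-to-tail until the SACKed byte is found."""
    scanned = 0
    for pkt in queue:
        scanned += 1
        if pkt.seq <= sack_start < pkt.end:
            return pkt, scanned
    return None, scanned

queue = [Pkt(i * 1448, 1448) for i in range(12000)]
# A SACK referring to the newest data forces a walk of the whole queue:
pkt, scanned = find_sacked(queue, 11999 * 1448)
print(scanned)   # 12000 packets examined for one option
```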





The Attack Scenario

The general exposure stems from the fact that the SACK option receiver can be made to do an arbitrary amount of work based on the reception of a single packet. This N:1 ratio allows SACK senders to overwhelm receivers on much more powerful platforms.

The specific attack scenario for a Linux SACK processor requires a retransmission queue that already holds a large number of packets. The attacker then sends a packet full of SACK options designed to force the other host to scan that entire queue to process each option. An option that refers to a packet at the end of the queue may require a full list traversal by the receiving TCP stack in order to identify which packet the option refers to. An attacking client can obviously send any SACK option it likes; less obviously, it can also easily control the size of the peer's retransmission queue. Because it controls both the options and the queue size, the creator of the SACK option is able to establish the amount of work the option receiver will need to perform for each option received.

The number of packets in the retransmission queue is fundamentally driven by the bandwidth delay product (BDP) between the two hosts. The bandwidth of a path is capped by the physical properties of the network - an attacker cannot easily increase bandwidth without acquiring more infrastructure. However, she can arbitrarily add delay by stalling each acknowledgment for a little while before transmitting it. In order to effectively utilize the bandwidth of this high delay connection, the server needs to have enough packets in flight that their aggregate transmission time equals the time it takes to get the delayed acknowledgment back. If it does not do this, there are periods where the network is not moving any packets and its bandwidth is therefore not fully utilized. High delay paths require lots of packets in flight to use the network efficiently, and TCP senders aggressively try to fill this window within the limits of congestion control standards.

Delaying acknowledgments effectively increases the size of the retransmission queue, which is a prerequisite to the attack. Filling a relatively slow 10 MByte/sec connection with 1750 ms of delay generates a window of over 12,000 packets. Faster connections result in even bigger windows, but part of the vulnerability stems from the fact that this approach is viable on standard home broadband connections just by introducing large delays.
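The 12,000-packet figure can be checked with simple arithmetic. Assuming a typical 1448-byte TCP payload per packet:

```python
# Back-of-the-envelope check of the window size quoted above: a
# 10 MByte/sec path with 1750 ms of attacker-added delay must keep
# bandwidth * delay bytes in flight to stay busy.

bandwidth = 10 * 1000 * 1000   # bytes/sec
delay = 1.750                  # seconds of round-trip delay
mss = 1448                     # typical TCP payload bytes per packet

bdp_bytes = bandwidth * delay
packets_in_flight = bdp_bytes / mss
print(int(packets_in_flight))  # 12085 -- "over 12,000 packets"
```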

The client selects the value of its SACK options whenever transmitting a delayed ACK. It adds SACK options referring to the data in the most recently arrived packets (i.e., those with the highest sequence numbers seen). This data is just starting to have its own ACK delayed, but it can be SACKed now. This ensures maximum distance in the retransmission queue between the packets being ACKed and SACKed.

This article will focus on this particular attack scenario. It is known as a "find-first" attack because it forces the TCP stack to spend excessive amounts of time finding the first byte referenced by the SACK option.





Measuring the Attack Scenario

Kode Vicious: "Take a Freaking Measurement!"

With the background under our belts, it is time for the rubber to meet the road and find out whether this is a serious issue or not. Is this an area for micro-optimization, a full-fledged crisis, or something in between? To figure that out we need to roll up our sleeves and characterize the extent of the issue using real data. Task number one is to decide what data to gather and how to measure it.

Experiment Setup

The most important item of interest to collect is CPU utilization on the server. OProfile can do the heavy lifting for this using the standard CLK_UNHALTED counter. It is also interesting to count the number of packets scanned while processing the SACKs, and the average size of the retransmission window. I instrumented the server source code to obtain the packets-scanned counter, and reran the tests without this instrumentation to make sure I was getting the same results a standard server would see.

The size of the in-flight window is also interesting data. The average size of the retransmit queue can be calculated from the packets-scanned counter if the number of SACK options sent is also tracked at the client. For the non-SACK test cases I used the typical length of the delayed ACK queue reported by the client as an approximation of the lower bound of the sender's window size.
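As a sanity check on the numbers reported later, dividing the packets-scanned counter by the number of ACKs processed reproduces the average retransmission queue length column of the results tables. This assumes roughly one full find-first walk per ACK; the exact bookkeeping lives in the instrumented server, which is not shown here:

```python
# Rough cross-check of the measurements: packets scanned divided by ACKs
# processed approximates the average retransmission queue length that was
# walked per acknowledgment. Figures are taken from the results tables.

runs = {
    "2.6.22 with SACK": (755_368_500, 534_202),
    "pre-2.6.25 with SACK": (2_768_229_472, 530_879),
}
for name, (scanned, acks) in runs.items():
    print(name, scanned // acks)
# 2.6.22 with SACK 1414
# pre-2.6.25 with SACK 5214
```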

All of the measurements were taken on a standard Linux server. A custom client was written to drive the experiments and trigger the code paths of interest on the server. The client implemented its own TCP stack and ran on top of the raw socket kernel API. The client was certainly not a complete TCP stack, but it was enough to test the variables of interest.

The client starts the experiment by connecting to the server and sending a simple HTTP request for a 700MB ISO file. It then consumes all the data sent by the server over a normal 100 Mbit/sec network in response. Each packet from the server is acknowledged after a 1750ms delay. The server strives to fill that whole 1750ms window by gradually sending more and more packets at one time before it sees them acknowledged. I observed windows in flight of over 14,000 packets.

The client has a configurable option to add the SACK information relating to the most recently arrived data onto each ACK as it is transmitted. If this option is enabled, the generated acknowledgment contains up to four SACK options. Each option refers to a random 1-byte range of one of the four most recently arrived packets, which are currently having their acknowledgments delayed. If fewer than four packets are waiting, a corresponding number of options are generated.
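The wire format of such an acknowledgment follows RFC 2018: option kind 5, a length byte, and up to four pairs of 32-bit sequence-number edges. The sketch below shows how such an option could be built; the custom client itself is not published, so the helper name and sequence numbers are illustrative assumptions:

```python
import struct

# Hedged sketch of encoding a malicious SACK option: TCP option kind 5
# carrying up to four (left edge, right edge) 32-bit sequence pairs, each
# covering a 1-byte range inside one of the newest packets (RFC 2018).

def build_sack_option(ranges):
    """ranges: list of up to four (left, right) sequence-number pairs."""
    assert 1 <= len(ranges) <= 4
    length = 2 + 8 * len(ranges)
    opt = struct.pack("!BB", 5, length)          # kind=5, total length
    for left, right in ranges:
        opt += struct.pack("!II", left, right)   # 32-bit sequence edges
    return opt

# Four 1-byte ranges taken from the newest packets in the window:
newest = [17_000_000 + i * 1448 for i in range(4)]
opt = build_sack_option([(seq, seq + 1) for seq in newest])
print(len(opt))   # 34 bytes: 2 header + 4 * 8, within TCP's option space
```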

In order to allow meaningful comparisons, this data was collected from three different vantage points.

  1. Baseline - In the first test I sought to obtain a baseline measurement. Instead of using a custom client and TCP stack, I used the standard Linux TCP stack and the wget command line HTTP client.
  2. Custom but no-SACK - The second data point needed to use the custom large window inducing client, but it did not make use of the SACK option at all. This data point allows the separation of basic effects caused by large windows from those caused by handling malicious SACK options.
  3. Custom with SACK - The last data set was gathered using the large window inducing client while also including 4 SACK options on every ACK it sent.

The server used was an older 1.2 GHz Athlon XP machine.

The Measurements


Server Utilization Measurements

Method             | ACKs Processed | Packets Scanned for SACK Processing | Elapsed Time | CPU Utilization | Sampled Ticks per ACK | Sampled Ticks per KB Transferred | Average Retransmission Queue Length
Baseline           | 252,955        | -                                   | 01:02        | 22%             | 1.72                  | 0.5                              | 65
Custom but no-SACK | 498,275        | -                                   | 02:59        | 9%              | 1.47                  | 1.03                             | 7,000 - 10,000
Custom with SACK   | 534,202        | 755,368,500                         | 12:47        | 33%             | 10.87                 | 8.13                             | 1,414

Misleading Data

A few initially surprising data points are easily explained.

Note that the number of ACKs is twice as high for the custom client as compared to the baseline wget. This is because the custom client is designed to aggravate server behavior, so it does not coalesce acknowledgments. Most TCP stacks coalesce two acknowledgments into one if they are sent back to back without any data. The custom client does have an option to do that, but it was not turned on for this data. A run was made with the option engaged, and it turned out not to materially impact the results - more on why later.

The alert reader may also notice that the number of sampled ticks per KB transferred is 16 times as great for the code that triggers the SACK path as compared with the baseline number, but the CPU utilization differential is a more modest 1.5. The key additional piece of data is the elapsed time of the test. The baseline used 22% of the CPU for a little over a minute, where the SACK client used 33% of the CPU over almost 13 minutes. At the end of the day, that is a lot more cycles used by the SACK client to achieve the same amount of data transfer.

At first glance, these measurements do not appear to be such a bad story after all. While CPU use is up during the attack to 33%, it is not completely exhausted. Exhausting the CPU would prevent other work from getting done and constitute a denial of service.

Unfortunately, digging just a touch deeper raises some concerns. The total transfer time went way up - from just 1 minute at the baseline to almost 13 minutes under the full attack scenario. Additionally, the increased CPU utilization rate is sustained over that full 13 minutes, as compared to the original one minute. In aggregate that is a lot more cycles expended in order to accomplish the same goal. You can see this easily by comparing the sampled ticks per KB transferred number between the three data points.

Digging even deeper shows the 33% CPU utilization number to be misleading. The 13 minutes consist of a repeating cycle of bursts several seconds in duration where the entire server is occupied at 100% utilization. The bursts are followed by lulls in CPU use and then the cycle repeats again. The overall result is an average 33% utilization, but there are prolonged periods where the CPU is completely monopolized by TCP processing triggered by a remote host.

Consider these graphs of CPU utilization vs time for all three scenarios:


Figure: Baseline using wget and no SACK (graph showing low CPU usage)

The baseline measurement is nice and efficient. The transfer completes quickly and fills the available bandwidth without using more than 25% of the CPU at any given time. This proves this modest server is more than capable of filling the network under the right circumstances.


Figure: Large window custom client with no SACK involved (graph showing a little CPU)

The custom client, when not using SACK, also generates a reasonable utilization graph. The added complication of managing the large windows and their inevitable losses leads to some extra CPU being used, but at no time is the server incapable of keeping up.


Figure: Large window custom client with SACK (graph showing bursty CPU)

You cannot help but be shocked at how much extra blue ink is spilled on this plot of the SACK-enabled download. While overall utilization averages 33%, the graph clearly illustrates recurring bursts where the server is fully swamped.

The graph looks to be repeatedly plotting y=x^2 until y=100%. This is the manifestation of each SACK option needing to scan the entire retransmission queue in order to find the data it refers to. While the sender's congestion window is growing, the number of packets in flight effectively doubles during the time it takes to send a packet and receive the acknowledgment for it back. This ever-larger queue must be inspected for each SACK option received - note the astonishing 755 million packet comparisons done in SACK processing for just half a million ACKs received. This algorithm generates the quadratic behavior seen on the plots.
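A toy model makes the shape of the curve plausible. If the window doubles each round trip (slow start) and every ACK's SACK options force a scan of the whole queue, then per-round work grows with the square of the window. This is a deliberate simplification of the real dynamics, not a simulation of the Linux stack:

```python
# Toy model of the burst pattern: per round trip, `window` ACKs arrive
# and each one scans a queue of `window` packets, so the scan work per
# round is window^2 until a timeout resets the window.

def scan_work_per_round(max_window):
    window, rounds = 1, []
    while window <= max_window:
        rounds.append(window * window)   # ACKs per round * scan length
        window *= 2                      # slow-start doubling
    return rounds

work = scan_work_per_round(4096)
ratios = [b // a for a, b in zip(work, work[1:])]
print(ratios)   # each round costs 4x the previous one
```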

What About A Bigger Server?

An initial concern about the experiment was that it used a modest 1.2 GHz Athlon as the server testbed, which would seem to underestimate the capacity of the server side.

However, the quadratic behavior seen here is the kind of trend that cannot be helped by testing a faster server or implementing optimizations such as coalescing 2 ACKs into 1. Such changes would allow the congestion window to grow beyond the modest 3,200-packet limit seen during these measurements, but the hockey-stick growth of CPU utilization on the server would surely still max out before the window opened all the way to the 10,000 or more packets necessary to fill this modest Fast Ethernet network.

The final mystery is why this is not totally catastrophic for the server. It would be reasonable to expect the utilization to ramp to 100% and then stay clamped at that level through the rest of the transfer. Instead we see a series of fits and starts. It is not as if the retransmission queue is getting shorter. Or is it?

In practice the retransmission queue does periodically shrink all the way back to zero packets. From that point the process begins anew, congestion windows again expand, and the SACK overhead picks up again. The whole process takes a number of seconds to reach a size where it impacts the server CPU usage again. That explains the cyclic nature of the graph.

The queue appears to shrink because the intense processing burden at peak causes a TCP timeout based recovery to occur. The stack reacts to the timeout by resetting the congestion window back to an empty state. This renewed small window hinders transfer speed, but it also keeps the server alive and able to attend to some other work while the window grows into dangerous territory again.





Kernel Developments

What Is Going On In Development?

The Linux networking team is already working on this code. On Nov 15, Ilpo Jarvinen posted a significant rework of the SACK processing path, and that code was placed into Linus's pre-2.6.25 tree during the merge window on January 28th. The whole series was 10 patches; this article will focus on three key patches of interest.

Generally, the patches are focused on improving SACK performance in the general case. They may help somewhat with attack scenarios, but they are not thought to be an antidote to the scenarios measured in this article.

Ilpo Jarvinen - "I think that we cannot prevent malicious guys from inventing ways to circumvent the tweak and optimization that may exists for the typical, legitimate use cases."

The first change is to optimize the caching strategy in the general case where the SACK option contains only information for data at higher sequence numbers than have been previously SACKed. Generally speaking this means that an already reported hole remains in the outstanding window, but new data is being SACKed at the more recent end of the window. This is the common case for normal operation. The patch optimizes this case by converting the cached reference from a sequence number to a pointer to the highest packet in the queue that has been SACKed in the past. Using this information, a later patch processes SACKs that only address data higher than this cached value by using the cached pointer as the starting point for the list traversal and in turn save most of the work of walking the list.
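The effect of the cached pointer can be sketched as follows. This is my own simplification, not the kernel code: remember where the highest SACKed packet sits in the queue, and start the walk there whenever a new SACK block lies entirely above it:

```python
# Simplified model of the caching optimization: a cached index into the
# queue (the highest packet SACKed so far) lets a SACK for newer data
# skip re-walking everything below it.

def find_from_cache(queue, cache_idx, sack_start):
    """queue: list of packet start sequence numbers, ascending."""
    start = 0
    if cache_idx is not None and queue[cache_idx] <= sack_start:
        start = cache_idx              # skip the already-covered prefix
    scanned = 0
    for i in range(start, len(queue)):
        scanned += 1
        if queue[i] == sack_start:
            return i, scanned
    return None, scanned

queue = [i * 1448 for i in range(10_000)]
# First SACK for packet 9,000: a full walk, then its position is cached.
idx, first = find_from_cache(queue, None, 9_000 * 1448)
# The next SACK covers newer data still: the cache saves the re-walk.
_, second = find_from_cache(queue, idx, 9_500 * 1448)
print(first, second)   # 9001 501
```

The malicious client defeats this exactly as the next paragraph describes: by also referencing data below the cached point, it forces the walk back to the start of the queue.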

Unfortunately, the malicious test client cannot be optimized by this strategy. A typical ACK from this client does contain a SACK option for data higher than has been seen before but it also contains sequence references to the immediately prior packets as well. The 2.6.25 implementation will need to walk the retransmission queue from the start in order to find this data.

The second change reorganizes the code to support a different "skipping" traversal algorithm of the queue in future patches. While this does not immediately help the test results discussed here, as skipping is still implemented with the same linear walk, the future changes it supports will have a substantial impact on attack scenarios. The comments included with the patches describe two major future changes in the works.

The first potential future change would modify the unacknowledged packet list so that it is indexed as a red-black tree instead of remaining a purely linear list. This will allow log(N) lookup of packets referenced by SACK options. A change that introduces some sort of random access index is critical to addressing the find-first attacks against the TCP stack, which may access random elements of the large retransmission queue.
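A sorted array searched with bisection illustrates the same O(log N) idea; the kernel would of course use an rbtree over its skb list rather than anything like this Python stand-in:

```python
import bisect

# Stand-in for the proposed red-black tree index: binary search over the
# sorted packet start sequences replaces the head-to-tail walk.

def find_indexed(starts, sack_start):
    """Return the index of the packet containing sack_start."""
    i = bisect.bisect_right(starts, sack_start) - 1
    return i if i >= 0 else None

starts = [i * 1448 for i in range(12_000)]
print(find_indexed(starts, 11_999 * 1448))   # 11999, found in ~14 steps
```

With a 12,000-packet queue this is roughly a 14-comparison lookup instead of a 12,000-packet scan, which is exactly the property that blunts the find-first attack.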

The other organizational change addresses a concern not yet explicitly raised in this article. The index structure will give good performance looking up individual packets, but a SACK option can cover arbitrary byte ranges that include multiple packets. There is nothing to stop a malicious client from sending options that cover almost all of the data in the window. This is different from the find-first attack this article has focused on; indeed, the first packet referenced may be the very first packet in the list and thus very easy to find. However, finding the packets of interest quickly is of small solace if the whole queue needs to be walked linearly to process the SACK option. The code change would reorganize the current list into two lists, one containing data that has been SACKed and one with data that has not. This can help considerably by reducing the search space to just the data not previously SACKed. There are some complications regarding a related specification called DSACK (duplicate SACK), but partitioning is the direction being considered.
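A sketch of the proposed partitioning, again a simplification with invented names: once data has been SACKed it moves to a second list, so later options only walk what still needs attention:

```python
# Simplified model of the SACKed / not-yet-SACKed partition: packets
# covered by a SACK range move to a separate list, shrinking the search
# space for every subsequent option.

class Queues:
    def __init__(self, seqs):
        self.unsacked = list(seqs)   # still needs SACK processing
        self.sacked = []             # already accounted for

    def apply_sack(self, left, right):
        scanned, keep = 0, []
        for seq in self.unsacked:
            scanned += 1
            if left <= seq < right:
                self.sacked.append(seq)   # move to the SACKed list
            else:
                keep.append(seq)
        self.unsacked = keep
        return scanned

q = Queues(range(10_000))
first = q.apply_sack(0, 9_000)    # a huge range scans all 10,000 packets
second = q.apply_sack(0, 9_500)   # the repeat only scans what is left
print(first, second)              # 10000 1000
```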

The last patch of interest is a change in congestion control semantics in order to take advantage of the SACK rules in RFC 3517. These changes allow the kernel to avoid full timeout based recovery scenarios under more circumstances. These timeout based recoveries require shrinking the sending window all the way down and slowly re-growing it over time back to the level supported by the current bandwidth delay product. The recovery time is responsible for the pauses between activity bursts during the test.

Measuring 2.6.25-pre

With this new code in hand, it was time to re-measure the data point that used the custom delay-inducing client with SACK options turned on, this time against the 2.6.25 development code. The earlier three data points are included in the table for easy reference.


Server Utilization Measurements

Method                              | ACKs Processed | Packets Scanned for SACK Processing | Elapsed Time | CPU Utilization | Sampled Ticks per ACK | Sampled Ticks per KB Transferred | Average Retransmission Queue Length
Baseline                            | 252,955        | -                                   | 01:02        | 22%             | 1.72                  | 0.5                              | 65
Custom but no-SACK                  | 498,275        | -                                   | 02:59        | 9%              | 1.47                  | 1.03                             | 7,000 - 10,000
Custom with SACK                    | 534,202        | 755,368,500                         | 12:47        | 33%             | 10.87                 | 8.13                             | 1,414
Custom with SACK against pre-2.6.25 | 530,879        | 2,768,229,472                       | 10:42        | 49%             | 13.6                  | 10.07                            | 5,214

This is the graph of CPU usage over time using the custom large window client with malicious SACK options against a pre-2.6.25 development kernel:


Figure: Large window custom client with SACK against kernel pre-2.6.25 (graph showing lots of CPU)

The CPU graph, which had previously consisted of cyclical fits and starts, is now more consistently saturated over time. Even though the kernel code has been made meaningfully more efficient, the test uses more cycles to accomplish the same file transfer. This is a counter-intuitive result.

The newer code does complete more quickly, but by every measure it also severely monopolizes the CPU for long periods of time and uses more overall CPU time. The missing link to which both of these things can be attributed is the reduction in TCP timeout-based recoveries on the newer kernel because of the RFC 3517 related changes. The 2.6.22 code averaged 17 timeouts for each run of the test client; the 2.6.25 code averaged just 2. On the graph this shows up dramatically as much less idle ramp-up time between timeout events, resulting in less down time.

Fewer timeouts means the sender maintained a larger window on average. Large windows are necessary for decent throughput on high latency links. This development version of the TCP stack had a significant advantage in transfer speed and finished 2 minutes quicker than the deployed stack largely because it was able to keep wider windows open.

However, those larger average windows also mean the SACK receiving code needs to do a lot more work for every packet received, as the queue to be scanned contains more packets. The 2.7 billion packets scanned during the file transfer (almost 4 times as many as on the previous kernel version) and the 10.07 sampled ticks per KB transferred are testament to exactly how much work had to be done.

Faster processors do not help much in this situation either. They would be able to scan larger packet chains in the same amount of time, but that in turn would simply expand the window a little bit creating yet more work for each new option to be processed. A lot more processing cycles would be expended in order to process the same number of SACK options - the faster processor just creates more overhead for itself without getting any more real work done.





Conclusion

The performance implications of maliciously crafted SACK options can be very significant, but they do not rise to the level of a trivially exercisable DoS attack. The saving grace is the self-regulating nature of the occasional timeouts, but it is not hard to imagine a different client that could pace itself at a rate which occupied the server but did not overwhelm it to the point of timeout.

Computers that do not send large blocks of data do not need to be concerned, as they can never fill the large windows that are a prerequisite for the exploit. While Selective Acknowledgments are critical for good performance on high bandwidth delay product network links, they are still an optional feature which can be disabled without sacrificing interoperability. The sysctl variable net.ipv4.tcp_sack can be set to 0 to disable SACK in the TCP stack.

Good work is going on for the general SACK handling case in the current Linux kernel development tree. This has laid the groundwork for future developments, such as packet list indices and partitions, which will help with some of the attack vectors down the road.



Resources



About the author


Patrick is an experienced software engineer and entrepreneur with a track record in multiple startup and emerging technology ventures. He specializes in close-to-the-metal, high-bandwidth networked services.



