A perfSONAR User's Group

Per-Packet Provenance, Part 1: Managing Burst Size To Sidestep Queue Drops

[I’m reporting on work done with the guidance of Darrell Newcomb, and the assistance of John Hess (John guided me too, but that was on other stuff). I should probably also acknowledge the others who have no doubt told me about this in a presentation somewhere, or elsewhere, and I forgot about it.]

[update: although the invocations of tc cited here do the trick, I should point out that at the time I did this, I was misapprehending the meaning of “burst” in the scheduler spec. I need to circle back and clean this up…]

Sometimes a path just doesn’t perform as it should. Usually this comes to light on Friday afternoon, just after anyone who could assist has left the office. Sometimes you can’t wait until Monday. If performance over a long path is cripplingly bad because of a shallow queue somewhere along the way, and you only have access to the sending end, what can you do to band-aid the situation until Monday, or until the duly appointed maintenance window?

What follows are a couple of real-life examples of how to make a network path with a shallow queue behave better, when you don’t have access to the box with the shallow queue.

These are real results from troubleshooting inquiries. All hostnames have been replaced with the names of dogs from comic strips.

Real Life Example #1:

Across a 49 ms RTT, 10 Gbps path, which includes a long hop under an ocean, the following was observed:

[root@odie whinery]# bwctl -c snoopy
 bwctl: Using tool: iperf3
 bwctl: 15 seconds until test results available

 SENDER START
 Connecting to host snoopy, port 5795
 [ 15] local odie port 46902 connected to snoopy port 5795
 [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
 [ 15]   0.00-1.00   sec  6.14 MBytes  51.5 Mbits/sec    2    323 KBytes
 [ 15]   1.00-2.00   sec  6.57 MBytes  55.1 Mbits/sec    1    201 KBytes
 [ 15]   2.00-3.00   sec  5.17 MBytes  43.4 Mbits/sec    0    350 KBytes
 [ 15]   3.00-4.00   sec  11.1 MBytes  92.9 Mbits/sec    0    891 KBytes
 [ 15]   4.00-5.00   sec  23.9 MBytes   200 Mbits/sec    4    708 KBytes
 [ 15]   5.00-6.00   sec  15.5 MBytes   130 Mbits/sec    0    883 KBytes
 [ 15]   6.00-7.00   sec  9.18 MBytes  77.0 Mbits/sec    1    516 KBytes
 [ 15]   7.00-8.00   sec  12.7 MBytes   107 Mbits/sec    0    883 KBytes
 [ 15]   8.00-9.00   sec  27.6 MBytes   231 Mbits/sec    0   1.97 MBytes
 [ 15]   9.00-10.00  sec  20.2 MBytes   169 Mbits/sec    2    446 KBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bandwidth       Retr
 [ 15]   0.00-10.00  sec   138 MBytes   116 Mbits/sec   10             sender
 [ 15]   0.00-10.00  sec   134 MBytes   113 Mbits/sec                  receiver

 iperf Done.

 SENDER END

Wow. For an end-to-end 10 GbE path, that’s horrible. What’s going on is that there is a shallow queue in the middle with a small decrease in “wire speed”, stepping down from 10.000 Gbit/s to 9.953 Gbit/s (a LAN-PHY to WAN-PHY change). Using nuttcp’s burst mode, the queue depth at the PHY change was measured at about 45 packets. Since the queue is only 45 packets deep, a burst only needs to exceed the drain rate for about 325 microseconds before the queue overflows. So the problem isn’t visible as a high-rate flow, or as a peak on an MRTG graph. Since the more-than-325-microsecond bursts that overflow the queue are surrounded, temporally, by silence (there are gaps between the bursts), the sender can be disciplined to spread packets out and avoid those bursts.
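As a rough cross-check on the 325 microsecond figure, here is a sketch that assumes it is simply the time needed to put 45 back-to-back packets on a 10 Gbit/s wire. The 9000-byte frame size is an assumption on my part; 8234 bytes is the nuttcp default payload plus headers mentioned in the comments below. The function name is made up for illustration.

```shell
#!/bin/sh
# Time, in whole microseconds, to put a back-to-back burst on the wire.
# bits / (Mbit/s) = microseconds, so integer shell arithmetic is enough.
burst_wire_time_us() {
    packets=$1      # number of packets in the burst
    frame_bytes=$2  # bytes per packet on the wire (assumed value)
    rate_mbit=$3    # link rate in Mbit/s
    echo $(( packets * frame_bytes * 8 / rate_mbit ))
}

burst_wire_time_us 45 9000 10000   # 45 jumbo frames at 10 Gbit/s
burst_wire_time_us 45 8234 10000   # 45 nuttcp default-size packets (8192 + 42)
```

With 9000-byte frames this comes out at 324 microseconds, within rounding of the number quoted above; with the nuttcp default size it is closer to 296.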

Enter Linux Traffic Control and the Hierarchical Token Bucket queuing discipline, which allows us to pace packets out of our local interface and limit the burst size as well.

After a bit of fiddling, a sweet spot was found, by limiting rate and burst size:

[root@odie whinery]# /sbin/tc qdisc del dev eth0 root
[root@odie whinery]# /sbin/tc qdisc add dev eth0 handle 1: root htb
[root@odie whinery]# /sbin/tc class add dev eth0 parent 1: classid 1:1 \
                          htb rate 3000mbit ceil 4000mbit burst 250k
[root@odie whinery]# /sbin/tc filter add dev eth0 parent 1: protocol ip \
                          prio 1 u32 match ip dst <snoopy's IP>/32 flowid 1:1

And voila:

[root@odie whinery]# bwctl -c snoopy

 bwctl: Using tool: iperf3
 bwctl: 54 seconds until test results available

 SENDER START
 Connecting to host snoopy, port 5783
 [ 15] local odie port 50396 connected to snoopy port 5783
 [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
 [ 15]   0.00-1.00   sec  20.9 MBytes   175 Mbits/sec    0   3.72 MBytes
 [ 15]   1.00-2.00   sec   235 MBytes  1971 Mbits/sec    0   19.6 MBytes
 [ 15]   2.00-3.00   sec   368 MBytes  3083 Mbits/sec    0   20.3 MBytes
 [ 15]   3.00-4.00   sec   389 MBytes  3261 Mbits/sec    0   20.4 MBytes
 [ 15]   4.00-5.00   sec   371 MBytes  3114 Mbits/sec    0   20.7 MBytes
 [ 15]   5.00-6.00   sec   379 MBytes  3177 Mbits/sec    0   20.9 MBytes
 [ 15]   6.00-7.00   sec   389 MBytes  3261 Mbits/sec    0   20.9 MBytes
 [ 15]   7.00-8.00   sec   391 MBytes  3282 Mbits/sec    0   20.9 MBytes
 [ 15]   8.00-9.00   sec   391 MBytes  3282 Mbits/sec    0   20.9 MBytes
 [ 15]   9.00-10.00  sec   385 MBytes  3230 Mbits/sec    0   21.1 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bandwidth       Retr
 [ 15]   0.00-10.00  sec  3.24 GBytes  2784 Mbits/sec    0             sender
 [ 15]   0.00-10.00  sec  3.23 GBytes  2773 Mbits/sec                  receiver

 iperf Done.

 SENDER END

Which isn’t 10 Gbits/sec, but it is a 24-fold increase in transferred data volume, using the same shallow queue.
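Since finding the sweet spot took some fiddling, it can help to wrap the four tc invocations used above in a function so that rate, ceil and burst are easy to vary. This is only a sketch: the function name and its arguments are made up here, 192.0.2.10 is a documentation address standing in for snoopy's IP, and the function prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not run) the tc commands that pace traffic to one destination.
# All five arguments are placeholders supplied by the caller.
htb_pace_cmds() {
    dev=$1; dst=$2; rate=$3; ceil=$4; burst=$5
    # Remove any existing root qdisc; tc complains if there is none to remove.
    echo "tc qdisc del dev $dev root"
    # HTB root with no default class, so unmatched traffic is not shaped.
    echo "tc qdisc add dev $dev handle 1: root htb"
    echo "tc class add dev $dev parent 1: classid 1:1 htb rate $rate ceil $ceil burst $burst"
    # Steer only packets addressed to the far-end host into the paced class.
    echo "tc filter add dev $dev parent 1: protocol ip prio 1 u32 match ip dst $dst/32 flowid 1:1"
}

# 192.0.2.10 is a stand-in for snoopy's IP.
htb_pace_cmds eth0 192.0.2.10 3000mbit 4000mbit 250k
```

Piping the output to sh as root applies the settings. Running "tc -s class show dev eth0" afterwards shows the class counters, and "tc qdisc del dev eth0 root" takes the band-aid off again on Monday.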

Real-Life Example #2:

A 9-hop, 120 ms RTT path from a 10 Gbits/sec sender to a 1 Gbit/sec receiver. All intermediate hops are connected at 10 Gbits/sec.

 

With no burst control:

#bwctl -c daisy

bwctl: Using tool: iperf3
bwctl: 17 seconds until test results available

SENDER START
Connecting to host daisy, port 5417
[ 15] local ruff port 52915 connected to daisy port 5417
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  1.17 MBytes  9.78 Mbits/sec    0    305 KBytes
[ 15]   1.00-2.00   sec  8.24 MBytes  69.1 Mbits/sec   86    542 KBytes
[ 15]   2.00-3.00   sec  2.89 MBytes  24.2 Mbits/sec   80    386 KBytes
[ 15]   3.00-4.00   sec  3.14 MBytes  26.3 Mbits/sec    0    433 KBytes
[ 15]   4.00-5.00   sec  4.54 MBytes  38.1 Mbits/sec    0    628 KBytes
[ 15]   5.00-6.00   sec  6.00 MBytes  50.4 Mbits/sec    0    932 KBytes
[ 15]   6.00-7.00   sec  8.75 MBytes  73.4 Mbits/sec    0   1.34 MBytes
[ 15]   7.00-8.00   sec  14.9 MBytes   125 Mbits/sec    0   1.97 MBytes
[ 15]   8.00-9.00   sec  21.2 MBytes   178 Mbits/sec    0   2.78 MBytes
[ 15]   9.00-10.00  sec  27.5 MBytes   231 Mbits/sec    0   3.73 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec  98.4 MBytes  82.5 Mbits/sec  166             sender
[ 15]   0.00-10.00  sec  93.1 MBytes  78.1 Mbits/sec                  receiver

iperf Done.

SENDER END

Add a pinch of encouragement:

#/sbin/tc qdisc del dev eth0 root
#/sbin/tc qdisc add dev eth0 handle 1: root htb
#/sbin/tc class add dev eth0 parent 1: classid 1:1 htb rate 500mbit ceil 550mbit burst 32k
#/sbin/tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
           match ip dst <daisy's IP>/32 flowid 1:1

The result is pretty good at 500 Mbits/sec with a 32k burst (about a 21-packet burst limit with MTU=1500).

#bwctl -c daisy

bwctl: Using tool: iperf3
bwctl: 46 seconds until test results available

SENDER START
Connecting to host daisy, port 5419
[ 15] local ruff port 57295 connected to daisy port 5419
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  1.30 MBytes  10.9 Mbits/sec    0    338 KBytes
[ 15]   1.00-2.00   sec  14.4 MBytes   121 Mbits/sec    0   2.72 MBytes
[ 15]   2.00-3.00   sec  45.0 MBytes   377 Mbits/sec    0   7.51 MBytes
[ 15]   3.00-4.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   4.00-5.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   5.00-6.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   6.00-7.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   7.00-8.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   8.00-9.00   sec  60.0 MBytes   503 Mbits/sec    0   7.99 MBytes
[ 15]   9.00-10.00  sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec   488 MBytes   410 Mbits/sec    0             sender
[ 15]   0.00-10.00  sec   480 MBytes   402 Mbits/sec                  receiver

iperf Done.

SENDER END
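The "about 21 packets" figure for the 32k burst can be reproduced with a little shell arithmetic. This sketch takes the post's reading of burst as a simple byte allowance at face value (the update note at the top cautions that this reading is off), and the helper names are invented for illustration. The 9000-byte MTU used for Example #1 is an assumption.

```shell
#!/bin/sh
# Convert a tc size such as "32k", or a plain byte count, to bytes.
# tc treats the k suffix as 1024 bytes.
tc_size_bytes() {
    case $1 in
        *k) echo $(( ${1%k} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Whole MTU-sized packets that fit in a burst allowance.
burst_packets() {
    echo $(( $(tc_size_bytes "$1") / $2 ))
}

burst_packets 32k 1500    # Example #2
burst_packets 250k 9000   # Example #1, assuming a 9000-byte MTU
```

This prints 21 and 28. Under the 9000-byte assumption, 28 packets would sit below the 45-packet queue depth measured in Example #1.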

After increasing the limit to 900 Mbits/sec:

/sbin/tc qdisc del dev p2p1 root
 /sbin/tc qdisc add dev p2p1 handle 1: root htb
 /sbin/tc class add dev p2p1 parent 1: classid 1:1 htb rate 900mbit \
         ceil 950mbit burst 32k
 /sbin/tc filter add dev p2p1 parent 1: protocol ip prio 1 u32 \
           match ip dst <daisy's IP>/32 flowid 1:1

 

bwctl -t 20  -c daisy

bwctl: Using tool: iperf3
bwctl: 27 seconds until test results available

SENDER START
Connecting to host daisy, port 5501
[ 15] local ruff port 42228 connected to daisy port 5501
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec   946 KBytes  7.75 Mbits/sec    0    263 KBytes
[ 15]   1.00-2.00   sec  12.6 MBytes   106 Mbits/sec    0   2.33 MBytes
[ 15]   2.00-3.00   sec  36.2 MBytes   304 Mbits/sec    0   5.90 MBytes
[ 15]   3.00-4.00   sec  78.8 MBytes   660 Mbits/sec    0   11.1 MBytes
[ 15]   4.00-5.00   sec   102 MBytes   860 Mbits/sec    0   13.6 MBytes
[ 15]   5.00-6.00   sec  76.2 MBytes   640 Mbits/sec    4   6.76 MBytes
[ 15]   6.00-7.00   sec  56.2 MBytes   472 Mbits/sec    0   6.79 MBytes
[ 15]   7.00-8.00   sec  56.2 MBytes   472 Mbits/sec    0   6.96 MBytes
[ 15]   8.00-9.00   sec  57.5 MBytes   482 Mbits/sec    0   7.29 MBytes
[ 15]   9.00-10.00  sec  62.5 MBytes   524 Mbits/sec    0   7.79 MBytes
[ 15]  10.00-11.00  sec  66.2 MBytes   556 Mbits/sec    0   8.45 MBytes
[ 15]  11.00-12.00  sec  72.5 MBytes   608 Mbits/sec    0   9.29 MBytes
[ 15]  12.00-13.00  sec  80.0 MBytes   671 Mbits/sec    0   10.3 MBytes
[ 15]  13.00-14.00  sec  88.8 MBytes   744 Mbits/sec    0   11.5 MBytes
[ 15]  14.00-15.00  sec   100 MBytes   839 Mbits/sec    0   12.9 MBytes
[ 15]  15.00-16.00  sec   108 MBytes   902 Mbits/sec    0   13.5 MBytes
[ 15]  16.00-17.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  17.00-18.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  18.00-19.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  19.00-20.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-20.00  sec  1.45 GBytes   625 Mbits/sec    4             sender
[ 15]   0.00-20.00  sec  1.45 GBytes   621 Mbits/sec                  receiver

iperf Done.

SENDER END

Which, at a steady 912 Mbits/sec by the end of the run, actually approaches the maximum rate for a 1 Gbit/sec receiver.

Note that, as the throughput increases, there’s less space between bursts to pace into, so burst pacing can only go so far.
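To put rough numbers on that shrinking space, here is an idealized sketch that assumes packets are spaced perfectly evenly at the configured rate, which real schedulers only approximate. The function names are made up, and the 1500-byte frame and 10 Gbit/s line rate are taken from Example #2.

```shell
#!/bin/sh
# Nanoseconds between packet starts when packets are evenly paced.
# bytes * 8 bits * 1000 / (Mbit/s) = nanoseconds.
packet_interval_ns() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Idle time left on the wire between packets: the pacing interval minus
# the time the packet itself occupies at line rate.
idle_gap_ns() {
    frame_bytes=$1; pace_mbit=$2; line_mbit=$3
    echo $(( $(packet_interval_ns "$frame_bytes" "$pace_mbit") - $(packet_interval_ns "$frame_bytes" "$line_mbit") ))
}

idle_gap_ns 1500 500 10000   # paced at 500 Mbit/s on a 10 Gbit/s link
idle_gap_ns 1500 900 10000   # paced at 900 Mbit/s on the same link
```

Under these assumptions the idle gap per packet drops from 22800 ns at 500 Mbit/s to 12133 ns at 900 Mbit/s, and it keeps shrinking as the paced rate approaches line rate.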

Although I had messed with Linux queuing disciplines about 10 years ago, this journey began with:

https://fasterdata.es.net/host-tuning/packet-pacing/htc-based-pacing/

Reportedly, things are somewhat simpler with the Fair Queuing (FQ) scheduler, which appears in the most recent versions of RH kernel trees, although it has been in the upstream kernel since 3.12:

https://fasterdata.es.net/host-tuning/packet-pacing/

Additionally, Brian Tierney writes:

If you have the time and inclination, you could try installing the ‘elrepo’ 4.8 kernel with FQ support on CentOS6 (instructions here: https://fasterdata.es.net/host-tuning/linux/recent-tcp-enchancements/)

Then you can just use iperf3 and/or bwctl with the “-b” flag, which uses FQ via the ‘setsockopt’ system call to set the SO_MAX_PACING_RATE option.

This should have the same end result, and is easier than messing with htb.
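For comparison with the HTB setup, here is a sketch of what the scheduler side of an FQ-based approach might look like, using fq's maxrate parameter. I have not tested this on the paths above. The function name is made up, the device and rate are placeholders, and, unlike the u32 filter used with htb, maxrate caps every flow leaving the interface rather than only traffic to one destination. As before, the function prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not run) an FQ-based alternative to the HTB setup.
# Requires a kernel that ships the fq qdisc.
fq_pace_cmds() {
    dev=$1; maxrate=$2
    # Remove any existing root qdisc first.
    echo "tc qdisc del dev $dev root"
    # fq paces each flow; maxrate caps the per-flow sending rate.
    echo "tc qdisc add dev $dev root fq maxrate $maxrate"
}

fq_pace_cmds eth0 3gbit
```

With fq installed as the root qdisc, the per-socket pacing rate that the “-b” flag requests has a scheduler to act on, so the maxrate cap is optional in that case.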

This is really part 2 of “Per-Packet Provenance”, but part 1, to be entitled “Notes On UDP Throughput Testing”, turned into a whole Thing. Stay tuned.


3 thoughts on “Per-Packet Provenance, Part 1: Managing Burst Size To Sidestep Queue Drops”

  • jim warner says:

    Great report.

    The report says: Using nuttcp’s burst mode, the queue depth at the PHY change was measured at about 45 packets.

    What size packets? Or did you check several packet sizes and find the answer to always be “45”?

    Thanks!

  • Alan Whinery says:

    The answer to “what size packets did you use in that test?” was 8192 payload bytes + 42 bytes of header + framing. That’s simply because the default payload size in nuttcp is 8192.

    The answer to “what packet size led to a count of 45 packets?” is — although I think I tried other sizes, and that it didn’t matter, one gets to a point of many-test-result-cognitive-overload, and I will want to bring actual results to the discussion. You open up the interesting (obvious?) question of what kind of buffer overages we saw — running out of RAM, versus exceeding the number of queue-able packet descriptors with adequate RAM. Since the original symptoms of the problem being worked included having OK performance at 1500 MTU but horrible performance at 9000 MTU, RAM limit may have been an issue.

    I do have the results of those tests, somewhere. I usually draft an email to nobody and leave it in the draft folder when I have a bunch of screenfuls that I don’t want to give up.

    I had intended to do another post about burst application to find queue depths, and to repeat the aforementioned test, I would have to get the mainland side to run a stand-alone nuttcp server again.

    I also posit that one could, without the standalone nuttcp server, troubleshoot a shallow queue with scheduler-based burst limiting as described in this post.

    (also iperf3 mentions a similar burst mode in the built-in help under “-b”)

  • jim warner says:

    This is a nitpick. 10.000 Gb/s (LAN PHY) and 9.953 Gb/s (WAN PHY) should not be compared. The first is the payload data bit rate. The second is the optical bit rate that carries the burden of line coding. 10 GE uses 64B/66B line coding, so the numbers that should be compared are 10.3 Gb/s and 9.953 Gb/s. The speed mismatch is worse. Writing 1’s and 0’s directly onto the fiber without a line code would not guarantee the transition density needed for clock recovery, so everybody uses a line code. Read more (if you care) at:

    http://www.ieee802.org/3/bn/public/mar13/hajduczenia_3bn_04_0313.pdf