Per-Packet Provenance, Part 1: Managing Burst Size To Sidestep Queue Drops
[I’m reporting on work done with the guidance of Darrell Newcomb, and the assistance of John Hess (John guided me too, but that was on other stuff). I should probably also acknowledge the others who have no doubt told me about this in a presentation somewhere, or elsewhere, and I forgot about it.]
[update: although the invocations of tc cited here do the trick, I should point out that at the time I did this, I was misapprehending the meaning of “burst” in the scheduler spec. I need to circle back and clean this up…]
Sometimes a path just doesn’t perform as it should. Usually this comes to light on Friday afternoon, just after anyone who could assist has left the office. Sometimes you can’t wait until Monday. If performance on a long path is cripplingly bad due to a shallow queue somewhere along the way, and you only have access to the sending end, what can you do to band-aid the situation until Monday, or until the duly appointed maintenance window?
What follows are a couple of real-life examples of how to make a network path with a shallow queue behave better when you don’t have access to the box with the shallow queue.
These are real results from troubleshooting inquiries. All hostnames have been replaced with the names of dogs from comic strips.
Real-Life Example #1:
Across a 49 ms RTT, 10 Gbit/s path, which includes a long hop under an ocean, the following was observed:
[root@odie whinery]# bwctl -c snoopy
bwctl: Using tool: iperf3
bwctl: 15 seconds until test results available

SENDER START
Connecting to host snoopy, port 5795
[ 15] local odie port 46902 connected to snoopy port 5795
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  6.14 MBytes  51.5 Mbits/sec    2    323 KBytes
[ 15]   1.00-2.00   sec  6.57 MBytes  55.1 Mbits/sec    1    201 KBytes
[ 15]   2.00-3.00   sec  5.17 MBytes  43.4 Mbits/sec    0    350 KBytes
[ 15]   3.00-4.00   sec  11.1 MBytes  92.9 Mbits/sec    0    891 KBytes
[ 15]   4.00-5.00   sec  23.9 MBytes   200 Mbits/sec    4    708 KBytes
[ 15]   5.00-6.00   sec  15.5 MBytes   130 Mbits/sec    0    883 KBytes
[ 15]   6.00-7.00   sec  9.18 MBytes  77.0 Mbits/sec    1    516 KBytes
[ 15]   7.00-8.00   sec  12.7 MBytes   107 Mbits/sec    0    883 KBytes
[ 15]   8.00-9.00   sec  27.6 MBytes   231 Mbits/sec    0   1.97 MBytes
[ 15]   9.00-10.00  sec  20.2 MBytes   169 Mbits/sec    2    446 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec   138 MBytes   116 Mbits/sec   10             sender
[ 15]   0.00-10.00  sec   134 MBytes   113 Mbits/sec                  receiver

iperf Done.
SENDER END
Wow. For an end-to-end 10 GbE path, that’s horrible. What’s going on is that there is a shallow queue in the middle with a small decrease in “wire speed”, stepping down from 10.000 Gbit/s to 9.953 Gbit/s (a LAN-PHY to WAN-PHY change). Using nuttcp‘s burst mode, the queue depth at the PHY change was measured at about 45 packets. Since the queue is only 45 packets deep, a burst only needs to exceed the drain rate for about 325 microseconds before the queue overflows. So the problem isn’t visible as a high-rate flow, or as a peak on an MRTG graph. Since the more-than-325-microsecond bursts that overflow the queue are surrounded, temporally, by silence (there are gaps between the bursts), the sender can be disciplined to spread packets out and avoid those bursts.
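(As a rough sanity check on that figure, assuming 9000-byte jumbo frames on this path, which is my assumption rather than anything measured above, the arithmetic works out like so:)

# How long does a 45-packet burst take to arrive at 10 Gbit/s?
# The 9000-byte frame size is assumed, not measured.
echo "45 * 9000 * 8 / 10000000000 * 1000000" | bc -l    # ~324 (microseconds)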
Enter Linux Traffic Control (tc) and the Hierarchical Token Bucket (HTB) queuing discipline, which let us pace packets out of the local interface and limit the burst size as well.
After a bit of fiddling, a sweet spot was found by limiting both rate and burst size:
[root@odie whinery]# /sbin/tc qdisc del dev eth0 root
[root@odie-ps whinery]# /sbin/tc qdisc add dev eth0 handle 1: root htb
[root@odie-ps whinery]# /sbin/tc class add dev eth0 parent 1: classid 1:1 \
    htb rate 3000mbit ceil 4000mbit burst 250k
[root@odie-ps whinery]# /sbin/tc filter add dev eth0 parent 1: protocol ip \
    prio 1 u32 match ip dst <snoopy's IP>/32 flowid 1:1
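While a test runs, it’s worth checking that the filter is actually matching and that traffic is going through the shaped class. These tc statistics commands weren’t part of the session above, but they’re the usual way to verify:

/sbin/tc -s qdisc show dev eth0            # per-qdisc byte/packet counters
/sbin/tc -s class show dev eth0            # class 1:1 counters: sent bytes, drops, overlimits
/sbin/tc filter show dev eth0 parent 1:    # confirm the u32 match is installed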
And voila:
[root@melody-ps whinery]# bwctl -c snoopy
bwctl: Using tool: iperf3
bwctl: 54 seconds until test results available

SENDER START
Connecting to host snoopy, port 5783
[ 15] local odie port 50396 connected to snoopy port 5783
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  20.9 MBytes   175 Mbits/sec    0   3.72 MBytes
[ 15]   1.00-2.00   sec   235 MBytes  1971 Mbits/sec    0   19.6 MBytes
[ 15]   2.00-3.00   sec   368 MBytes  3083 Mbits/sec    0   20.3 MBytes
[ 15]   3.00-4.00   sec   389 MBytes  3261 Mbits/sec    0   20.4 MBytes
[ 15]   4.00-5.00   sec   371 MBytes  3114 Mbits/sec    0   20.7 MBytes
[ 15]   5.00-6.00   sec   379 MBytes  3177 Mbits/sec    0   20.9 MBytes
[ 15]   6.00-7.00   sec   389 MBytes  3261 Mbits/sec    0   20.9 MBytes
[ 15]   7.00-8.00   sec   391 MBytes  3282 Mbits/sec    0   20.9 MBytes
[ 15]   8.00-9.00   sec   391 MBytes  3282 Mbits/sec    0   20.9 MBytes
[ 15]   9.00-10.00  sec   385 MBytes  3230 Mbits/sec    0   21.1 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec  3.24 GBytes  2784 Mbits/sec    0             sender
[ 15]   0.00-10.00  sec  3.23 GBytes  2773 Mbits/sec                  receiver

iperf Done.
SENDER END
Which isn’t 10 Gbits/sec, but it is a 24-fold increase in transferred data volume, using the same shallow queue.
Real-Life Example #2:
A 9-hop, 120 ms RTT path from a 10 Gbits/sec sender to a 1 Gbit/sec receiver. All intermediate hops are connected at 10 Gbits/sec.
With no burst control:
#bwctl -c daisy
bwctl: Using tool: iperf3
bwctl: 17 seconds until test results available

SENDER START
Connecting to host daisy, port 5417
[ 15] local ruff port 52915 connected to daisy port 5417
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  1.17 MBytes  9.78 Mbits/sec    0    305 KBytes
[ 15]   1.00-2.00   sec  8.24 MBytes  69.1 Mbits/sec   86    542 KBytes
[ 15]   2.00-3.00   sec  2.89 MBytes  24.2 Mbits/sec   80    386 KBytes
[ 15]   3.00-4.00   sec  3.14 MBytes  26.3 Mbits/sec    0    433 KBytes
[ 15]   4.00-5.00   sec  4.54 MBytes  38.1 Mbits/sec    0    628 KBytes
[ 15]   5.00-6.00   sec  6.00 MBytes  50.4 Mbits/sec    0    932 KBytes
[ 15]   6.00-7.00   sec  8.75 MBytes  73.4 Mbits/sec    0   1.34 MBytes
[ 15]   7.00-8.00   sec  14.9 MBytes   125 Mbits/sec    0   1.97 MBytes
[ 15]   8.00-9.00   sec  21.2 MBytes   178 Mbits/sec    0   2.78 MBytes
[ 15]   9.00-10.00  sec  27.5 MBytes   231 Mbits/sec    0   3.73 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec  98.4 MBytes  82.5 Mbits/sec  166             sender
[ 15]   0.00-10.00  sec  93.1 MBytes  78.1 Mbits/sec                  receiver

iperf Done.
SENDER END
Add a pinch of encouragement:
#/sbin/tc qdisc del dev eth0 root
#/sbin/tc qdisc add dev eth0 handle 1: root htb
#/sbin/tc class add dev eth0 parent 1: classid 1:1 htb rate 500mbit ceil 550mbit burst 32k
#/sbin/tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip dst <daisy's IP>/32 flowid 1:1
Pretty good at 500 Mbits/sec and a 32k burst (about a 21-packet burst limit with MTU=1500):
#bwctl -c daisy
bwctl: Using tool: iperf3
bwctl: 46 seconds until test results available

SENDER START
Connecting to host daisy, port 5419
[ 15] local ruff port 57295 connected to daisy port 5419
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec  1.30 MBytes  10.9 Mbits/sec    0    338 KBytes
[ 15]   1.00-2.00   sec  14.4 MBytes   121 Mbits/sec    0   2.72 MBytes
[ 15]   2.00-3.00   sec  45.0 MBytes   377 Mbits/sec    0   7.51 MBytes
[ 15]   3.00-4.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   4.00-5.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   5.00-6.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   6.00-7.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   7.00-8.00   sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
[ 15]   8.00-9.00   sec  60.0 MBytes   503 Mbits/sec    0   7.99 MBytes
[ 15]   9.00-10.00  sec  61.2 MBytes   514 Mbits/sec    0   7.99 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-10.00  sec   488 MBytes   410 Mbits/sec    0             sender
[ 15]   0.00-10.00  sec   480 MBytes   402 Mbits/sec                  receiver

iperf Done.
SENDER END
After increasing the rate limit to 900 Mbits/sec:
/sbin/tc qdisc del dev p2p1 root
/sbin/tc qdisc add dev p2p1 handle 1: root htb
/sbin/tc class add dev p2p1 parent 1: classid 1:1 htb rate 900mbit \
    ceil 950mbit burst 32k
/sbin/tc filter add dev p2p1 parent 1: protocol ip prio 1 u32 \
    match ip dst <daisy's IP>/32 flowid 1:1
bwctl -t 20 -c daisy
bwctl: Using tool: iperf3
bwctl: 27 seconds until test results available

SENDER START
Connecting to host daisy, port 5501
[ 15] local ruff port 42228 connected to daisy port 5501
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 15]   0.00-1.00   sec   946 KBytes  7.75 Mbits/sec    0    263 KBytes
[ 15]   1.00-2.00   sec  12.6 MBytes   106 Mbits/sec    0   2.33 MBytes
[ 15]   2.00-3.00   sec  36.2 MBytes   304 Mbits/sec    0   5.90 MBytes
[ 15]   3.00-4.00   sec  78.8 MBytes   660 Mbits/sec    0   11.1 MBytes
[ 15]   4.00-5.00   sec   102 MBytes   860 Mbits/sec    0   13.6 MBytes
[ 15]   5.00-6.00   sec  76.2 MBytes   640 Mbits/sec    4   6.76 MBytes
[ 15]   6.00-7.00   sec  56.2 MBytes   472 Mbits/sec    0   6.79 MBytes
[ 15]   7.00-8.00   sec  56.2 MBytes   472 Mbits/sec    0   6.96 MBytes
[ 15]   8.00-9.00   sec  57.5 MBytes   482 Mbits/sec    0   7.29 MBytes
[ 15]   9.00-10.00  sec  62.5 MBytes   524 Mbits/sec    0   7.79 MBytes
[ 15]  10.00-11.00  sec  66.2 MBytes   556 Mbits/sec    0   8.45 MBytes
[ 15]  11.00-12.00  sec  72.5 MBytes   608 Mbits/sec    0   9.29 MBytes
[ 15]  12.00-13.00  sec  80.0 MBytes   671 Mbits/sec    0   10.3 MBytes
[ 15]  13.00-14.00  sec  88.8 MBytes   744 Mbits/sec    0   11.5 MBytes
[ 15]  14.00-15.00  sec   100 MBytes   839 Mbits/sec    0   12.9 MBytes
[ 15]  15.00-16.00  sec   108 MBytes   902 Mbits/sec    0   13.5 MBytes
[ 15]  16.00-17.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  17.00-18.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  18.00-19.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
[ 15]  19.00-20.00  sec   109 MBytes   912 Mbits/sec    0   13.6 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[ 15]   0.00-20.00  sec  1.45 GBytes   625 Mbits/sec    4             sender
[ 15]   0.00-20.00  sec  1.45 GBytes   621 Mbits/sec                  receiver

iperf Done.
SENDER END
Which actually approaches the maximum achievable rate with a 1 Gbit/sec receiver.
Note that, as the throughput increases, there’s less space between bursts to pace into, so burst pacing can only go so far.
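Since the same three tc commands get repeated above with different numbers each time, here is a small wrapper that makes sweeping rate and burst values less tedious. This is only a sketch: the script name, argument defaults, and interface are placeholders, not anything from the sessions above.

#!/bin/bash
# pace-to.sh -- hypothetical convenience wrapper around the tc invocations above.
# Usage: pace-to.sh <dest-ip> [rate] [ceil] [burst] [dev]
DEST=${1:?usage: $0 dest-ip [rate] [ceil] [burst] [dev]}
RATE=${2:-500mbit}
CEIL=${3:-550mbit}
BURST=${4:-32k}
DEV=${5:-eth0}

# Start clean (ignore the error if no qdisc is installed yet), then recreate
# the HTB root, the rate/burst-limited class, and the filter that steers
# traffic destined for $DEST into that class.
/sbin/tc qdisc del dev "$DEV" root 2>/dev/null
/sbin/tc qdisc add dev "$DEV" handle 1: root htb
/sbin/tc class add dev "$DEV" parent 1: classid 1:1 \
    htb rate "$RATE" ceil "$CEIL" burst "$BURST"
/sbin/tc filter add dev "$DEV" parent 1: protocol ip prio 1 u32 \
    match ip dst "$DEST"/32 flowid 1:1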
Although I had messed with Linux queuing disciplines about 10 years ago, this journey began with:
https://fasterdata.es.net/host-tuning/packet-pacing/htc-based-pacing/
Reportedly, things are somewhat simpler with the Fair Queuing (FQ) scheduler, which appears in the most recent RH kernel trees, although it has been in the upstream kernel since 3.12:
https://fasterdata.es.net/host-tuning/packet-pacing/
Additionally, Brian Tierney writes:
If you have the time and inclination, you could try installing the ‘elrepo’ 4.8 kernel with FQ support on CentOS6 (instructions here: https://fasterdata.es.net/host-tuning/linux/recent-tcp-enchancements/)
Then you can just use iperf3 and/or bwctl with the “-b” flag, which uses FQ via the ‘setsockopt’ system call to set the SO_MAX_PACING_RATE option.
This should have the same end result, and is easier than messing with htb.
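For comparison, here is roughly what the FQ approach looks like from the shell. This is just a sketch, assuming a kernel that ships the fq qdisc; the interface name and the 900mbit rate are placeholders:

/sbin/tc qdisc del dev eth0 root
# fq paces each flow; maxrate caps the per-flow sending rate.
/sbin/tc qdisc add dev eth0 root fq maxrate 900mbit
# Recent iperf3 versions also offer an --fq-rate option for socket-level pacing.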
This is really part 2 of “Per-Packet Provenance”, but part 1, to be entitled “Notes On UDP Throughput Testing”, turned into a whole Thing. Stay tuned.