A perfSONAR User's Group

Iperf “self-tests”

From a performance-node-users@internet2.edu thread:

https://lists.internet2.edu/sympa/arc/performance-node-users/2013-05/msg00048.html

An answer from Ezra Kissel at I.U., which illuminates some things relevant to understanding throughput and tester validation.

My original question: What is the value or meaning of results from a “self test”, as would occur when one runs an iperf server on a host and then runs the client on the same host, testing against the local server? (See the list archive thread above for more context.)

Ezra says:

What you basically end up testing with a loopback throughput test like iperf or similar is a measure of how fast the host can copy to/from memory. With TCP sockets and a typical send/recv loop, you’re moving the user space buffer into a kernel buffer, doing the TCP transfer, and then copying from a kernel buffer back into the receiving user space buffer. 
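That send/recv loop boils down to something like the following sketch (illustrative C; send_loop is a made-up name, not iperf’s actual source). The write() here is the user-to-kernel copy, and the receiving side’s read() is the copy back out:

/* Illustrative iperf-style send loop (a sketch, not iperf's actual
 * source). Each write() copies buf from user space into kernel socket
 * buffers; the peer's read() performs the copy back out. */
#include <string.h>
#include <unistd.h>

static long send_loop(int sock, size_t total_bytes)
{
    char buf[128 * 1024];              /* user space buffer */
    size_t sent = 0;

    memset(buf, 'x', sizeof(buf));     /* dummy payload */
    while (sent < total_bytes) {
        /* the user-to-kernel copy that dominates the profile below */
        ssize_t n = write(sock, buf, sizeof(buf));
        if (n < 0)
            return -1;
        sent += (size_t)n;
    }
    return (long)sent;
}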

Here’s the output of the Linux perf profiling tool while running an iperf loopback test on my laptop: 

PerfTop: 8738 irqs/sec  kernel:94.8%  exact: 0.0% [4000Hz cycles], (all, 4 CPUs)
--------------------------------------------------------------------------------

 49.36%  [kernel]    [k] copy_user_enhanced_fast_string
  2.41%  [kernel]    [k] __alloc_pages_nodemask
  1.82%  [kernel]    [k] get_page_from_freelist
  1.64%  [kernel]    [k] tcp_sendmsg
  1.56%  [kernel]    [k] put_page
  1.36%  [kernel]    [k] __free_one_page
  1.11%  [kernel]    [k] list_del
  0.99%  [kernel]    [k] do_raw_spin_lock
  0.80%  [kernel]    [k] put_page_testzero
  0.78%  [kernel]    [k] __alloc_skb
  0.77%  [kernel]    [k] tcp_recvmsg
  0.69%  libxul.so   [.] 0x0000000000ca3d7f
  0.65%  [kernel]    [k] skb_release_data

$ numactl -C 3 iperf -c 127.0.0.1 -t 120 -i 2
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 5001
TCP window size: 169 KByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 37416 connected with 127.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  6.56 GBytes  28.2 Gbits/sec
[  3]  2.0- 4.0 sec  6.62 GBytes  28.4 Gbits/sec
[  3]  4.0- 6.0 sec  9.34 GBytes  40.1 Gbits/sec
[  3]  6.0- 8.0 sec  10.0 GBytes  43.0 Gbits/sec
[  3]  8.0-10.0 sec  10.0 GBytes  43.1 Gbits/sec
[  3] 10.0-12.0 sec  10.0 GBytes  43.1 Gbits/sec
[  3] 12.0-14.0 sec  10.0 GBytes  43.0 Gbits/sec

You can see that a large chunk of time is spent just copying the user space buffer. Then there’s some amount of time actually doing the TCP transfer and in-kernel SKB allocation and cleanup. These types of tests are typically CPU bound, as in the iperf client is using 100% of a single core to run the test. This is Sys%, not User%, btw. 

If you can reduce the amount of user space copying, e.g. with the Linux splice() call, you can reduce the CPU overhead. In that case, you basically run into the limit of your memory speed and bus bandwidth. 
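The thread doesn’t include the source of the splice()-based tool used below (xfer_test), but the basic splice() pattern looks roughly like this sketch, which assumes a connected TCP socket and a source file descriptor, with a pipe as the in-kernel staging buffer:

/* Sketch of the splice() pattern (hypothetical; the actual xfer_test
 * source is not in the thread). Pages move src_fd -> pipe -> socket
 * entirely inside the kernel, avoiding the user space copy. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static long splice_send(int src_fd, int sock, size_t total_bytes)
{
    int pipefd[2];
    size_t sent = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (sent < total_bytes) {
        /* pull up to 64 KB of pages from the source into the pipe */
        ssize_t in = splice(src_fd, NULL, pipefd[1], NULL, 65536,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0)
            break;
        /* drain the pipe into the socket, still with no user copy */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, sock, NULL,
                                 (size_t)in,
                                 SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0)
                goto done;
            in -= out;
            sent += (size_t)out;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return (long)sent;
}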

Here’s a program that uses splice(), for example:

PerfTop: 7075 irqs/sec  kernel:96.7%  exact: 0.0% [4000Hz cycles], (all, 4 CPUs)
--------------------------------------------------------------------------------

 14.83%  [kernel]     [k] copy_user_enhanced_fast_string
  8.55%  [kernel]     [k] do_raw_spin_lock
  3.06%  [kernel]     [k] tcp_sendpage
  2.00%  [kernel]     [k] follow_page
  1.83%  [kernel]     [k] put_page_testzero
  1.61%  [kernel]     [k] _local_bh_enable_ip.isra.10
  1.39%  [kernel]     [k] tcp_transmit_skb
  1.29%  [kernel]     [k] tcp_ack
  1.26%  [kernel]     [k] skb_release_data
  1.23%  [ip_tables]  [k] ipt_do_table
  1.11%  [kernel]     [k] tcp_v4_rcv
  1.06%  [kernel]     [k] get_page

$ ./xfer_test -c 127.0.0.1 -t 120 -i 2 -o 16 -a 1 -A 2 -S
Using a buffer of size 65536 with 1 partitions of size 65536
[0.0-2.0 sec]    11.27 GB  45.09 Gb/s
[2.0-4.0 sec]    11.04 GB  44.14 Gb/s
[4.0-6.0 sec]    11.56 GB  46.22 Gb/s
[6.0-8.0 sec]    11.76 GB  47.04 Gb/s
[8.0-10.0 sec]   11.30 GB  45.20 Gb/s
[10.0-12.0 sec]  11.77 GB  47.07 Gb/s
[12.0-14.0 sec]  11.50 GB  45.98 Gb/s
[14.0-16.0 sec]  11.25 GB  44.98 Gb/s

In this test, the amount of time spent copying in the kernel is significantly reduced and we’re never CPU bound. The performance now fluctuates a bit more, as the test is limited by my memory and by contention for the available bandwidth from other things running on my laptop. 

An important note for tuning: for any high-performance throughput testing, you want to, at a minimum, bind the client/server threads to a single core to prevent CPU migrations. numactl can be used, or you can build this capability into your program using the affinity macros and scheduling calls in sched.h. 
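In-program, that pinning can be done with the sched.h calls Ezra mentions; a minimal sketch (pin_to_core is a hypothetical helper name):

/* Pin the calling thread to a single core via sched.h (a minimal
 * sketch of the in-program alternative to numactl/taskset). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 applies the mask to the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}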

– ezra

A further dialogue:

On 5/30/2013 2:22 PM, Alan Whinery wrote:

> Thanks Ezra, exactly the kind of clarification I was looking for.

> How would using numactl be different from using taskset (just curious)?

> In the past I have used taskset to keep multiple processes from ganging
> up on the same CPU core, although I hadn’t considered that an
> un-disciplined process would change cores during execution. If “CPU
> migrations” are at all common, then they ought to be included in the
> network-throughput-testing mindset (?). As for numactl, I first heard
> of it about 2 minutes ago…


For setting logical CPU affinity, taskset serves the same purpose. numactl also lets you set CPU and memory affinity for NUMA systems. Examples include multi-socket Sandy Bridge and Ivy Bridge systems. In those cases you will want to make sure you do not cross socket boundaries, and with real NICs, you often want IRQ affinity for the NIC bound to the same socket as your application. Achieving good application performance at >10Gb/s is really a user-driven process, for better or worse…
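On the command line, numactl’s --cpunodebind and --membind options cover this; in a program, libnuma can do the same. A sketch (assumes libnuma is installed and you link with -lnuma; bind_to_node is a hypothetical helper):

/* Bind both execution and memory allocation to one NUMA node with
 * libnuma (a sketch; compile with -lnuma). This keeps a test from
 * crossing socket boundaries on multi-socket systems. */
#include <numa.h>
#include <stdio.h>

static int bind_to_node(int node)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return -1;
    }
    if (numa_run_on_node(node) != 0)  /* run only on this node's CPUs */
        return -1;
    numa_set_preferred(node);         /* allocate memory from this node */
    return 0;
}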

Of course, you will typically not see issues with throughput testing at <10Gb/s on most modern hardware with or without affinity tuning. I imagine that probably covers a significant space of testing that goes on in the context of this list. It’s just good to keep in mind for those really fast links.

– ezra
