A perfSONAR User's Group

NPAD Tweaks

Someday I would like to do a post or page here that ties NPAD/pathdiag up into a neat, comprehensible package. I think that it is probably one of the most useful and least understood parts of perfSONAR, as it offers diagnostics that cannot be found elsewhere. If you want a good look at close-in tester validation, NPAD is the tool for you. To get an idea why, read the following paper:

http://staff.psc.edu/mathis/papers/mathis08pathdiag.pdf

The paper is not at all hard to understand, and it points out things that you have either suspected or deduced: that modern, enhanced TCP is so good at dancing around path problems that it renders those problems invisible on a short path. The NPAD tool can expose those invisible problems, when it works. This post is about the things that you can do to fix broken NPAD deployments, as well as how to tweak NPAD to function in the world of common 10 Gig paths. You may note that the currently deployed NPAD is showing its age in several respects. The goal here is to realize the value of NPAD as it currently exists in the wild, and perhaps to inform further development.

All of the following assumes that you are working on a relatively up-to-date (3.2 or later) pS-Performance Toolkit node. If you have installed NPAD yourself on a non-toolkit node, you may have to figure out which directories, user IDs, et cetera, apply.

You are responsible for backing up files before you change them. 

Things You Should Do Right Now

(All of which, I believe, can be accomplished by upgrading to toolkit 3.3 when it is released.)

  1. Make sure that /var/lib/npad/ServerData/summary.html is owned by user “npad”. For example, become root, and do “chown npad:npad /var/lib/npad/ServerData/summary.html”.
  2. Install numpy ( yum install numpy ).
  3. Install the Gnuplot.py python module (http://sourceforge.net/projects/gnuplot-py/ )
  4. Install gnuplot ( yum install gnuplot )
  5. Compile diag-client.c and keep it handy

Item 1 will fix a widely observed problem with report presentation, although there may be other report generation issues.
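
A quick way to see whether item 1 applies to a node is to ask who owns the file. The following is a minimal Python sketch, assuming the Toolkit path quoted in item 1 and a Python 3 interpreter; the function names are made up for illustration and are not part of NPAD.

```python
import os
import pwd

# Path taken from item 1 above; adjust it on a non-toolkit install.
SUMMARY = "/var/lib/npad/ServerData/summary.html"


def owner_name(path):
    """Return the login name of the file's owner (or the numeric uid as text)."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no passwd entry; report the number instead.
        return str(uid)


def needs_chown(path, expected="npad"):
    """True when the file is owned by someone other than the expected user."""
    return owner_name(path) != expected


if __name__ == "__main__":
    if not os.path.exists(SUMMARY):
        print("%s not found; is this a Toolkit node?" % SUMMARY)
    elif needs_chown(SUMMARY):
        print("%s is owned by %s, not npad" % (SUMMARY, owner_name(SUMMARY)))
    else:
        print("ownership looks fine")
```

The script only reports; the chown itself is still a job for root.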

Items 2, 3, and 4 will make report graphs work, and there is often a thousand words’ worth of info in them, especially since NPAD doesn’t explicitly report all of the interesting metrics discussed in the paper linked above.
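
To check items 2 through 4 in one go, something like the following sketch may help. It assumes that the Gnuplot.py package imports under the module name "Gnuplot" and that the gnuplot binary is simply called "gnuplot" on the PATH. It sticks to old-fashioned library calls in the hope that it can be run with the same Python interpreter NPAD uses (modules are installed per interpreter), but that is an untested assumption. The helper names are hypothetical.

```python
import os


def have_module(name):
    """True if 'import name' succeeds in this interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


def have_program(name):
    """True if an executable file called name is found on the PATH."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def missing_pieces(modules=("numpy", "Gnuplot"), programs=("gnuplot",)):
    """List whatever the report graphs need that is not installed."""
    missing = []
    for name in modules:
        if not have_module(name):
            missing.append("python module " + name)
    for name in programs:
        if not have_program(name):
            missing.append("program " + name)
    return missing


if __name__ == "__main__":
    for item in missing_pieces():
        print("missing: " + item)
```

An empty result only means the pieces are present, not that graph generation works end to end.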

Item 5 gives you a command line client, to be invoked as:

./diag-client <npad-server-address> <npad-port> <rtt> <rate>

This is probably more often relevant to the perfSONAR R&E network scenario than the browser-based client. Helpful hints: the <npad-port> is most often 8001, <rtt> is in milliseconds, <rate> appears to be in megabits per second (the run shown at the end of this post used 70 and 10000), and the report will end up on the server you tested against. Go to the NPAD page on the server in question and click “Results Summary” in the top header to see results, which should work unless you still need to take care of item 1 above. Since CentOS does not offer a non-X11 gnuplot (like “gnuplot-nox” in Debian/Ubuntu), you will see errors about fonts, displays, and IOErrors at the end of the diag-client run. You can safely ignore these gnuplot-related messages.
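
If you end up testing against several servers, wrapping the command line in a small script saves retyping. This is only a sketch: the argument order and the 8001 default come from the text above, while the function names, the "./diag-client" location, and the example host name are placeholders. It assumes Python 3 on the machine where you compiled the client.

```python
import subprocess


def diag_client_argv(server, rtt_ms, rate, port=8001, client="./diag-client"):
    """Build the argument list: client, server address, port, rtt, rate."""
    return [client, server, str(port), str(rtt_ms), str(rate)]


def run_diag_client(server, rtt_ms, rate, port=8001, client="./diag-client"):
    """Run the client and return its exit status; output goes to the terminal."""
    return subprocess.call(diag_client_argv(server, rtt_ms, rate, port, client))


if __name__ == "__main__":
    # Placeholder host; nothing is run here, the command is only printed.
    print(" ".join(diag_client_argv("npad.example.net", 70, 10000)))
```

Calling run_diag_client() with a real server address starts an actual test, so expect it to block until the test finishes or the watchdog fires.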

Things You Can Try, Maybe Permanently

In the age of common 10 Gigabit paths, facing the age of 100 Gigabit paths and beyond, NPAD could use some adjustments in the way it scans the “window space”. Typical modern window space is wider than it was a few years ago, and it’s still necessary to search from the pre-queuing point to the congestion loss point in order to have full results. As a result, you may encounter the “Exceeded running time limit” message when running a test. My observation has been that most tests along 10 Gig paths will complete within 10-15 minutes, while the default “watchdog” timeout is only 360 seconds (6 minutes). To tweak the timeout, you can edit /opt/npad/DiagServerConfig.py. The Toolkit 3.2.2 default looks like:

#
# This file is auto-generated by config.py in the NPAD Diagnostic Server
# distribution. Hand-editing is not recommended. Re-run config.py with
# the '-p' flag to change values.
#

LOGBASE_URL = "ServerData"
LOGBASE_FILE = "/var/lib/npad/ServerData"
CONTROL_ADDR = ""
CONTROL_PORT = 8001
TEST_PORTRANGE_MIN = 8002
TEST_PORTRANGE_MAX = 8020
THREADS = 1
PATHDIAG_PATH = "/opt/npad/pathdiag.py"
MKDATASUMMARY_PATH = "/opt/npad/mkdatasummary.py"
MAX_SOURCESINK_TIME = 60
WATCHDOG_TIME = 360
WC_PATH = "/usr/lib/python2.4/site-packages"
WWW_DIR = "/var/lib/npad"
WWW_PORT = 8000

First of all, note the comment at the top that suggests you should not edit by hand. The referenced “config.py” is not included with the toolkit, and is pretty much install-time biased. The Toolkit installation has no automatic generation of this file, and a hand edit will not be overwritten. So for the purposes of tweaking the time limit, it’s probably OK to edit the file. If you make the “WATCHDOG_TIME” limit 1200 seconds, most tests that are going to finish will have time to complete, and tests that won’t finish will be caught within a reasonable time. Even in the NPAD design target case, with its watchdog time of a few minutes, a test wasn’t something most of us wanted to stare at while it ran, so 20 minutes is a reasonable “run test and go do something else” time. There is probably a re-focus of NPAD coming, and further optimization of the coarse scan interval, intermediate medium-resolution scans, or additional scan heuristics could be in the works for the time when paths faster than 10 Gig become more prevalent.
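
For anyone who would rather script the change than open an editor, here is a minimal sketch in Python 3. The file path and the WATCHDOG_TIME name come from the listing above; the function names and the ".bak" backup suffix are illustrative choices. It refuses to touch a file that does not contain exactly one WATCHDOG_TIME line.

```python
import re
import shutil

CONFIG = "/opt/npad/DiagServerConfig.py"  # path from the text above


def set_watchdog(config_text, seconds):
    """Return config_text with the WATCHDOG_TIME line set to the given seconds."""
    pattern = re.compile(r"^WATCHDOG_TIME\s*=.*$", re.MULTILINE)
    new_text, count = pattern.subn("WATCHDOG_TIME = %d" % seconds, config_text)
    if count != 1:
        raise ValueError("expected one WATCHDOG_TIME line, found %d" % count)
    return new_text


def patch_file(path=CONFIG, seconds=1200):
    """Back up the config next to itself, then rewrite it in place."""
    with open(path) as handle:
        original = handle.read()
    updated = set_watchdog(original, seconds)
    shutil.copy2(path, path + ".bak")  # keep a copy before changing anything
    with open(path, "w") as handle:
        handle.write(updated)
```

Running patch_file() as root would apply the 1200 second limit suggested above. The server presumably reads the file at startup, so expect to restart NPAD afterwards.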

Also, the wider window space at current scan resolutions makes the graphs a little crowded, and one band-aid you may choose is to simply make the graph canvas bigger. In order to do this, you can change the numbers in /opt/npad/pathdiag.py, line ~711:

set terminal png size 600, 450

Here “600, 450” are the dimensions of the gnuplot canvas. As an experiment, I doubled mine to 1200, 900, which makes the graph less crowded and still fits on my desktop screen.

Things You Should NOT Do, Even Though They’re Supremely Awesome

NPAD is designed to evaluate short paths, because it is intended to expose those invisible short-path flaws. But what if those near-network tests all come out clean, and your long path of interest still suffers from poor performance? NPAD’s pre-congestion loss and the coherence of the graph output on a long path can be useful. I spent some time trying to make NPAD work with large bandwidth-delay products, and ran into several interesting obstacles. One is the size of the window that pathdiag.py asks for. It calculates the BDP in bytes as

 125 * BW-in-megabits-per-second * Delay-in-milliseconds (the “125” is 1000 * 0.125: the 1000 comes from converting megabits to bits and milliseconds to seconds, and the 0.125 converts bits to bytes)

and then tries to request 2 * BDP as a safety margin. The Toolkit comes with a default upper bound for TCP socket buffers of 16 MB (16777216), which covers what’s required within 67 ms RTT at 1 Gbps, or 6.7 ms at 10 Gbps (remember, pathdiag.py tries to get 2X BDP). If the test is unable to observe the congestion loss point, some uncertainty will remain about the path. Still, an unfinished-yet-reporting test may offer useful information.
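
The arithmetic is easy to replay. The sketch below is not pathdiag.py’s code, just the formula from the text under the assumption that the rate is given in megabits per second and the RTT in milliseconds (that is the reading under which the factor of 125 works out). The function names are made up for illustration.

```python
def bdp_bytes(rate_mbps, rtt_ms):
    """Bandwidth-delay product in bytes: 125 * Mbit/s * ms."""
    return 125 * rate_mbps * rtt_ms


def requested_window(rate_mbps, rtt_ms):
    """The 2 * BDP that the text says pathdiag.py tries to get."""
    return 2 * bdp_bytes(rate_mbps, rtt_ms)


def max_rtt_ms(rate_mbps, buffer_bytes=16777216):
    """Largest RTT whose 2 * BDP still fits in the given socket buffer limit."""
    return buffer_bytes / (2.0 * 125 * rate_mbps)


if __name__ == "__main__":
    print(bdp_bytes(10000, 70))           # 10 Gbps at 70 ms: 87500000
    print(round(max_rtt_ms(1000), 1))     # 1 Gbps with 16 MB: 67.1
    print(round(max_rtt_ms(10000), 2))    # 10 Gbps with 16 MB: 6.71
```

The first figure, 87,500,000 bytes, is close to the peakwin=87532556 that diag-client prints for a 70 ms, 10000 run in the transcript at the end of this post, which supports the megabits-per-second reading.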

After fishing around for how much window pathdiag wanted, I finally set:

net.core.rmem_max = 1073741824
net.core.wmem_max = 1073741824
net.ipv4.tcp_wmem = 4096 87380 1073741824
net.ipv4.tcp_rmem = 4096 87380 1073741824

(yes, a max of 1 GByte) on two machines each with 4GB total system RAM, and so far haven’t crashed anything. Note that “I haven’t crashed anything” is very different from “I recommend that you do this”.

Unresolved Issues

There may be more than one issue with report generation and display. One set of symptoms is simply that the results summary is blank. That’s the summary.html ownership thing mentioned above. There’s also a problem that produces “404 File Not Found” when trying to display reports, except when one uses the web applet, which auto-forwards to the results page; that route works.

Finally, I encountered the following error message, which I haven’t yet understood. It may be a kernel limitation, or something (diag-client output):

Using: rtt 70 ms and rate 10000
Connected.
Control connection established.
port = 8003
Starting test.
Parameters based on 9783 pkts estimated peak window
peakwin=87532556 minpackets=3 maxpackets=40963 stepsize=4096
Target run length is 195149784 packets (or a loss rate of 0.00000051%)
Test 1a (11 seconds): Coarse Scan
Test 1b (11 seconds): ...
failed to set window: Tried 586442972, Got back 586407180
error Pathdiag failed to generate report.
