Author Archives: Philip Kearns

Tracking what process is generating network traffic

Every now and again I’ve had situations where I see Internet traffic which doesn’t correspond to any obvious activity I’ve initiated.  Well on Linux and Unix (and hence MacOS) it’s easy enough to track down the offending process.

(The commands tcpdump and lsof must be run as root or under sudo.)  Doing a tcpdump (snoop in Solaris) I could see the traffic and it was obvious which was the source/destintation causing most of the traffic:

mistral:~(1)+>- tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type EN10MB (Ethernet), capture size 65535 bytes
23:01:28.320299 IP 193.120.121.160.http > 10.1.1.183.49977: Flags [.], seq 2435402503:2435403911, ack 3052965256, win 4692, options [nop,nop,TS val 1326515797 ecr 324351108], length 1408
23:01:28.320923 IP 193.120.121.160.http > 10.1.1.183.49977: Flags [.], seq 1408:2816, ack 1, win 4692, options [nop,nop,TS val 1326515797 ecr 324351108], length 1408
23:01:28.321015 IP 10.1.1.183.49977 > 193.120.121.160.http: Flags [.], ack 2816, win 8104, options [nop,nop,TS val 324351161 ecr 1326515797], length 0
23:01:28.333327 IP 193.120.121.160.http > 10.1.1.183.49977: Flags [.], seq 2816:4224, ack 1, win 4692, options [nop,nop,TS val 1326515804 ecr 324351108], length 1408
[...]
23:01:32.717541 IP 10.1.1.183.49977 > 193.120.121.160.http: Flags [.], ack 767360, win 8104, options [nop,nop,TS val 324355362 ecr 1326520067], length 0
23:01:32.718204 IP 193.120.121.160.http > 10.1.1.183.49977: Flags [.], seq 767360:768768, ack 1, win 4692, options [nop,nop,TS val 1326520068 ecr 324355189], length 1408
23:01:32.718325 IP 10.1.1.183.49977 > 193.120.121.160.http: Flags [.], ack 768768, win 8192, options [nop,nop,TS val 324355363 ecr 1326520068], length 0
23:01:32.875095 IP 193.120.121.160.http > 10.1.1.183.49977: Flags [.], seq 768768:770176, ack 1, win 4692, options [nop,nop,TS val 1326520238 ecr 324355191], length 1408
^C
970 packets captured
1004 packets received by filter
0 packets dropped by kernel
mistral:~(2)+>-

In this example my local address is 10.1.1.183 and it can also be seen that the port of process which is talking to the Internet is 49977.  I used lsof to find out what process had that port open, and then used ps to show me the details of the process:

mistral:-(2)+>- lsof -i TCP:49977
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
SoftwareU 11684 _softwareupdate 15u IPv4 0x39d5096a0cd02d61 0t0 TCP 10.1.1.183:49977->193.120.121.160:http (ESTABLISHED)
mistral:~(3)+>- ps -lfp 11684
 UID PID PPID F CPU PRI NI SZ RSS WCHAN S ADDR TTY TIME CMD STIME
 200 11684 1 4004 0 63 0 3672308 135516 - Ss 0 ?? 0:28.45 /System/Library/ 10:58pm
mistral:~(4)+>- ps -fp 11684
 UID PID PPID C STIME TTY TIME CMD
 200 11684 1 0 10:58pm ?? 0:29.15 /System/Library/CoreServices/Software Update.app/Contents/Resources/SoftwareUpdateCheck -Check YES
mistral:~(5)+>- id
uid=0(root) gid=0(wheel) groups=0(wheel),404(com.apple.sharepoint.group.2),401(com.apple.access_screensharing),1(daemon),2(kmem),3(sys),4(tty),5(operator),8(procview),9(procmod),12(everyone),20(staff),29(certusers),33(_appstore),61(localaccounts),80(admin),98(_lpadmin),100(_lpoperator),204(_developer),403(com.apple.access_ssh)
mistral:~(6)+>-

And that’s it, the process was the automatic OS updater.

Finding a severe resource hog on your server

Have you ever experienced the situation where a server which becomes bafflingly unresponsive to the point when even your monitoring services are failing to report.  Then to have the server start responding, sometimes after a reboot?  You suspect it was a process that went berserk, but which one?  The monitoring software itself was out of action during the crisis, so it can’t tell you anything.

I have come across this quite a few times in my career, most recently in clustered database servers.  Sometimes these outages can be so severe that even very lightweight monitoring software like sar and Xymon (formerly known as Hobbit and Big Brother) can be taken out of action.  In the past I have resorted to using a loop to save the output of the command top to a file every 30 seconds, hoping to catch an event, but that is ugly for so many reasons.

SQLite to the rescue

SQLite is a serverless implementation of SQL.  It is a tiny binary—less than 50 kbyte on Mac OS X—and it stores its data in a file.  The way SQLite helps us here is that it allows us to store the data and analyse and/or trim it using all sorts of criteria.

So first create the database file:

sqlite3 storetops.sqlite3

That will give you a sqlite> prompt at which you can type this:

CREATE TABLE savetop (
HostName TEXT, -- So we know which host this data came from
DateAndTime TEXT,
Load REAL,
TopProcesses TEXT
) ;

Type .quit to exit.  (SQL professionals will balk at my not using the DATETIME type, but SQLite contains no such data type.)  You next want to write a simple script which will write to that at intervals of, say, 30 seconds.  I’ve written a basic performance-monitoring script which you can use on Linux or MacOS X.

Here is a sample of data I collected on my Mac:

sqlite> SELECT HostName,DateAndTime,Load FROM savetop ;
vger|2012-09-03 19:51:30|0.93
vger|2012-09-03 19:51:43|1.3
vger|2012-09-03 19:51:55|1.17
vger|2012-09-03 19:52:08|1.79
vger|2012-09-03 19:52:20|1.66
vger|2012-09-03 19:52:33|1.44
vger|2012-09-03 19:52:45|1.22
vger|2012-09-03 19:52:57|1.34
vger|2012-09-03 19:53:10|1.36
vger|2012-09-03 19:53:22|1.23
vger|2012-09-03 22:10:11|1.06
vger|2012-09-03 22:10:24|1.59
vger|2012-09-03 22:10:36|1.46
vger|2012-09-03 22:10:49|1.24
vger|2012-09-03 22:11:01|1.2
vger|2012-09-03 22:11:13|1.01
vger|2012-09-03 22:11:26|1.38
vger|2012-09-03 22:11:38|1.48
vger|2012-09-03 22:11:51|1.33
vger|2012-09-03 22:12:03|1.71
sqlite>

If there is an item of interest I can examine its top processes:

SELECT TopProcesses FROM savetop WHERE DateAndTime='2012-09-03 22:10:36' ;

Suppose I want to count the number of events where the load was greater than 1.4:

sqlite> SELECT COUNT(HostName) FROM savetop WHERE Load>1.4 ;
7
sqlite>

Of course this is artificial—normally we are looking at much higher high loads—however it illustrates the advantage of this approach.  If you are monitoring over a long period it’s likely the SQLite file will get very large but that is also very easy to remedy:

sqlite> SELECT COUNT(HostName) FROM savetop ;
20
sqlite> DELETE FROM savetop WHERE Load<1.4 ;
sqlite> SELECT COUNT(HostName) FROM savetop ;
7
sqlite>

Use of load rather than CPU usage

While this is beyond the immediate scope of this post, some of you might be wondering why I am using load instead of CPU usage.  Most operating systems (including Windows) have built-in strategies for handing CPU hogs, for instance in Linux and Unix the process priority for a CPU hog is automatically downgraded.  The result is a server whose CPU is flat out can be still quite usable.  Load—which represents the number of processes waiting to execute—is a much more reliable indicator of a server in distress.  High load can be caused by too many processes trying to access the CPUs and/or I/O delays, which in turn can be caused by busy or slow disks.  On Solaris and Windows you can determine if a server is CPU-bound by checking the percentage of time runnable processes (those not waiting for I/O or sleeping for other reasons) are waiting for CPUs; if this is higher than, say, 5% then the server is CPU-bound.