Category Archives: Linux/Unix

How Red Hat made Linux palatable for business

A recent blog on TechRepublic mentioned the importance of Red Hat for Linux however I don’t think the blogger, Jack Wallen, quite hit the nail on the head.

The first worthwhile Linux distribution was Debian and it continues to be the most important base for user-accessible Linux, most notably Ubuntu. Ubuntu is a superb desktop, but it is exasperating as a server OS. Debian is based on the principle of constant updates which just doesn’t work in a business environment, where configuration management is critical. Constant updating is particularly perilous with Open Source; I have seen point updates break things. Further, in a professional environment you want to stage updates. So for instance you might introduce Apache 2.2 in Test, but leave it on 2.0 for Staging and Production. Later you can introduce it in Staging and Production. This is just too difficult on Debian to bother even trying.

Enter Red Hat. RH understood the needs of business, and created a much more controllable Linux. It also introduced a new, extremely valuable, facility: the ability to stay on old versions of software without the associated risks. For instance, Apache 2.0 has a few known vulnerabilities and the only remedy from the Apache Software Foundation was to upgrade to the latest version of Apache. This left businesseses in a quandary: Upgrade and risk almost certainly breaking the corporate web site, or just hope no one notices it’s running a vulnerable web server. Red Hat had the solution: it back-ported the security fixes into Apache 2.0 and all was well. This is a service it provides for all its packages.

RHEL isn’t a flawless business OS—for instance, patch auditing is unsatisfactory—but it’s what made Linux acceptable to the business community.

Missing disk space Linux/Unix: when df disagrees with du -s

A common situation many admins find themselves in is where they quickly have to clear down disk space.  So for instance, say /u01 is filling up.  The Oracle admin knows that the database will simply stop if he doesn’t take action quickly.  With the judicious use of du -s he finds some large directories and quickly deletes a few temporary files he know the database doesn’t immediately need.  He does a ‘df -h’ to find that it hasn’t made any difference!  He then does his ‘du -s’ and it shows the space has been freed up.  He doesn’t know it, but he has deleted at least one open file whose space won’t be freed up until the process is closed.  What he should have done is this:

echo "" > offendingfile

where offendingfile is the huge file.

In the case of the Oracle admin it’s likely his only choice is to restart the database.  Consider a more general case where a Linux/Unix admin has deleted files but has lost track of where the files were and what might be using them.  Or one admin deleted the files and scarpered leaving the other trying to clean up the mess.  He is left with the bigger challenge of trying to find what process is holding what files open.

A starting point: lsof

The lsof command can be a good starting point, however you are now looking for a needle in smaller haystack, so you will have to do some further filtering.  On CentOS 6 it will mark files which have been deleted, however it seems to throw up quite a few false positives.

To illustrate the problem of open files I have created some C code which will create a big file and sleep for 1,000 seconds.  Compiling and running the binary I will get a 10 Mbyte file:


If I then remove the file I have then created the situation described above.  On CentOS 6 I could run:

lsof | fgrep '(deleted)'

but that produces 24 results (among which are files that haven’t been deleted, like /usr/bin/gnome-screensaver), so it would be a good idea to shrink the range.  For instance it’s likely in this situation that is just one file system that is full so you could grep for its mount point.  That does it nicely in our example:

[root@centos6 ~]# lsof | fgrep '(deleted)' | fgrep /var
createope 11012 admin 3u REG 253,3 10485761 693 /var/tmp/SampleBigFile (deleted)
[root@centos6 ~]#

In MacOS (Darwin) there is no ‘(deleted)’ label so go straight for checking for /var:

vger:~ root# lsof | egrep 'REG.*/var/tmp'
mysqld    346 _mysql 4u  REG 14,18        0 6217706 /private/var/tmp/ibu4Nw9X
mysqld    346 _mysql 5u  REG 14,18        0 6217707 /private/var/tmp/ib6jCfyT
mysqld    346 _mysql 6u  REG 14,18        0 6217708 /private/var/tmp/ibu9Zqxb
mysqld    346 _mysql 7u  REG 14,18        0 6217709 /private/var/tmp/iboukiVq
mysqld    346 _mysql 11u REG 14,18        0 6217710 /private/var/tmp/ibLRW39J
createope 42775 admin 3u REG 14,18 10485761 6308941 /private/var/tmp/SampleBigFile
vger:~ root#

(REG indicates a regular file.)  While our big file is clearly identifiable here, if it wasn’t you could try something like sort -k7 to sort on file size.

When all debugging routes have failed: network scans and/or code tracing

In the world of car, bike and motorbike mechanics there is a versatile tool which is something of a last resort: the vice-grips (sometimes referred to as the bodger’s tool, because of people’s tendency to shear bolts with them).  In the world of operating systems there are two tools I have found to be like vice-grips, but not potentially harmful: Network scanning and code tracing.

Network scanning

Most operating systems have a way of scanning the network:

  • Linux: tcpdump, Wireshark
  • Darwin (MacOS): tcpdump, Wireshark
  • Solaris: snoop, tcpdump, Wireshark
  • Windows: Wireshark (there is also a version of tcpdump for Windows)

So, why is network scanning useful?  Well consider the situation where you have installed the monitoring software, Xymon.  The server is already working and most of the clients are responding, but the server isn’t receiving data from one of the clients.  Xymon uses port 1984 so you can check to watch the traffic going to and from the server:

[root@host1 etc]# tcpdump port 1984
tcpdump: verbose output suppressed, use -v or for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
15:08:00.857457 IP > S 1387852790:1387852790(0) win 5840 <mss 1460,sackOK,timestamp 119364978 0,nop,wscale 2>
15:08:00.864380 IP > S 3491816971:3491816971(0) ack 1387852791 win 5792 <mss 1460,sackOK,timestamp 8108268 119364978,nop,wscale 0>
15:08:00.864553 IP > . ack 1 win 1460 <nop,nop,timestamp 119364993 8108268>15:08:00.865187 IP > . 1:1449(1448) ack 1 win 1460 <nop,nop,timestamp 119364993 8108268>
15:08:00.865419 IP > . 1449:2897(1448) ack 1 win 1460 <nop,nop,timestamp 119364994 8108268>
15:08:00.867342 IP > . ack 1449 win 8688 <nop,nop,timestamp 8108268 119364993>
15:08:00.867486 IP > P 2897:4345(1448) ack 1 win 1460 <nop,nop,timestamp 119364996 8108268>
15:08:00.867684 IP > . 4345:5793(1448) ack 1 win 1460 <nop,nop,timestamp 119364996 8108268>
15:08:00.868361 IP > . ack 2897 win 11584 <nop,nop,timestamp 8108268 119364994>
15:08:00.869032 IP > . 5793:7241(1448) ack 1 win 1460 <nop,nop,timestamp 119364997 8108268>

So in this example the traffic is going from host1 to the Xymon server’s port, so it looks like Xymon is receiving the data.  What’s wrong is that DNS knows this host as not host1 so Xymon doesn’t realise it’s receiving data for host1.  So there are a few solutions, for example you can configure host1 to explicitly tell Xymon that it is host1.

Another example was when I was trying to get some commercial software working in a firewall, where the DNS servers were locked down to resolve only addresses we allowed them too.  The documentation said that it would need to be able to resolve, say,, but it still wasn’t working.  So gave it just one DNS server and watched what addresses it asked for and sure enough it was asking for, but also for, say,  So I needed to make sure that was added to the list of addresses it could resolve.

Another nice thing about tcpdump in particular is its data can be saved to a file which can be imported into Wireshark on another server.  This is very handy if you have a sensitive host where you can’t run the GUI of Wireshark.

There’s a lot to this subject but I hope this helps.

Code tracing

When I say code tracing, I mean tracing system and library calls.  Most operating systems have a way to do this:

  • Linux: strace, ltrace
  • Darwin: dtruss, dtrace (both require root/sudo)
  • Solaris: truss, dtrace
  • Windows: (none that I can find)

In my opinion Linux has the best implementation of code tracing.  (Darwin/FreeBSD/Solaris’s DTrace  and Linux’s SystemTap are exceedingly powerful, but beyond the scope of this post.)  Suppose you want to see what environment variables a program is using:

[admin2@centos6 ~]$ ltrace -e getenv -o /tmp/tmp.adm2.ltrace vi
[admin2@centos6 ~]$ ls -l /tmp/tmp.adm2.ltrace
-rw-rw-r--. 1 admin2 admin2 1777 Oct 2 05:08 /tmp/tmp.adm2.ltrace
[admin2@centos6 ~]$ vim /tmp/tmp.adm2.ltrace
[admin2@centos6 ~]$ cat /tmp/tmp.adm2.ltrace
(0, 0, 0, 0x7fcf69b6d918, 88) = 0x3b6ec21160
getenv("HOME") = "/home/admin2"
getenv("VIM_POSIX") = NULL
getenv("SHELL") = "/bin/bash"
getenv("TMPDIR") = NULL
getenv("TEMP") = NULL
getenv("TMP") = NULL
getenv("VIM") = NULL
getenv("VIM") = NULL
getenv("VIMRUNTIME") = "/usr/share/vim/vim72"
getenv("VIM") = "/usr/share/vim"
getenv("TERM") = "xterm"
getenv("COLORFGBG") = NULL
getenv("VIMINIT") = NULL
getenv("HOME") = "/home/admin2"
getenv("EXINIT") = NULL
getenv("HOME") = "/home/admin2"
(0x3b6ec21160, 0, 0, 0x3b6ec21160, 0) = 140608
(0, 0, 0, 3, 0x963cf85) = 0x3b6ec21160
+++ exited (status 0) +++
[admin2@centos6 ~]$

So consider you have a program which is reading a configuration file from somewhere but you can’t figure out where.  The best thing is to check its open() (which will cover fopen() too), stat() and lstat().  stat and lstat check existence, permissions etc. of a closed file.  So this example uses vi (even though the esteemed Meneer Bram Moolenaar has so extensively documented vim this is a redundant example):

[admin2@centos6 ~]$ strace -e stat,lstat,open -o /tmp/tmp.adm2.strace vi
[admin2@centos6 ~]$ cat /tmp/tmp.adm2.strace
open("/etc/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
stat("/usr/share/vim/vim72", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/share/vim", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/admin2/.terminfo", 0x7fff4b7dbb00) = -1 ENOENT (No such file or directory)
stat("/etc/terminfo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/share/terminfo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/share/terminfo/x/xterm", O_RDONLY) = 3
open(".", O_RDONLY) = 3
stat("/etc/virc", {st_mode=S_IFREG|0644, st_size=1962, ...}) = 0
open("/etc/virc", O_RDONLY) = 3
open(".", O_RDONLY) = 3
stat("/home/admin2/.vimrc", 0x7fff4b7dd460) = -1 ENOENT (No such file or directory)
open("/home/admin2/.vimrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/admin2/_vimrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open(".", O_RDONLY) = 3
stat("/home/admin2/.exrc", 0x7fff4b7dd460) = -1 ENOENT (No such file or directory)
open("/home/admin2/.exrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/nsswitch.conf", O_RDONLY) = 3
open("/etc/", O_RDONLY) = 3
open("/lib64/", O_RDONLY) = 3
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[admin2@centos6 ~]$

It is my belief that the true mastery of a skill is to take from the specific to the general and back to the specific again.  So these are specific examples of using these tools, which I hope gives you an insight into the general principles so you can apply them to your specific problems.


Sleep command for a random amount of time

Most Unix/Linux users will be familiar the sleep command which you can use to delay for the specified number of seconds.  A few years ago I had need for a sleep command which would sleep for a random amount of time, so I came up with some code which as it happens is a nice example of interrupt handling in Unix/Linux.

The code I’ve written to do this random sleep can be freely used but I would like you to leave the reference to this site,  It has no external dependencies, so this should compile it:

cc -Wall -O randsleep.c -o randsleep

The two options I use are to warn about any dodgy coding (-Wall) and -O for optimisation (not really an issue here!).

It is used like this:

randsleep [-v] <lower limit> <upper limit>

The -v option will echo the random time it has calculated, e.g:

vger:~(217)+>- randsleep -v 2 7
Sleeping for 3.49 seconds
vger:~(218)+>- randsleep -v 2 7
Sleeping for 5.30 seconds


Tracking what process is generating network traffic

Every now and again I’ve had situations where I see Internet traffic which doesn’t correspond to any obvious activity I’ve initiated.  Well on Linux and Unix (and hence MacOS) it’s easy enough to track down the offending process.

(The commands tcpdump and lsof must be run as root or under sudo.)  Doing a tcpdump (snoop in Solaris) I could see the traffic and it was obvious which was the source/destintation causing most of the traffic:

mistral:~(1)+>- tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type EN10MB (Ethernet), capture size 65535 bytes
23:01:28.320299 IP > Flags [.], seq 2435402503:2435403911, ack 3052965256, win 4692, options [nop,nop,TS val 1326515797 ecr 324351108], length 1408
23:01:28.320923 IP > Flags [.], seq 1408:2816, ack 1, win 4692, options [nop,nop,TS val 1326515797 ecr 324351108], length 1408
23:01:28.321015 IP > Flags [.], ack 2816, win 8104, options [nop,nop,TS val 324351161 ecr 1326515797], length 0
23:01:28.333327 IP > Flags [.], seq 2816:4224, ack 1, win 4692, options [nop,nop,TS val 1326515804 ecr 324351108], length 1408
23:01:32.717541 IP > Flags [.], ack 767360, win 8104, options [nop,nop,TS val 324355362 ecr 1326520067], length 0
23:01:32.718204 IP > Flags [.], seq 767360:768768, ack 1, win 4692, options [nop,nop,TS val 1326520068 ecr 324355189], length 1408
23:01:32.718325 IP > Flags [.], ack 768768, win 8192, options [nop,nop,TS val 324355363 ecr 1326520068], length 0
23:01:32.875095 IP > Flags [.], seq 768768:770176, ack 1, win 4692, options [nop,nop,TS val 1326520238 ecr 324355191], length 1408
970 packets captured
1004 packets received by filter
0 packets dropped by kernel

In this example my local address is and it can also be seen that the port of process which is talking to the Internet is 49977.  I used lsof to find out what process had that port open, and then used ps to show me the details of the process:

mistral:-(2)+>- lsof -i TCP:49977
SoftwareU 11684 _softwareupdate 15u IPv4 0x39d5096a0cd02d61 0t0 TCP> (ESTABLISHED)
mistral:~(3)+>- ps -lfp 11684
 200 11684 1 4004 0 63 0 3672308 135516 - Ss 0 ?? 0:28.45 /System/Library/ 10:58pm
mistral:~(4)+>- ps -fp 11684
 200 11684 1 0 10:58pm ?? 0:29.15 /System/Library/CoreServices/Software -Check YES
mistral:~(5)+>- id
uid=0(root) gid=0(wheel) groups=0(wheel),404(,401(,1(daemon),2(kmem),3(sys),4(tty),5(operator),8(procview),9(procmod),12(everyone),20(staff),29(certusers),33(_appstore),61(localaccounts),80(admin),98(_lpadmin),100(_lpoperator),204(_developer),403(

And that’s it, the process was the automatic OS updater.

Finding a severe resource hog on your server

Have you ever experienced the situation where a server which becomes bafflingly unresponsive to the point when even your monitoring services are failing to report.  Then to have the server start responding, sometimes after a reboot?  You suspect it was a process that went berserk, but which one?  The monitoring software itself was out of action during the crisis, so it can’t tell you anything.

I have come across this quite a few times in my career, most recently in clustered database servers.  Sometimes these outages can be so severe that even very lightweight monitoring software like sar and Xymon (formerly known as Hobbit and Big Brother) can be taken out of action.  In the past I have resorted to using a loop to save the output of the command top to a file every 30 seconds, hoping to catch an event, but that is ugly for so many reasons.

SQLite to the rescue

SQLite is a serverless implementation of SQL.  It is a tiny binary—less than 50 kbyte on Mac OS X—and it stores its data in a file.  The way SQLite helps us here is that it allows us to store the data and analyse and/or trim it using all sorts of criteria.

So first create the database file:

sqlite3 storetops.sqlite3

That will give you a sqlite> prompt at which you can type this:

CREATE TABLE savetop (
HostName TEXT, -- So we know which host this data came from
DateAndTime TEXT,
Load REAL,
TopProcesses TEXT
) ;

Type .quit to exit.  (SQL professionals will balk at my not using the DATETIME type, but SQLite contains no such data type.)  You next want to write a simple script which will write to that at intervals of, say, 30 seconds.  I’ve written a basic performance-monitoring script which you can use on Linux or MacOS X.

Here is a sample of data I collected on my Mac:

sqlite> SELECT HostName,DateAndTime,Load FROM savetop ;
vger|2012-09-03 19:51:30|0.93
vger|2012-09-03 19:51:43|1.3
vger|2012-09-03 19:51:55|1.17
vger|2012-09-03 19:52:08|1.79
vger|2012-09-03 19:52:20|1.66
vger|2012-09-03 19:52:33|1.44
vger|2012-09-03 19:52:45|1.22
vger|2012-09-03 19:52:57|1.34
vger|2012-09-03 19:53:10|1.36
vger|2012-09-03 19:53:22|1.23
vger|2012-09-03 22:10:11|1.06
vger|2012-09-03 22:10:24|1.59
vger|2012-09-03 22:10:36|1.46
vger|2012-09-03 22:10:49|1.24
vger|2012-09-03 22:11:01|1.2
vger|2012-09-03 22:11:13|1.01
vger|2012-09-03 22:11:26|1.38
vger|2012-09-03 22:11:38|1.48
vger|2012-09-03 22:11:51|1.33
vger|2012-09-03 22:12:03|1.71

If there is an item of interest I can examine its top processes:

SELECT TopProcesses FROM savetop WHERE DateAndTime='2012-09-03 22:10:36' ;

Suppose I want to count the number of events where the load was greater than 1.4:

sqlite> SELECT COUNT(HostName) FROM savetop WHERE Load>1.4 ;

Of course this is artificial—normally we are looking at much higher high loads—however it illustrates the advantage of this approach.  If you are monitoring over a long period it’s likely the SQLite file will get very large but that is also very easy to remedy:

sqlite> SELECT COUNT(HostName) FROM savetop ;
sqlite> DELETE FROM savetop WHERE Load<1.4 ;
sqlite> SELECT COUNT(HostName) FROM savetop ;

Use of load rather than CPU usage

While this is beyond the immediate scope of this post, some of you might be wondering why I am using load instead of CPU usage.  Most operating systems (including Windows) have built-in strategies for handing CPU hogs, for instance in Linux and Unix the process priority for a CPU hog is automatically downgraded.  The result is a server whose CPU is flat out can be still quite usable.  Load—which represents the number of processes waiting to execute—is a much more reliable indicator of a server in distress.  High load can be caused by too many processes trying to access the CPUs and/or I/O delays, which in turn can be caused by busy or slow disks.  On Solaris and Windows you can determine if a server is CPU-bound by checking the percentage of time runnable processes (those not waiting for I/O or sleeping for other reasons) are waiting for CPUs; if this is higher than, say, 5% then the server is CPU-bound.

Linux automatic (Kickstart) install from a DVD

This assumes you are au fait with the basic workings of Red Hat Kickstart.

Consider the situation where you want to install Linux using your standard configuration in a sales office, but the people there have only basic Windows skills.  The easiest way to do this is with a Kickstart DVD.  This will allow you to do the standard install by sending the DVD to the sales office, tell them to boot off it and then type in one simple command.  You could also use that DVD for getting a new site up and running, or in a disaster-recovery situation.

Copying the CDs/DVD

If you are dealing with Red Hat Enterprise Linux (RHEL) 4 or earlier then you might have them on multiple CDROMs or CDROM ISO files, in which case you’ll have to combine them before continuing.  Considering the example of multiple ISO files:

mount -o loop,ro CDROM1.ISO /mnt ; (cd /mnt ; tar cf - .) | (cd /RHELcombined ; tar xf -) ; umount /mnt
mount -o loop,ro CDROM2.ISO /mnt ; (cd /mnt ; tar cf - .) | (cd /RHELcombined ; tar xf -); umount /mnt

Do that for all of the CDROMs.  When you finish copying the DVD or CDROMs the destination directory (/RHELcombined in the example above) will contain a file called .discinfo, which will look something like this:

Red Hat Enterprise Linux 3

That will be the .discinfo from the last disc you extracted, the 4 in the example above refers to the disc number; you have to replace this with a comma-separated list of all the discs, so 1,2,3,4 in this example.

Adding your own Kickstart configuration file

Next you need to adapt your existing network-based Kickstart, if any, by removing the line referencing the installation media, usually it will be something like this:

nfs --server= --dir=/Kickstart/RHEL3

Replace that line with:


Another field you might want to change is the rootpw.  Normally it would look something like this:

rootpw --iscrypted $1$iCRdXskv$nbxsMQw0BUGi6VgEhaIIN.

For a DVD install it might make sense to remove the ‘–iscrypted’ and put in a clear-text password.  Check the disk space specifications in the Kickstart configuration file to make sure they don’t exceed the space on the server you intend installing.  Copy that file into the root with the name ks.cfg.  Unfortunately it has to be that name, which in turns means you can have only one Kickstart configuration per DVD.

Creating the ISO file

The command line I have below needs two variables:

  • $DstISOfile – the name of the destination file
  • $SrcDir – the name of the directory into which you extracted the CDROMs

Set those two before running the command:

cd $SrcDir ; time mkisofs -J -R -T -o $DstISOfile -b isolinux/isolinux.bin -c isolinux/ -no-emul-boot -boot-load-size 8 -boot-info-table . > /tmp/standardoutput 2>/tmp/erroroutput

A few notes on the command line:

  • I find it’s handy to time how long the command runs.
  • The command’s output is very verbose so I have redirected its stdout to one file and stderr to another (that syntax won’t work in csh/tcsh)
  • It should take about ten minutes to run

Testing and burning

I would recommend you test the ISO image using a virtual machine.  Boot off the ISO image and it will quickly respond with a ‘boot:’ prompt, type in this:

linux ks=cdrom

This will launch the installer which will first wipe out the disk, create the partitions and logical volumes—if any—and finally install the requested packages.  Once you’re satisfied that it works as planned, you can burn ISO image as you would burn an DVD.  You should then find a test server to test the install.  Be warned, there will be no prompts after the one above, your server will be wiped out in less than a minute.  Depending on the speed of the server and its DVD drive Linux will install in about fifteen minutes.