October, 2012 | Linux Tech

Missing disk space Linux/Unix: when df disagrees with du -s

A common situation admins find themselves in is having to clear down disk space quickly. For instance, say /u01 is filling up. The Oracle admin knows that the database will simply stop if he doesn’t take action quickly. With the judicious use of du -s he finds some large directories and quickly deletes a few temporary files he knows the database doesn’t immediately need. He runs ‘df -h’ and finds that it hasn’t made any difference! He then runs ‘du -s’ again and it shows the space has been freed up. He doesn’t know it, but he has deleted at least one open file, and its space won’t be freed until the process holding it open closes it or exits. What he should have done is this:

echo "" > offendingfile

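where offendingfile is the huge file. This truncates the file in place (leaving just a single newline), so the space is released immediately even though a process still has the file open.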
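
A minimal sketch of the same idea, assuming a POSIX shell. The helper name truncate_in_place and the example path are hypothetical, not from the original. It empties the file without unlinking it, so any process that has it open keeps a valid descriptor and the blocks go back to the file system straight away:

```shell
# Empty a file in place without deleting it. The inode and any open
# descriptors stay valid; only the data blocks are released.
truncate_in_place() {
    # ':' is the shell no-op; the redirection alone truncates the file.
    : > "$1"
}

# Example (hypothetical path):
#   truncate_in_place /u01/some/dir/offendingfile
```

Unlike the echo form, this leaves a zero-byte file rather than a one-byte one, which makes no practical difference when the aim is to free space.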

In the case of the Oracle admin it’s likely his only choice is to restart the database. Consider a more general case where a Linux/Unix admin has deleted files but has lost track of where the files were and what might be using them. Or one admin deleted the files and scarpered, leaving the other to clean up the mess. Either way he is left with the bigger challenge of finding which process is holding which files open.

A starting point: lsof

The lsof command can be a good starting point. However, you are still looking for a needle in a (smaller) haystack, so you will have to do some further filtering. On CentOS 6 lsof marks files which have been deleted, but it seems to throw up quite a few false positives.

To illustrate the problem of open files I have created some C code which will create a big file and sleep for 1,000 seconds. Compiling and running the binary gives me a 10 Mbyte file:

/var/tmp/SampleBigFile
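
The C program itself is not reproduced here. As a rough stand-in, the following shell sketch does much the same thing: it writes a file of a given size and then holds it open while sleeping. The function name create_open_file and its parameters are assumptions made for illustration, and it is not the author’s code:

```shell
# Stand-in for the C test program: create a file of N Mbytes, then keep
# it open for a while so it can be deleted while still in use.
create_open_file() {
    path="$1"
    mbytes="${2:-10}"      # size in Mbytes (default 10)
    seconds="${3:-1000}"   # how long to hold the file open (default 1000)
    (
        exec 3> "$path" || exit 1     # open the file on descriptor 3
        dd if=/dev/zero bs=1024 count=$((mbytes * 1024)) >&3 2>/dev/null
        sleep "$seconds"              # descriptor 3 stays open during the sleep
    )
}

# Example: run it in the background, then experiment with rm, df and du.
#   create_open_file /var/tmp/SampleBigFile 10 1000 &
```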

If I then remove the file, I have created the situation described above. On CentOS 6 I could run:

lsof | fgrep '(deleted)'

but that produces 24 results (among which are files that haven’t been deleted, like /usr/bin/gnome-screensaver), so it would be a good idea to narrow the search. For instance, it’s likely in this situation that there is just one file system that is full, so you could grep for its mount point. That does it nicely in our example:

[root@centos6 ~]# lsof | fgrep '(deleted)' | fgrep /var
createope 11012 admin 3u REG 253,3 10485761 693 /var/tmp/SampleBigFile (deleted)
[root@centos6 ~]#
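
If lsof is not installed, Linux exposes the same information under /proc, where each open descriptor is a symbolic link whose target ends in " (deleted)" once the file has been unlinked. The following is a sketch under that assumption (Linux only; it needs root to see other users’ processes). The function name list_deleted_fds is hypothetical:

```shell
# Print "PID FD TARGET" for every open descriptor whose file has been
# deleted. The optional argument is the proc mount point (default /proc).
list_deleted_fds() {
    proc_root="${1:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue                # skip anything that is not a link
        target=$(readlink "$fd") || continue    # the process may have gone away
        case "$target" in
            *' (deleted)')
                pid=${fd#"$proc_root"/}         # strip the leading proc path
                pid=${pid%%/*}                  # keep only the PID component
                printf '%s %s %s\n' "$pid" "${fd##*/}" "$target"
                ;;
        esac
    done
}

# Example: list_deleted_fds | fgrep /var
```

Once the PID and descriptor number are known, on Linux the space can usually be reclaimed without stopping the process by truncating through the descriptor, for example `: > /proc/11012/fd/3`. Whether the owning program copes with that is something to check first.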

In MacOS (Darwin) there is no ‘(deleted)’ label so go straight for checking for /var:

vger:~ root# lsof | egrep 'REG.*/var/tmp'
mysqld    346 _mysql 4u  REG 14,18        0 6217706 /private/var/tmp/ibu4Nw9X
mysqld    346 _mysql 5u  REG 14,18        0 6217707 /private/var/tmp/ib6jCfyT
mysqld    346 _mysql 6u  REG 14,18        0 6217708 /private/var/tmp/ibu9Zqxb
mysqld    346 _mysql 7u  REG 14,18        0 6217709 /private/var/tmp/iboukiVq
mysqld    346 _mysql 11u REG 14,18        0 6217710 /private/var/tmp/ibLRW39J
createope 42775 admin 3u REG 14,18 10485761 6308941 /private/var/tmp/SampleBigFile
vger:~ root#

(REG indicates a regular file.) While our big file is clearly identifiable here, if it wasn’t you could try something like sort -k7 to sort on file size.
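
A small sketch of that sort, assuming lsof’s default nine-column layout in which SIZE/OFF is the seventh field (rows with an empty column would shift the fields, so treat the result as a hint rather than gospel). The function name largest_first is hypothetical:

```shell
# Sort lsof output (read from stdin) by the SIZE/OFF column, largest first.
# Assumes the default layout where the size is whitespace-separated field 7.
largest_first() {
    sort -k7,7nr
}

# Example: lsof | egrep 'REG.*/var/tmp' | largest_first | head
```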

When all debugging routes have failed: network scans and/or code tracing

In the world of car, bike and motorbike mechanics there is a versatile tool which is something of a last resort: the vice-grips (sometimes referred to as the bodger’s tool, because of people’s tendency to shear bolts with them). In the world of operating systems there are two tools I have found to be like vice-grips, but not potentially harmful: network scanning and code tracing.

Network scanning

Most operating systems have a tool for capturing and inspecting network traffic:

Linux: tcpdump, Wireshark
Darwin (MacOS): tcpdump, Wireshark
Solaris: snoop, tcpdump, Wireshark
Windows: Wireshark (there is also a version of tcpdump for Windows)

So, why is network scanning useful? Well, consider the situation where you have installed the monitoring software Xymon. The server is already working and most of the clients are responding, but the server isn’t receiving data from one of the clients. Xymon uses port 1984, so you can watch the traffic going to and from the server on that port:

[root@host1 etc]# tcpdump port 1984
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
15:08:00.857457 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: S 1387852790:1387852790(0) win 5840 <mss 1460,sackOK,timestamp 119364978 0,nop,wscale 2>
15:08:00.864380 IP xymonserver.linuxtech.ie.1984 > host2.linuxtech.ie.32821: S 3491816971:3491816971(0) ack 1387852791 win 5792 <mss 1460,sackOK,timestamp 8108268 119364978,nop,wscale 0>
15:08:00.864553 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: . ack 1 win 1460 <nop,nop,timestamp 119364993 8108268>
15:08:00.865187 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: . 1:1449(1448) ack 1 win 1460 <nop,nop,timestamp 119364993 8108268>
15:08:00.865419 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: . 1449:2897(1448) ack 1 win 1460 <nop,nop,timestamp 119364994 8108268>
15:08:00.867342 IP xymonserver.linuxtech.ie.1984 > host2.linuxtech.ie.32821: . ack 1449 win 8688 <nop,nop,timestamp 8108268 119364993>
15:08:00.867486 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: P 2897:4345(1448) ack 1 win 1460 <nop,nop,timestamp 119364996 8108268>
15:08:00.867684 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: . 4345:5793(1448) ack 1 win 1460 <nop,nop,timestamp 119364996 8108268>
15:08:00.868361 IP xymonserver.linuxtech.ie.1984 > host2.linuxtech.ie.32821: . ack 2897 win 11584 <nop,nop,timestamp 8108268 119364994>
15:08:00.869032 IP host2.linuxtech.ie.32821 > xymonserver.linuxtech.ie.1984: . 5793:7241(1448) ack 1 win 1460 <nop,nop,timestamp 119364997 8108268>

So in this example the traffic is leaving host1 for the Xymon server’s port and the server is acknowledging it, so Xymon is clearly receiving the data. What’s wrong is visible in the output too: DNS knows this host as host2.linuxtech.ie, not host1, so Xymon doesn’t realise it’s receiving data for host1. There are a few solutions; for example, you can configure host1 to explicitly tell Xymon that it is host1.
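
When the capture is long, it can help to boil it down to who is talking to whom. The following is a sketch that assumes tcpdump’s IPv4 text format as shown in the capture (TIME IP SRC.PORT > DST.PORT: ...). The function name summarise_talkers is hypothetical:

```shell
# Summarise tcpdump text output (read from stdin) as
# "packet-count source > destination", busiest pair first.
summarise_talkers() {
    awk '$2 == "IP" && $4 == ">" {
             src = $3; dst = $5
             sub(/:$/, "", dst)          # drop the trailing colon
             sub(/\.[^.]*$/, "", src)    # drop the port from the source
             sub(/\.[^.]*$/, "", dst)    # drop the port from the destination
             count[src " > " dst]++
         }
         END { for (pair in count) print count[pair], pair }' |
    sort -rn
}

# Example: tcpdump -c 100 port 1984 | summarise_talkers
```

Adding -n to the tcpdump command shows addresses instead of names, which is a quick way to tell whether a surprising hostname is coming from DNS rather than from the client itself.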

Another example was when I was trying to get some commercial software working behind a firewall, where the DNS servers were locked down to resolve only the addresses we allowed them to. The documentation said that it would need to be able to resolve, say, swcheck.sweet.ie, but it still wasn’t working. So I gave it just one DNS server and watched what addresses it asked for. Sure enough it was asking for swcheck.sweet.ie, but also for, say, dwnld.sweet.ie. So I needed to make sure that was added to the list of addresses it could resolve.
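
Watching for the names by eye works, but a filter makes it harder to miss one. This sketch assumes tcpdump prints each DNS question as the query type followed by a question mark and then the name (for example "A? swcheck.sweet.ie."). The exact layout can vary between tcpdump versions, so check it against your own output. The function name queried_names is hypothetical:

```shell
# List the distinct names asked for in tcpdump's text output of DNS
# traffic (read from stdin), in the order they were first seen.
queried_names() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[A-Z][A-Za-z0-9]*\?$/) {   # e.g. A? AAAA? PTR?
                name = $(i + 1)
                sub(/\.$/, "", name)              # strip the trailing root dot
                if (!seen[name]++) print name
            }
    }'
}

# Example: tcpdump -n -l port 53 | queried_names
```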

Another nice thing about tcpdump in particular is its data can be saved to a file which can be imported into Wireshark on another server. This is very handy if you have a sensitive host where you can’t run the GUI of Wireshark.
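
A minimal sketch of saving a capture, using tcpdump’s -w (write to file) and -r (read from file) options. The function name capture_port, the TCPDUMP override variable and the file name are assumptions made for illustration:

```shell
# Capture traffic for one port to a file for later analysis.
# TCPDUMP is a hypothetical override for the location of the binary.
capture_port() {
    iface="$1"; port="$2"; outfile="$3"
    # -n: no name resolution, -i: interface, -w: write raw packets to a file
    "${TCPDUMP:-tcpdump}" -n -i "$iface" -w "$outfile" port "$port"
}

# Example (run as root, stop with Ctrl-C):
#   capture_port eth0 1984 /tmp/xymon.pcap
# Read it back with tcpdump, or copy the file elsewhere and open it in Wireshark:
#   tcpdump -n -r /tmp/xymon.pcap
```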

There’s a lot to this subject but I hope this helps.

Code tracing

When I say code tracing, I mean tracing system and library calls. Most operating systems have a way to do this:

Linux: strace, ltrace
Darwin: dtruss, dtrace (both require root/sudo)
Solaris: truss, dtrace
Windows: (none that I can find)

In my opinion Linux has the best implementation of code tracing. (Darwin/FreeBSD/Solaris’s DTrace and Linux’s SystemTap are exceedingly powerful, but beyond the scope of this post.) Suppose you want to see what environment variables a program is using:

[admin2@centos6 ~]$ ltrace -e getenv -o /tmp/tmp.adm2.ltrace vi
[admin2@centos6 ~]$ ls -l /tmp/tmp.adm2.ltrace
-rw-rw-r--. 1 admin2 admin2 1777 Oct 2 05:08 /tmp/tmp.adm2.ltrace
[admin2@centos6 ~]$ vim /tmp/tmp.adm2.ltrace
[admin2@centos6 ~]$ cat /tmp/tmp.adm2.ltrace
(0, 0, 0, 0x7fcf69b6d918, 88) = 0x3b6ec21160
getenv("HOME") = "/home/admin2"
getenv("VIM_POSIX") = NULL
getenv("SHELL") = "/bin/bash"
getenv("TMPDIR") = NULL
getenv("TEMP") = NULL
getenv("TMP") = NULL
getenv("VIMRUNTIME") = NULL
getenv("VIM") = NULL
getenv("VIM") = NULL
getenv("VIMRUNTIME") = "/usr/share/vim/vim72"
getenv("VIM") = "/usr/share/vim"
getenv("TERM") = "xterm"
getenv("COLORFGBG") = NULL
getenv("VIMINIT") = NULL
getenv("HOME") = "/home/admin2"
getenv("EXINIT") = NULL
getenv("HOME") = "/home/admin2"
(0x3b6ec21160, 0, 0, 0x3b6ec21160, 0) = 140608
(0, 0, 0, 3, 0x963cf85) = 0x3b6ec21160
+++ exited (status 0) +++
[admin2@centos6 ~]$
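
The NULL results are often the interesting part, since they show which variables the program would have honoured had they been set. The following sketch pulls those out of an ltrace log in the format shown above. The function name unset_env_lookups is hypothetical:

```shell
# From an ltrace log (file arguments or stdin), list the environment
# variables the program asked for but which were not set (getenv
# returned NULL), in the order they were first looked up.
unset_env_lookups() {
    sed -n 's/^getenv("\([^"]*\)") *= *NULL *$/\1/p' "$@" |
    awk '!seen[$0]++'      # drop repeats, keep first-seen order
}

# Example: unset_env_lookups /tmp/tmp.adm2.ltrace
```

A variable can appear here and still be set later in the same run, as VIM and VIMRUNTIME are in the log above.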

Now suppose you have a program which is reading a configuration file from somewhere but you can’t figure out where. The best thing is to trace its open() (which will cover fopen() too), stat() and lstat() calls. stat and lstat check the existence, permissions etc. of a file without opening it. This example uses vi (even though the esteemed Meneer Bram Moolenaar has documented vim so extensively that this is a redundant example):

[admin2@centos6 ~]$ strace -e stat,lstat,open -o /tmp/tmp.adm2.strace vi
[admin2@centos6 ~]$ cat /tmp/tmp.adm2.strace
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libm.so.6", O_RDONLY) = 3
open("/lib64/libselinux.so.1", O_RDONLY) = 3
open("/lib64/libncurses.so.5", O_RDONLY) = 3
open("/lib64/libacl.so.1", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
open("/lib64/libtinfo.so.5", O_RDONLY) = 3
open("/lib64/libdl.so.2", O_RDONLY) = 3
open("/lib64/libattr.so.1", O_RDONLY) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
stat("/usr/share/vim/vim72", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/share/vim", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/admin2/.terminfo", 0x7fff4b7dbb00) = -1 ENOENT (No such file or directory)
stat("/etc/terminfo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/share/terminfo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/share/terminfo/x/xterm", O_RDONLY) = 3
open(".", O_RDONLY) = 3
stat("/etc/virc", {st_mode=S_IFREG|0644, st_size=1962, ...}) = 0
open("/etc/virc", O_RDONLY) = 3
open(".", O_RDONLY) = 3
stat("/home/admin2/.vimrc", 0x7fff4b7dd460) = -1 ENOENT (No such file or directory)
open("/home/admin2/.vimrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/admin2/_vimrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open(".", O_RDONLY) = 3
stat("/home/admin2/.exrc", 0x7fff4b7dd460) = -1 ENOENT (No such file or directory)
open("/home/admin2/.exrc", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/nsswitch.conf", O_RDONLY) = 3
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libnss_files.so.2", O_RDONLY) = 3
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[admin2@centos6 ~]$
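
The ENOENT lines show where the program looked and found nothing, which is usually the list of places a configuration file could be dropped. This sketch extracts them from a strace log in the format above. The function name missing_paths is hypothetical:

```shell
# From a strace log (file arguments or stdin), list each path the program
# tried that did not exist (ENOENT), in the order it tried them.
missing_paths() {
    awk '/= -1 ENOENT/ && match($0, /"[^"]*"/) {
             # the first quoted string on the line is the path argument
             path = substr($0, RSTART + 1, RLENGTH - 2)
             if (!seen[path]++) print path
         }' "$@"
}

# Example: missing_paths /tmp/tmp.adm2.strace
```

On distributions more recent than CentOS 6 the C library generally implements open() with the openat system call, so tracing open alone may show nothing. Adding openat to the -e list, or using strace’s file class (-e trace=file), is a reasonable thing to try there.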

It is my belief that the true mastery of a skill is to go from the specific to the general and back to the specific again. So these are specific examples of using these tools, which I hope give you an insight into the general principles so you can apply them to your specific problems.
