The Unix Pipe

With any Unix installation you get a whole bunch of small programs: grep, tar, and scores of others.

Here are some examples of combining these tools with the Unix pipe. Each combination really amounts to a one-line program. (The theme is IM log analysis.)
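
(If the pipe itself is unfamiliar: | feeds one program's standard output into the next program's standard input. For instance, ls | wc -l counts the entries in the current directory; the examples below are just longer chains of the same idea.)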

(Gaim saves IM logs in files named with the date and time.)

cd ~/.gaim/logs/aim/my_sn/; find . | egrep '2006-08-26' | xargs head

"Show me the beginning of every conversation I had yesterday."

(Within each account, Gaim stores the logs for each of your buddies in a directory named using the buddy's screen name.)

cd ~/.gaim/logs/aim/my_sn/; du -sk * | sort -gr | cut -f2 | head

"Print the screen names of the people I talk to the most, in descending order of total log size."

cd ~/.gaim/logs/aim/my_sn/; egrep -i -r "my_sn:.*linux" * | wc -l

"How many times have I mentioned 'linux' in conversation?" (Answer: 83.)

Here is some stuff I actually use on a regular basis:

wget -O - http://www.example.com/file.tar.gz | tar -xz

Download and decompress a file in one step. This way, I don't have to make a temporary directory in which to download the file, I don't have to remember where I put the file, and I don't have to delete it when I'm done.
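
A variation I sometimes use, sketched here with a hypothetical target directory (~/src must already exist): tar's -C flag extracts the archive somewhere other than the current directory.

wget -O - http://www.example.com/file.tar.gz | tar -xz -C ~/src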

ssh phil@remotehost 'cat ~/filelist | xargs tar -C ~/ -cz' > ./backup`date +%Y%m%d`.tar.gz

Make backups of selected files over the network from a remote host. This command reads a file I keep (named filelist) which contains a list of all the files/directories I want to back up, one per line.
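
For concreteness, filelist might contain entries like these (hypothetical paths, one per line, relative to my home directory since the command runs tar -C ~/):

Documents
Mail
.gaim/logs
projects/notes.txt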

For your edification, or if you're an insomniac, here are full explanations of what the above commands are doing:

  • find recursively lists every file in the log directory; egrep does the filtering, passing on only those lines (filenames) which contain that particular date; xargs constructs and executes the command "head file1 file2 ..." where file1, file2, etc. are the lines it gets from stdin (see the small sketch after this list); head, in turn, prints the beginning (first 10 lines) of each file argument.
  • du -sk prints each file and directory named, preceded by its disk usage in kilobytes; sort -gr sorts the rows by that first column (the disk usage) in decreasing numeric order; cut -f2 trims off the size so that only the names remain (the second column); and head limits the output to the first 10 lines.
  • egrep -i -r searches recursively (-r) and case-insensitively (-i) over all lines in all files contained in the directory; wc -l reads the matching lines and prints only a count of them.
  • wget -O - downloads the file and outputs to stdout instead of to a file; tar -xz extracts a .tar.gz from stdin to the current directory (the absence of -f FILE means use stdin instead of reading from a file).
  • The quoted command is run on the remote host: "cat ~/filelist | xargs tar -C ~/ -cz" constructs and executes the command "tar -C ~/ -cz file1 file2 ...", supplying to tar all the files I've listed in filelist (the -C ~/ makes tar interpret those paths relative to my home directory). This compresses all the named files and writes the archive to stdout, which ssh carries back over the network; locally, the output is redirected into a file named something like backup20060827.tar.gz. (The date command is executed first and outputs something like "20060827"; the backticks paste this string into the command line.)
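
Here is the small sketch promised above, showing how xargs builds a command line (the filenames are made up, and echo is substituted so the constructed command is printed rather than run):

printf '%s\n' a.txt b.txt | xargs echo head

This prints "head a.txt b.txt", which is exactly the command xargs would otherwise have executed.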

Bonus: In the last example, the stdout of the command running on the remote host is redirected straight into a file on the local host. ssh is in general capable of connecting pipes between programs on different hosts: it streams the data over the network (encrypted, of course), so the plumbing is transparent to the programs on either end!
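
The plumbing works in the other direction too. As a sketch (the directory name here is hypothetical), this archives a local directory and writes it straight to a file on the remote machine, with no intermediate copy on either side:

tar -C ~/ -cz Documents | ssh phil@remotehost 'cat > documents.tar.gz'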

Further reading: on a GNU system, typing info coreutils will bring up information about the base GNU tools, like cat, head, and more.