April 2009

Clever software names

Many free software projects have, among other characteristics, punny project names. Some notable examples:

GNU CSSC (Compatibly Stupid Source Control), a clone of the Unix source control system SCCS (Source Code Control System).
GPG (GNU Privacy Guard), a clone of Phil Zimmermann's encryption program PGP (Pretty Good Privacy).

I suspect that people would think twice about choosing those names if those products were invented today. Naming a similar product (a clone, even) with a name that is intentionally confusingly similar seems, indeed, like the perfect way to get yourself slapped with a trademark lawsuit. (After all, it happened to Lindows.)

At least GNU is safe. No reasonable person would think that GNU's Not Unix was the same as Unix.

Microsoft Excel is a creation of staggering boneheadedness

What I learned about Microsoft Excel today makes me not really want to use it again for anything that is even moderately important.

Did you know that there is no easy way to import data into Excel with any fidelity?

Just write to a delimited file, you say. CSV data, TSV data, whatever... they are all very easy to produce programmatically. And they will all get screwed up when you open them.

You see, when you open such a file, one of two things will happen, both of them bad.

First, Excel may open the file without complaint. I think this happens when you double-click on the file in Windows. Excel will then apply heuristics to set the format for each cell appropriately. These heuristics are not 100% reliable, for which I can hardly fault Excel. As one example, a cell containing a list of numbers 50001,50002,50003,50014,50018 is interpreted as a single large integer, which Excel converts to the floating-point number 5.00e24.

However, and here is the problem, the conversions are silent (no warning is given upon opening a CSV file) and lossy (above, some of the precision needed to construct the original sequence is lost), and they cannot be reverted by any magic incantations within Excel after you've opened the spreadsheet.
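To make the lossiness concrete, here is a minimal Python sketch of the same kind of conversion. It assumes the commas are treated as thousands separators and that the result is stored as a double-precision float; it does not reproduce Excel's actual import logic, and `as_number` is a made-up helper name used only for illustration.

```python
# Illustration only: Python's float stands in for a spreadsheet's number storage.
# as_number is a hypothetical helper, not anything Excel exposes.

def as_number(text):
    """Strip commas as if they were thousands separators, then parse as a double."""
    return float(text.replace(",", ""))

cell = "50001,50002,50003,50014,50018"
digits = cell.replace(",", "")    # the 25-digit integer the cell is mistaken for
value = as_number(cell)

print(f"{value:.2e}")             # scientific notation, roughly what gets displayed
print(str(int(value)) == digits)  # False: the low-order digits did not survive
```

A double carries only about 15 to 17 significant decimal digits, so there is no way to get the original five numbers back from the stored value.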

That's right. When opening CSV/TSV files, Microsoft Excel's default behavior is to silently corrupt your data.

I'm running into this problem at work, and I'm just glad that someone noticed what was wrong, because this is really pernicious. You might spot-check your spreadsheet, think that everything is fine, and not notice that your data is corrupted starting in row 8000. This is actually happening in biomedical research. Gene names, identifiers, you name it: all are silently and irreparably converted to numbers and dates in Excel. (And those researchers don't believe there's any good way to deal with this either.)
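Since spot-checking by eye is unreliable, one option is to scan a file for fields that look like numbers or dates before anyone opens it in Excel. The following is a rough Python sketch of such a check. The patterns are my own guesses at what a spreadsheet might reinterpret, not Excel's real heuristics, and `looks_risky` and the sample values are purely illustrative.

```python
import re

# Assumption: these patterns approximate "things a spreadsheet may reinterpret".
# They are NOT Excel's actual rules, which are not documented here.
MONTHS = (
    "january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|"
    "august|aug|september|sept|sep|october|oct|november|nov|december|dec"
)
DATE_LIKE = re.compile(rf"^(?:{MONTHS})[-/ ]?\d{{1,2}}$", re.IGNORECASE)
NUMBER_LIKE = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:e[+-]?\d+)?$", re.IGNORECASE)

def looks_risky(field):
    """Return True if the field resembles a number or a month-plus-day date."""
    s = field.strip()
    return bool(NUMBER_LIKE.match(s) or DATE_LIKE.match(s))

# Hypothetical sample values: two gene-style names, an identifier that happens
# to read as scientific notation, and a zero-padded ID.
for name in ["SEPT2", "TP53", "2310009E13", "00123"]:
    print(name, looks_risky(name))
```

A check like this only tells you where to look. It does nothing to stop the conversion itself.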

For the record, OpenOffice gets this right, by preserving data verbatim by default when importing. That is positively brilliant in comparison.

Now, the second thing that might happen when you open your file in Excel is that you invoke Excel's "Import Wizard." I think this happens when you choose "File", then "Open", and select a delimited file. Excel will dutifully ask you how the columns are to be delimited and what format to use for each column.

And, although you can select "Text" format here (meaning, preserve the input verbatim), the default format is the "do what Excel thinks is best" option. Once again, Microsoft Excel's default behavior is to corrupt your data. Sure, in this case at least you can click on each column you think might have a problem and select "Text" format for it. But if you are a human, and you're opening these kinds of files all the time, and you trust yourself to do this consistently and correctly, you probably deserve what's coming to you. Humans are not good at performing repetitive tasks. That's supposed to be the computer's job.

What are we expected to do, write out Excel's XML-based file format directly, now that it's nominally "open" and "documented"? That seems really heavyweight. It is kind of a remarkable oversight that it is so difficult to massage arbitrary data into any format that can be reliably read by Excel.

Extracting select pages from a PDF

Ever needed to make a new PDF containing a subset of the pages of an existing PDF?

The pdfnup tool is nominally for printing PDFs n-up (n logical pages to a sheet), but it can be told to cut and paste pages into a new PDF, like so:

pdfnup original.pdf --nup 1x1 --pages 1,3,5,7,21-25 --outfile subset.pdf

On Ubuntu, you can install pdfnup with a simple sudo aptitude install pdfjam.
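In case the --pages syntax above is unclear, here is a small Python sketch that expands a page list of that shape into the individual page numbers it selects. It handles only the two forms used in the example (single pages and closed ranges); pdfnup may accept other forms that this sketch ignores, and `expand_pages` is a name I made up, not part of pdfjam.

```python
def expand_pages(spec):
    """Expand a spec like "1,3,21-25" into a list of page numbers.

    Assumption: only comma-separated single pages and closed "a-b" ranges.
    """
    pages = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            pages.extend(range(int(first), int(last) + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages

print(expand_pages("1,3,5,7,21-25"))
# [1, 3, 5, 7, 21, 22, 23, 24, 25]
```

So the command above produces a nine-page subset.pdf.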

Toto, I've a feeling we're not in 32-bit land anymore

I got a new computer and have had some time to put it through its paces. It's a Dell Studio XPS, with a Core i7 920, 6GB of RAM, and a 24" display. All things considered it was a pretty good deal (ordered in January, $1060 shipped). I later upgraded to 2TB of disk.

This thing is blazing. And building with make -j8 makes me happy inside.

As far as I can tell, everything in it works out of the box with all free software on Ubuntu GNU/Linux Jaunty Jackalope (9.04) x86_64. (Even the ATI Mobility Radeon HD 3450.)

The display, a Dell S2409W, is nothing to sneeze at, either.

There is space inside the case to mount an extra hard drive. It goes in vertically, which is nice because you don't have to wrestle with all the cables and wedge it past the RAM to get it in. However, there is only space for one additional hard drive.

Power consumption is much better than my old desktop's. The machine idles at 97W. This goes up to maybe 170W at high load.

My only complaint is the noise. The fans go on like a leafblower when you're pegging the CPU, but I don't really care about that part. The problem for me is more the baseline noise. The computer isn't loud, per se... but it isn't quiet either. It's in my room and I put on earplugs when I go to sleep. Take that with a grain of salt, though, because I'm a person who has trouble falling asleep in the presence of a wall clock.