Tuesday, January 18, 2011

Mass scanning

It's been awhile since I had time to work on a weekend project. :(

This weekend I worked on something fairly practical: I have way too many physical documents strewn around. Searching through them to find stuff sucks. Organizing them sucks even more, because I'm too lazy to ever do it. And, of course, even paper that organized itself automatically would suck, because it's paper. I hate paper.

So, I resolved to scan everything, and then somehow organize it electronically.

Step 1: Obtain Scanner

Technically, I already had a scanner. But, it was a flatbed scanner which I would have to manually load one page at a time. Obviously, for this task I would need an automatic document feeder. And, of course, the scanner would have to work in Linux. So, I headed to Fry's to look at the selection with the SANE supported device list loaded up on my phone.

Unfortunately, when I got to Fry's, I discovered that there is a bewildering array of different scanners available and practically no documentation on the advantages and disadvantages of each. You'd think they'd list basic things like pages-per-minute or the capacity of the document feeder, but they don't. And since I inexplicably get no phone reception in Fry's, I really had no basis on which to make a decision.

After staring at things for a bit, I was approached by one of the weirdos that works there (I swear almost everyone who works at Fry's gives me the creeps). For some reason I decided to try asking his opinion.


The guy looked at the list on my phone and said "What do they have for Canon?". After looking down the list, he saw the Canon PIXMA MX860 was listed as being fully supported. He pointed out that the MX870 is now available, and is a very popular unit. 870 vs. 860 seemed like it ought to be a minor incremental revision, and therefore ought to use the same protocol, right? Being at a loss for what else to do, I decided to go with it. Dumb idea.

Things looked promising at first. Not only did Sane appear to have added explicit support for the MX870 in a recent Git revision, but Canon themselves appeared to offer official Linux drivers for the device. Great! Should be no problem, right?

First I tried using Canon's driver. It turns out, though, that Canon's driver requires that you use Canon's "ScanGear MP" software. This software is GUI-only and fairly painful to use. I really needed something scriptable. The software appeared to be an open source frontend on top of closed-source libraries, so presumably I could script it by editing the source, but I decided to try SANE instead since it already supports scripting.

Well, after compiling the latest SANE sources, I discovered that the MX870 isn't quite supported after all. It kind of works, but after scanning a stack of documents, the scanner tends to be left in a broken state at which point it needs to be power-cycled before it works again. I spent several hours tracing through the SANE code trying to find the problem to no avail: it appears that the protocol changed somehow. SANE implemented the protocol by reverse-engineering it, so there is no documentation, and the code is only guessing at a lot of points. Having no previous experience with SANE or this protocol, I really had no chance of getting anywhere.

OK, so, back to the Canon drivers. They are part-open-source, right? So I figured I could just replace the UI with a simple command-line frontend. Guess again. It turns out the engineers who wrote this code are completely and utterly incompetent. There is no separation between UI code and driver logic. The scan procedure pulls its parameters directly from the UI widgets. The code is littered with cryptically-named function calls, half of which are implemented in the closed-source libraries with no documentation. The only comments anywhere in the code were the kind that tell you what is already plainly obvious. You know, like:

/* Set the Foo param */

I gave up on trying to do anything with this code fairly quickly. But, while looking at it, I discovered something interesting: the package appeared to include a SANE backend!

Of course, since the package came with literally no documentation whatsoever (seriously, not even a README), I would never have known this functionality was present if I hadn't been digging through code. It turns out that the binary installer puts the library in the wrong location, hence SANE didn't notice it either. So, I went ahead and copied it to the right place!

And... nothing. When things go wrong, SANE is really poor at telling you what. It just continued to act like the driver didn't exist. After a great deal of poking around, I eventually realized that the driver was 32-bit, while SANE was 64-bit, thus dlopen() on the driver failed. But SANE didn't bother printing any sort of error message. Ugh.

So I compiled a 32-bit SANE and tried again. Still nothing. Turned out I had made a typo in the config that, again, was not reported by SANE even though it would have been easy to do so. Ugh. OK, try again. Nothing. strace showed that the driver was being opened, but it wasn't getting anywhere.

So I looked at the driver code again. This time I was looking at the SANE interface glue, which is also open source (but again, calls into closed-source libraries). I ended up fixing three or four different bugs just to get it to initialize correctly. I don't know how they managed to write the rest of the driver without the initialization working.

With all that done, finally, SANE could use the driver to scan images! Hooray! Except, not. I scanned one document, and ended up with a corrupted image that showed two small copies of the document side-by-side and then cut off in the middle.

Fuck it.

HP Officejet Pro 8500

I returned the printer to Fry's. They didn't give me the full price because I had opened the ink cartridges. Of course, the damned thing refused to boot up without ink cartridges, even though I just wanted to scan, so I had no choice but to open them. Ugh.

Anyway, this time I came prepared. The internets told me that the best bet for Linux printers and scanners is HP. And indeed, my previous printer/scanner was an HP and I was impressed by the quality of the Linux drivers. So, I looked at what HP models Fry's had and took the cheapest one with an automatic document feeder. That turned out to be the 8500. It was about twice the cost of the Canon but I really just wanted something that worked.

And work it did. As soon as the 20-minute first boot process finished (WTF?), the thing worked perfectly right away.

Step 2: Organize scans

The scanner can convert physical documents into electronic ones, but then how to I organize them? Carefully rename the files one-by-one and sort them into directories? Ugh. I probably have a couple thousand pages to go through. I need something that scales. Furthermore, ideally, the process of sorting the documents -- even just specifying which pages go together into a single document -- needs to be completely separate from the process of scanning them. I just want to shove piles of paper into my scanner and figure out what to do with them later.

As it turns out, a coworker of mine had the same thought some time ago, and wrote a little app called Scanning Cabinet to help him. It uploads pages as you scan them to an AppEngine app, where you can then go add metadata later.

The code is pretty rudimentary, so I had to make a number of tweaks. Perhaps the biggest one is that there is one piece of metadata that really needs to be specified at scan time: the location where I will put the pile of paper after scanning. I want to take each pile out of the scanner and put it directly into a folder with an arbitrary label, then keep track of the fact that all those documents can be found in that folder later if necessary. Brad's code has "physical location" as part of the metadata for a document, but it's something you specify with all the other metadata, long after you scanned the documents. At that point, the connection to physical paper is already long gone.

So, I modified the code to record the batch ID directly into the image files as comment tags. I also tweaked various things and fixed a couple bugs, and made the metadata form sticky so that if I am tagging several similar documents in a row I don't have to keep retyping the same stuff.

Does this scale?

I haven't started uploading en masse yet. However, I have some doubts about whether even this approach is scalable. Even with sticky form values, it takes at least 10 seconds to tag each document, often significantly longer. I think that would add up to a few solid days of work to go through my whole history.

Therefore, I'm thinking now that I need to find a way to hook some OCR into this system. But how far should I go? Is it enough to just make all the documents searchable based on OCR text, and then not bother organizing at all? Or would it be better to develop at OCR-assisted organization scheme?

This is starting to sound like a lot of work, and I have so many other projects I want to be working on.

For now, I think I will simply scan all my docs to images, and leave it at that. I can always feed those images into Scanning Cabinet later on, or maybe a better system will reveal itself. NeatWorks looks like everything I want, but unfortunately it seems like a highly proprietary system (and does not run on Linux anyway). I don't want to lock my documents up in software like that.


  1. I would do the OCR to search approach, search is just the best interface for everything. But then, I've been paperless for a long time, so I'd never have to do this. ;) How is NeatWorks locking you up? That link says they export to a ton of open formats, you can't begrudge them their proprietary algorithms just to get the OCR right.

  2. Hi,
    I can see this is a really old post (got there from your Slashdot story) but just wanted to say - scan and OCR works great, I've been doing it for 2 years now. My wife and I are doctors so we get loads of paper mail which all ends up in the shredder. I use a Fujitsu ScanSnap S1500 into DevonThink Pro Office on my Mac. It OCRs the text and makes PDF files which have the jpeg and the OCR'd text, so when you open the file you see the original image but you can select, cut and past text. It does so vaguely intelligent auto-categorisation based on the OCR results. Well worth putting the time in if you haven't done so already.

  3. Archivista http://www.archivista.ch/en/ is a Linux-based VM distro for OCRing and filing scanned documents.