Tuesday, April 15, 2008

Book scanning and _Ten Thousand Cents_

Mark Tomasko writes, in the April 13, 2008 e-Sylum, that he “hopes the '$150,000 digitizing machine' never shows up at the ANA library.” He feels that many institutions would destroy their originals if presented with a copy.

Mr. Tomasko is right to worry. Every bibliophile who wants a good horror show should read Nicholson Baker's Double Fold. It turns out that for the past thirty years libraries have been microfilming their holdings and throwing away the originals. (Occasionally they sell them instead.) The microfilms are often bad: misaligned, underexposed, or overexposed. They are black and white even when the originals are in color.

Digitizations such as Google's are a lot better in quality. The images you see on Google's web site are greatly reduced from the scans Google makes. (The scans are still filled with mis-scans, fold-out plates that aren't folded out, etc.) Don't blame 'digitization' for something that has been going on for decades.

I really doubt the ANA would throw away originals after scanning. It's the publicly funded but underfunded public libraries that want to clear out space for the things they think today's taxpayers want.

Numismatics, especially ancient numismatics, whose key works are scarce or rare and out of print, has a problem unless the material becomes available. Works on ancient coins also tend to be in foreign languages.

Visit Google Book Search and look at a book. The public domain books now have a 'View Plain Text' feature that provides access to the OCRed text. Currently it isn't very good, but suppose it got better. It's very handy to copy text, paste it into Google Language Tools, and translate it into any language.

It's kind of painful to do this, and currently the translations aren't very good. Google could easily make this much easier by linking the two existing services together. They already do this for the Google Talk chat service.

What about quality? It's poor now, especially for numismatic texts. The reason it's poor is that Google uses a statistical translation algorithm trained on a small set of bilingual newspaper articles. It's going to get better. To understand how this works, check out Theorizing from Data, an hour-long video presentation by Peter Norvig.

I predict that Google will develop software to let the public help with fixing OCR and translation glitches, if they haven't already. It's possible that folks will do this for free, but Norvig suggests they could be paid. There is already software that lets people, especially cheap offshore labor, work on large collaborative projects. This brings us to Ten Thousand Cents.

Ten Thousand Cents is a ‘conceptual art’ counterfeiting experiment. Artists Aaron Koblin and Takashi Kawashima used Amazon.com's “Mechanical Turk” web site to hire people to draw 10,000 pictures from what appeared to be abstract photographs.

The drawings, when assembled, reveal a US $100 bill. The artists are selling prints of the art piece for $100. The ‘prints’ are one-sided counterfeit hand-drawn $100 bills. The web site has more details and a two-minute video.
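The split-and-reassemble idea behind the piece can be sketched in a few lines of Python. This is only an illustration, not the artists' actual software: the function names, the tiny 4x6 grid of numbers standing in for the bill, and the 2x2 tile size are all assumptions made up for the example. Each tile would be handed to a different worker, who sees only an abstract fragment; putting the copies back in order reveals the whole picture.

```python
def split_into_tiles(image, tile_h, tile_w):
    """Cut a 2D grid (a list of rows) into tiles, returned in row-major order."""
    rows, cols = len(image), len(image[0])
    tiles = []
    for top in range(0, rows, tile_h):
        for left in range(0, cols, tile_w):
            tile = [row[left:left + tile_w] for row in image[top:top + tile_h]]
            tiles.append(tile)
    return tiles


def assemble(tiles, tiles_per_row):
    """Stitch row-major tiles back into a single 2D grid."""
    image = []
    for start in range(0, len(tiles), tiles_per_row):
        band = tiles[start:start + tiles_per_row]  # one horizontal strip of tiles
        for r in range(len(band[0])):
            row = []
            for tile in band:
                row.extend(tile[r])
            image.append(row)
    return image


# Toy stand-in for the bill: a 4x6 grid of "pixel" values (hypothetical data).
bill = [[r * 6 + c for c in range(6)] for r in range(4)]

tiles = split_into_tiles(bill, tile_h=2, tile_w=2)  # 6 fragments, one per worker

# In the real project each worker drew a copy by hand; here the "drawings"
# are simply the tiles unchanged.
drawings = tiles

rebuilt = assemble(drawings, tiles_per_row=3)
```

With 10,000 fragments instead of six, no single worker can tell what the finished image will be.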

So imagine a future where teenagers and retirees are earning minimum wage helping Google's translation algorithms, where any book written before 1921 is available in any language. Imagine Google's high-resolution scanners getting to books before libraries de-accession the works anyway because a poor-quality microfilm that no human has ever looked at exists in a basement in Michigan.
