Google Promotes Open Source OCR Library

"You might wonder," reads a post published yesterday morning on Google's corporate blog, "why Google is interested in [optical character recognition]." Indeed, you might wonder, unless you already knew that Google has been deeply involved in an on-again, off-again project to produce a digital library of the world's literary material.

Although that project is officially suspended, work continues on one of the technical prerequisites for such a library: a project called Tesseract, begun in 1985 at the University of Nevada, Las Vegas. The school worked with HP to build a reliable OCR system that would work with all manner of printed text.

As the World Wide Web started to take root, Tesseract began losing ground, perhaps mainly because of HP's reorganization from a research company into a consumer products firm. In 2005, Google apparently made a successful case for UNLV to release Tesseract as open source.