American Historical Association Critiques Google Books

Robert Townsend lists on the American Historical Association Blog several problems with Google Books and its program of scanning books to make them accessible on the internet. His critique is mostly focused on the accuracy of Google’s scanning, or rather the inaccuracy of its OCR, which he believes has led to a variety of problems.

The problem of quality control only exacerbates my most basic worry about the larger rush to digitize every scrap of information—that we are adding to the pile much faster than the technology can advance to extract the information in a useful or meaningful way. When I have asked people who know a lot more about the technology than me about this problem, they tend to wave their hand and mumble about “brilliant scientists” and “technological progress.” Forgive me if I remain unconvinced. Even for someone fairly proficient in Boolean search terms I find a lot of the results from Google Books (and Google more generally) just page after page of useless and irrelevant information. I find it increasingly hard to believe that Google can add tens of thousands of additional books each month to the information pile—many containing basic mistakes in content and metadata—and the information results will actually grow better over time.

The problem with Townsend’s quick dismissal of the opinions of “people who know a lot more about technology than [him]” is that there is stunningly brilliant work being done in the field of search technology, not to mention OCR software.

I remember a consumer flatbed scanner ten years ago that would take 5 minutes or so to scan a flat page of text, and the OCR was pitiful. By contrast, I am now able to use my digital camera and tripod to photograph documents for my own research at about 5 to 10 pages a minute (depending on page size, condition, binding, etc.), and my personal OCR software is upwards of 90% successful at recognizing type (I still wouldn’t dare try anything handwritten). For my own research, the OCR capability is something I’m experimenting with, not relying upon. Digitizing the documents I’m finding in archives is immensely useful, however, because I can enlarge the images on a large monitor and zoom in to read them much more effectively than I could with a magnifying glass. Five or ten years from now, when OCR is even more advanced, there’s nothing stopping me from running it on my TIFF images.

My own experience with research thus far is beside the point of Google Books. I offered the experience as some form of vindication for the technology underlying Google’s book scanning process. If a lowly graduate student like myself can do what I described fairly easily and inexpensively, what tools must Google have at its disposal?

Townsend continues to be a naysayer in his piece, and concludes that Google’s immense financial resources are reason to be mistrustful of the motives behind its efforts.

So I have to ask, what’s the rush? In Google’s case the answer seems clear enough. Like any large corporation with a lot of excess cash the company seems bent on scooping up as much market share as possible, driving competition off the board and increasing the number of people seeing (and clicking on) its highly lucrative ads. But I am not sure why the rest of us should share the company’s sense of haste. Surely the libraries providing the content, and anyone else who cares about a rich digital environment, needs to worry about the potential costs of creating a “universal library” that is filled with mistakes and an impenetrable smog of information. Shouldn’t we ponder the costs to history if the real libraries take error-filled digital versions of particular books and bury the originals in a dark archive (or the dumpster)? And what is the cost to historical thinking if the only substantive information one can glean out of Google is precisely the kind of narrow facts and dates that make history classes such a bore? The future will be here soon enough. Shouldn’t we make sure we will be happy when we get there?

Of course, the problem with the argument that the Google “library” is “error-filled” is that it is constantly being updated and revised. The “errors” that do exist, as the result of faulty OCR or scanning, are going to be fixed as part of Google’s process. The fear that actual libraries will relegate their scanned books to a “dark archive” or “dumpster” seems unfounded at this point. It is hard to imagine any historian or scholar tolerating that sort of destruction no matter how accurate and complete the digital archive becomes. Rather, the digital should complement, supplement, and augment the original physical archive.

A fellow Berkeley student, Jo Guldi, has used Google Books much more than I have for her research and dissertation work. On her blog she writes about how many new places Google Books search results have taken her by introducing her to a wealth of obscure 19th-century texts. Google Books and Google search results were among many starting points for her work, and opened up new avenues of inquiry for her.

I expect that Google Books, combined with many of Google’s other search technologies, will help scholars in a wide variety of ways. Instead of criticizing Google’s efforts, graduate students as well as established scholars should be lending the expertise we have with the content to Google and other organizations that have the technical and financial resources to help us with our research and writing.

Townsend’s critique on the American Historical Association Blog: “Google Books: What’s Not to Like?”
