This article originally appeared in the Yiddish Forverts.
After years of work by a small team of linguists, computer programmers, and volunteer editors, visitors to the Yiddish Book Center’s website can now search millions of pages of digitized Yiddish books with the aid of a newly launched computer program.
The program, Jochre, allows users to search for a specific word or phrase and instantly find every mention of it in more than 10,000 Yiddish books. Previously the books, which have been available online in PDF form for a decade, were only searchable by title and author name. It’s no exaggeration, note Yiddish scholars, to say that the software will revolutionize their field.
With Jochre, every Yiddish-reading computer user can now perform searches in a matter of seconds that would have taken a skilled researcher years. Linguists can quickly determine how common a word was at a given time, while literary scholars and historians can instantly find which texts reference a writer, historical figure, or topic. Since the Yiddish Book Center’s collection of 10,000 titles constitutes the majority of Yiddish-language books, Yiddish literature is now among the world’s most readily searchable literatures.
Eddy Portnoy, academic advisor to the YIVO Institute for Jewish Research and author of Bad Rabbi and Other Strange but True Stories from the Yiddish Press, told the Forverts that the opportunities this software affords Yiddish scholars are “extraordinary. It’s actually hard to describe how big of a deal this is for the field. I really thought I’d never live to see this day.”
Portnoy recalled that about twenty years ago, when the Yiddish Book Center was just beginning the process of digitizing its collections, he asked the organization’s founder and president, Aaron Lansky, if there were plans to make the texts searchable.
“Aaron (Lansky) told me that it would never happen,” Portnoy said. “The technology simply didn’t exist and he thought that it would cost millions of dollars to create it.”
The person most responsible for the Yiddish Book Center’s new search feature is the linguist and computer programmer Assaf Urieli. A multilingual South-African-born Israeli who lives in France, Urieli runs a software company whose clients include CNES, the French equivalent of NASA. Urieli discovered the Yiddish Book Center’s online collection shortly after it went live in 2009 and decided to create a computer program that would allow him to find words within the collection’s scanned books. According to a profile of him published in 2012 in the Yiddish Book Center’s journal Pakn-Treger, Urieli originally thought that the project would take him only a couple of months. He soon realized, however, that it would take years to complete. He persevered, and two years later his program could identify letters with 97% accuracy.
At that point Urieli emailed Aaron Lansky, offering to donate the software to the Yiddish Book Center and continue working on it pro bono on one condition: that the Center would allow other libraries and archives with Yiddish collections to use it for free. After Urieli showed Lansky an early version of Jochre and the two discussed the project, they expected that the software would be perfected within about two years. Today, a little over eight years later, it’s finally ready. A beta version is now available on the Yiddish Book Center’s website.
Staff at the Yiddish Book Center told the Forverts that a more polished version of the software with an expanded user interface will launch shortly.
So how does Jochre work? Jochre (the initials stand for Java Optical Character Recognition) is an optical character recognition (OCR) program that matches the shape of every letter on a scanned page with combinations of letters in a dictionary in order to recognize words. Although such programs have long existed for English and other languages that use the Latin alphabet, when Urieli began his project in 2009, OCR software was still in its infancy for the Hebrew alphabet. Getting an OCR program to work with older Hebrew and Yiddish texts posed an additional challenge, as the fonts used in older books vary greatly and several Hebrew letters (for instance ס samekh and ם final mem) often appear nearly identical in 19th- and early-20th-century works.
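To make the idea of checking letter shapes against a dictionary concrete, here is a minimal Python sketch. It is not Jochre's actual code (Jochre is written in Java, and its internals are not described here). The function name `best_word`, the confidence scores, and the `LEXICON_BONUS` weight are all assumptions made up for illustration. The sketch takes per-letter guesses from a hypothetical shape classifier and prefers the spelling that appears in a word list.

```python
from itertools import product

# Assumed weight given to spellings found in the dictionary (illustrative only).
LEXICON_BONUS = 10.0


def best_word(candidates, lexicon):
    """Pick the most plausible spelling from per-letter guesses.

    candidates: one list per letter position, each holding
                (letter, confidence) pairs from a hypothetical classifier.
    lexicon:    set of known words.
    """
    best, best_score = "", -1.0
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = 1.0
        for _, confidence in combo:
            score *= confidence
        if word in lexicon:
            score *= LEXICON_BONUS  # reward dictionary words
        if score > best_score:
            best, best_score = word, score
    return best


# The last letter is ambiguous: the classifier slightly prefers samekh,
# but only the final-mem reading yields a dictionary word.
guesses = [
    [("\u05e9", 0.9)],                     # shin
    [("\u05dc", 0.9)],                     # lamed
    [("\u05d5", 0.9)],                     # vov
    [("\u05e1", 0.55), ("\u05dd", 0.45)],  # samekh vs. final mem
]
lexicon = {"\u05e9\u05dc\u05d5\u05dd"}     # the word "sholem"

print(best_word(guesses, lexicon))
```

With an empty word list the same guesses would resolve to the samekh reading, which shows why a dictionary helps with look-alike letters such as ס and ם.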
Yiddish itself added many complications. The language’s spelling conventions vary greatly by time and place, so the software had to be trained to recognize a wide variety of spellings and match them to the standardized Yiddish orthography found in 21st century dictionaries. Additionally, the program had to be taught the intricacies of Yiddish grammar so that it could recognize both singular and plural forms of nouns and the five adjectival endings associated with different cases.
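One small piece of this matching problem can be sketched in Python: making pointed and unpointed spellings of the same word comparable. The example below is an illustration only, not a description of how Jochre handles orthography, and the names `search_key` and `build_index` are hypothetical. It removes vowel points and expands the Yiddish double-letter ligatures so that variant forms share one lookup key. It does not attempt the harder tasks mentioned above, such as regional spellings or plural and case endings.

```python
import unicodedata

# Yiddish ligature code points and their two-letter equivalents.
LIGATURES = {
    "\u05f0": "\u05d5\u05d5",  # double vov
    "\u05f1": "\u05d5\u05d9",  # vov yud
    "\u05f2": "\u05d9\u05d9",  # double yud
}


def search_key(word):
    """Return a simplified form of a word for matching spelling variants."""
    # NFD splits precomposed letters (e.g. alef with komets) into base + point.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (Unicode category Mn), which include vowel points.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    for ligature, pair in LIGATURES.items():
        stripped = stripped.replace(ligature, pair)
    return stripped


def build_index(standard_words):
    """Map each simplified key to the standard spellings that share it."""
    index = {}
    for word in standard_words:
        index.setdefault(search_key(word), []).append(word)
    return index


standard = "\u05d9\u05d0\u05b8\u05e8"  # "yor" (year), with komets under the alef
unpointed = "\u05d9\u05d0\u05e8"       # the same word printed without points
index = build_index([standard])
print(index.get(search_key(unpointed)))
```

Looking up the unpointed spelling returns the standard pointed form, because both reduce to the same key.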
Now that the work of making the Yiddish Book Center’s collections searchable has been largely completed, the Center’s president Aaron Lansky hopes that Jochre will be adopted by other institutions. For years, Lansky has envisioned the creation of a searchable Yiddish library that would combine the Yiddish Book Center’s collections with those of the National Library of Israel and other organizations to make the entirety of modern Yiddish literature readily accessible to the public.
“It would be incredible if the Yiddish press became fully searchable,” said YIVO’s Portnoy, referring to Hebrew University’s Historical Jewish Press Archive, which includes hundreds of thousands of pages of Yiddish newspapers, including nearly the full run of the Forverts between 1897 and 1979. “Hopefully, we’ll soon be able to do that work much more quickly. Spending endless hours poring over microfilm was an interesting experience but I don’t miss it.”
Speaking about the technology’s impact on the field more broadly, Portnoy added: “This is an enormous boost to everyone who works with Yiddish literature, linguistics, or folklore. I’m really excited to see what new scholarship will come from it.”