Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we have founded a user group [1] to promote international coordination and partnerships with fellow organizations. We have thousands of high-quality, volunteer-proofread pages [2] matched with scans in ca. 50 different languages [3]. The editing interface for a single page looks like this [4]; the same work can also be viewed as an "index" [5] or as text with all pages together [6]. There are several verification levels, the most important being "yellow", which means that one contributor has proofread the page, and "green", which means that a second person has verified the proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we discussed technical and social issues from several Wikisource language communities. One of the most serious issues was raised by the Belarusian community, which uses two different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered whether it would be possible to train Tesseract to recognize these old texts using the text that has already been typed.

We would like to know whether you would be interested in exploring possibilities for collaboration. I imagine that with your guidance we could prepare training data not only in different languages, but also from different time periods, scripts, etc. At the moment it is not very clear to us how to achieve this.

Please let us know if you would like to have a Hangout/Skype conversation any day next week.

Cheers,
Micru

[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4] https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf/2
[5] https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf
[6] https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs