Re: [Wikisource-l] [tesseract-ocr] Outreach from the Wikisource community - Wikisource-l

13 Aug 2014

PS: I forwarded Jim's message to one of the Belarusian Wikisourcers

On Tue, Aug 12, 2014 at 11:12 PM, Jim O'Regan <joregan(a)gmail.com> wrote:

...
  On 12 August 2014 17:25, Nick White <nick.white(a)durham.ac.uk> wrote:
  Dear Wikisourcerers,

 It's good to hear from you. Wikisource is awesome, as far as I am
 concerned.

 > One of the most serious issues was raised by the Belarusian community, which
 > uses 2 different scripts with no commercial OCR support. This means that the
 > volunteers have to type each word manually. We wondered if it would be
 > possible to train Tesseract to recognize these old texts using the text that
 > has already been typed.
 Actually, Tesseract should already have support for Russian and
 Belarusian "out of the box"; see the 'rus' and 'bel' training
 data.

 'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
 Belarusian. (Russian is widely spoken in Belarus, but Russian texts
 would be added to the Russian Wikisource).

 The question I'd have for the Belarusian Wikisourcers is: can the two
 scripts be treated as having an exact mapping? (It doesn't need to be 1:1, I'm
 aware that, e.g., 'нь' maps to 'ń'). I ask because, as I remember it,
 there's very little text in Łacinka, and adapting Cyrillic material
 could be useful.
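
 To make the "exact but not 1:1" idea concrete, here is a minimal sketch of a
 longest-match-first transliterator. Only the 'нь' to 'ń' pair comes from this
 thread; the other three entries are a tiny illustrative subset, and the names
 MAPPING and transliterate are hypothetical. A real Cyrillic-to-Łacinka table
 would need the full alphabet and context-dependent rules, which is exactly
 what the question above is asking the Belarusian Wikisourcers about.

```python
# Illustrative subset only: a real table would cover the whole alphabet
# and any context-dependent rules. Only "нь" -> "ń" is taken from the thread.
MAPPING = {
    "нь": "ń",  # digraph: two Cyrillic letters map to one Latin letter
    "к": "k",
    "о": "o",
    "н": "n",
}


def transliterate(text, mapping=MAPPING):
    """Replace Cyrillic sequences with Latin ones, trying longer keys first."""
    # Sorting by length ensures "нь" is matched before the single letter "н".
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping known: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("конь"))
```

 Trying longer keys first is what lets a many-to-one mapping stay
 deterministic; whether the reverse direction is equally unambiguous is a
 separate question.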

  One thing that Wikisource could potentially do for us would be to
  provide loads of proofread, freely reusable "ground truth" data to
  test Tesseract with. Are there programmatic ways of getting at the
  data, for example downloading all page images and corresponding text
  that is marked as green, for a specific language / script?
 They're all added to a category, so that part should be pretty easy.
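
 As a rough sketch of the category route, the MediaWiki API can list the
 members of a category with action=query and list=categorymembers, following
 the "continue" block in each response until it is absent. The function names,
 the User-Agent string, and the example endpoint and category below are
 assumptions for illustration; the actual category name differs per language
 edition and would need to be looked up on the wiki in question.

```python
import json
import urllib.parse
import urllib.request


def category_query_url(api_url, category, cont=None, limit=500):
    """Build a list=categorymembers query URL for a MediaWiki api.php endpoint."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    if cont:
        # Pass back the whole "continue" object from the previous response.
        params.update(cont)
    return api_url + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a real HTTP request)."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "ground-truth-sketch/0.1"}  # placeholder
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def category_members(api_url, category, fetch=fetch_json):
    """Yield the titles of all pages in a category, following continuation."""
    cont = None
    while True:
        data = fetch(category_query_url(api_url, category, cont))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
```

 For example, category_members("https://be.wikisource.org/w/api.php",
 "Category:Validated") would iterate over page titles if a category of that
 name exists there (the name is a guess). Fetching the page text and the
 matching scan image for each title would be a further step not shown here.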

 --
 <Sefam> Are any of the mentors around?
 <jimregan> yes, they're the ones trolling you

-- 
Etiamsi omnes, ego non