PS: I forwarded Jim's message to one of the Belarusian Wikisourcers
On Tue, Aug 12, 2014 at 11:12 PM, Jim O'Regan <joregan(a)gmail.com> wrote:
> On 12 August 2014 17:25, Nick White <nick.white(a)durham.ac.uk> wrote:
> > Dear Wikisourcerers,
> >
> > It's good to hear from you. Wikisource is awesome, as far as I am
> > concerned.
> >
> >> One of the most serious issues was raised by the Belarusian community
> >> which uses 2 different scripts with no commercial OCR support. This
> >> means that the volunteers have to type each word manually. We wondered
> >> if it would be possible to train Tesseract to recognize these old texts
> >> using the text that has been already typed.
> >
> > Actually, Tesseract should already have support for Russian and
> > Belarusian "out of the box"; see the 'rus' and 'bel' training data.
> >
>
> 'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
> Belarusian. (Russian is widely spoken in Belarus, but Russian texts
> would be added to the Russian Wikisource).
>
> The question I'd have for the Belarusian Wikisourcers is: can the two
> scripts be treated as having an exact mapping? (It doesn't need to be 1:1, I'm
> aware that, e.g., 'нь' maps to 'ń'). I ask because, as I remember it,
> there's very little text in Łacinka, and adapting Cyrillic material
> could be useful.
>
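To make the mapping question concrete, here is a minimal sketch of a
longest-match-first transliterator that handles a non-1:1 pair such as
'нь' -> 'ń'. The table is a tiny illustrative subset that I am assuming for
the example, not a complete or authoritative Łacinka mapping, and the name
to_lacinka is hypothetical.

```python
# Partial, illustrative table only (an assumption, not a full Łacinka
# mapping). Digraphs such as "нь" must be matched before single letters.
CYR_TO_LAT = {
    "нь": "ń",
    "сь": "ś",
    "а": "a",
    "к": "k",
    "н": "n",
    "о": "o",
    "с": "s",
    "ш": "š",
}


def to_lacinka(text):
    """Transliterate using the longest matching key at each position."""
    keys = sorted(CYR_TO_LAT, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(CYR_TO_LAT[key])
                i += len(key)
                break
        else:
            # Characters missing from the table pass through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

If the real mapping turns out to be context-dependent, a plain lookup table
like this would not be enough and the answer to the question above is "no".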
> > One thing that Wikisource could potentially do for us would be to
> > provide loads of proofread, freely reusable "ground truth" data to
> > test Tesseract with. Are there programmatic ways of getting at the
> > data, for example downloading all page images and corresponding text
> > that is marked as green, for a specific language / script?
>
> They're all added to a category, so that part should be pretty easy.
>
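Since the proofread pages sit in a category, one possible way to list them
is the MediaWiki API's list=categorymembers query. The sketch below only
builds the request URL. The endpoint host and the category name used in the
usage line are assumptions on my part, and category_members_url is a
hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Belarusian Wikisource; adjust for other wikis.
API_URL = "https://be.wikisource.org/w/api.php"


def category_members_url(category, cmcontinue=None, limit=500):
    """Build a MediaWiki API URL listing the members of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        # Token taken from the previous response, used to fetch the next batch.
        params["cmcontinue"] = cmcontinue
    return f"{API_URL}?{urlencode(params)}"
```

For example, category_members_url("Validated") would give the first batch
for a category called "Validated" (a placeholder; the actual category name
on each wiki needs checking). Each later batch is requested by passing back
the cmcontinue value returned in the previous response.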
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
--
Etiamsi omnes, ego non