[Wikisource-l] Fwd: [tesseract-ocr] Outreach from the Wikisource community

13 Aug 2014


Thanks a lot for this nice answer.
Here is a technical answer to the question:
...
> Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script?
Yes. You can get the list of Page: pages (the pages that contain the wikitext for a given scan image) using this API request: https://en.wikisource.org/w/api.php?action=query&generator=allpages&... On en.wikisource the Page: namespace id is 104; this id is not the same on all Wikisources. (doc: https://www.mediawiki.org/wiki/API:Allpages )
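As a rough illustration of the listing step, here is a minimal Python sketch that builds such a request URL. The full query string above is truncated, so the parameters used here (gapnamespace, gaplimit, gapcontinue, format) are my assumption of a reasonable request based on the standard allpages generator, not a copy of the original link; the function name is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikisource.org/w/api.php"
PAGE_NAMESPACE_ID = 104  # en.wikisource only; other Wikisources use other ids


def allpages_url(namespace_id=PAGE_NAMESPACE_ID, limit=50, continue_from=None):
    """Build a URL that lists pages of one namespace via generator=allpages.

    The parameter choice is an assumption, not the exact request from the
    (truncated) link in this thread.
    """
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": namespace_id,
        "gaplimit": limit,
        "format": "json",
    }
    if continue_from is not None:
        # Resume a listing from the continuation value of a previous response.
        params["gapcontinue"] = continue_from
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(allpages_url())
```

The script only prints the URL; fetching it and following the continuation values is left to whatever HTTP client you prefer.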
Then you can retrieve the content of a "green" page (one with "quality": 4) using this API request: https://en.wikisource.org/w/api.php?action=query&prop=revisions&titl... (doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv ).
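To show how the "green" pages could be picked out once a response has been parsed, here is a small Python sketch. It assumes, and this is only an assumption, that each page entry in the JSON carries a "proofread" object with a "quality" field, as in the "quality": 4 value mentioned above; the function name and the sample data are hypothetical.

```python
VALIDATED_QUALITY = 4  # the "green" level mentioned above


def validated_titles(response):
    """Return the titles of pages whose quality level is 4, sorted by title.

    Assumes a parsed response shaped like
    {"query": {"pages": {"<id>": {"title": ..., "proofread": {"quality": N}}}}}.
    Pages without quality information are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    titles = []
    for page in pages.values():
        quality = page.get("proofread", {}).get("quality")
        if quality == VALIDATED_QUALITY:
            titles.append(page["title"])
    return sorted(titles)


if __name__ == "__main__":
    # Made-up sample data, only to exercise the function.
    sample = {
        "query": {
            "pages": {
                "1": {"title": "Page:Example.djvu/1", "proofread": {"quality": 4}},
                "2": {"title": "Page:Example.djvu/2", "proofread": {"quality": 3}},
            }
        }
    }
    print(validated_titles(sample))
```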
To get the image of a given Page: page, use this API request: https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20E... It retrieves the URL of a file from its title. A Page: page has the title "Page:NAME_OF_THE_FILE", sometimes followed by "/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives you the name of the image to use.
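The title convention just described can be sketched in Python as follows. The function name and the example titles are hypothetical, and the sketch assumes that a trailing "/NUMBER" is always a page number rather than part of the file name.

```python
def split_page_title(page_title):
    """Split "Page:NAME_OF_THE_FILE[/PAGE_NUMBER]" into (file name, page number).

    The page number is None when the title has no numeric "/NUMBER" suffix.
    """
    prefix = "Page:"
    if not page_title.startswith(prefix):
        raise ValueError("not a Page: title: %r" % page_title)
    rest = page_title[len(prefix):]
    # Only the part after the last slash can be a page number.
    name, sep, suffix = rest.rpartition("/")
    if sep and suffix.isdecimal():
        return name, int(suffix)
    return rest, None


if __name__ == "__main__":
    print(split_page_title("Page:Example.djvu/12"))
    print(split_page_title("Page:Example.jpg"))
```

The file name returned here is what you would put after "Image:" in the titles parameter of the request above.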
Thanks again,
Thomas
...
From: Nick White nick.white@durham.ac.uk
Date: Tue, Aug 12, 2014 at 6:25 PM
Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
To: tesseract-ocr@googlegroups.com
Cc: "discussion list for Wikisource, the free library" wikisource-l@lists.wikimedia.org, David Cuenca dacuetu@gmail.com
Dear Wikisourcerers,
It's good to hear from you. Wikisource is awesome, as far as I am
concerned.
...
> One of the most serious issues was raised by the Belarusian community which uses 2
> different scripts with no commercial OCR support. This means that the
> volunteers have to type each word manually. We wondered if it would be possible
> to train Tesseract to recognize these old texts using the text that has been
> already typed.
Actually, Tesseract should already have support for Russian and
Belarusian "out of the box"; see the 'rus' and 'bel' training data.
...
> We would like to know if you would be interested in exploring collaboration
> possibilities. I imagine that with your guidance we could prepare training data
The first thing to do would be to take a look at the results you get
from Tesseract with the rus and bel training sets already available,
and let us know if they aren't appropriate.
...
> not only in different languages, but also from different time
> periods, scripts, etc.
As to training for specific scripts, time periods, etc., in theory
that is super cool; in practice probably one training set should be
able to cover more-or-less everything (except very different
scripts, like fraktur). That has been my experience with training
Ancient Greek (for which I have been interested in recognising
printing from a variety of time periods).
So give Tesseract a whirl, and if it isn't appropriate, or doesn't
work for specific scripts, let us know and we can try to figure out
a plan.
...
> At the moment it is not very clear how to achieve this.
My plan is to rewrite the training documentation very soon, so
things should hopefully become clearer on that front.
One thing that wikisource could potentially do for us would be
provide loads of proofread, freely reusable "ground truth" data to
test Tesseract with. Are there programmatic ways of getting at the
data, for example downloading all page images and corresponding text
that is marked as green, for a specific language / script?
Thanks for getting in touch!
Nick
-- 
Etiamsi omnes, ego non
