PS: I forwarded Jim's message to one of the Belarusian Wikisourcers
On Tue, Aug 12, 2014 at 11:12 PM, Jim O'Regan <joregan(a)gmail.com> wrote:
> On 12 August 2014 17:25, Nick White <nick.white(a)durham.ac.uk> wrote:
> > Dear Wikisourcerers,
> >
> > It's good to hear from you. Wikisource is awesome, as far as I am
> > concerned.
> >
> >> One
> >> of the most serious issues was raised by the Belarusian community which
> uses 2
> >> different scripts with no commercial OCR support. This means that the
> >> volunteers have to type each word manually. We wondered if it would be
> possible
> >> to train Tesseract to recognize these old texts using the text that has
> been
> >> already typed.
> >
> > Actually, Tesseract should already have support for Russian and
> > Belarusian "out of the box"; see the 'rus' and 'bel' training data.
> >
>
> 'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
> Belarusian. (Russian is widely spoken in Belarus, but Russian texts
> would be added to the Russian Wikisource).
>
> The question I'd have for the Belarusian Wikisourcers is: can they be
> treated as having an exact mapping? (It doesn't need to be 1:1, I'm
> aware that, e.g., 'нь' maps to 'ń'). I ask because, as I remember it,
> there's very little text in Łacinka, and adapting Cyrillic material
> could be useful.
>
> > One thing that wikisource could potentially do for us would be
> > provide loads of proofread, freely reusable "ground truth" data to
> > test Tesseract with. Are there programmatic ways of getting at the
> > data, for example downloading all page images and corresponding text
> > that is marked as green, for a specific language / script?
>
> They're all added to a category, so that part should be pretty easy.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
--
Etiamsi omnes, ego non
Thanks a lot for this nice answer,
A technical answer to the question:
> Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script?
Yes. You can list the Page: pages (the pages that hold the wikitext for a given scan image) with this API request: https://en.wikisource.org/w/api.php?action=query&generator=allpages&gapname… for en.wikisource, where the Page: namespace id is 104 (this id is not the same on all Wikisources) (doc: https://www.mediawiki.org/wiki/API:Allpages ).
Then you can retrieve the content of a "green" page (one with "quality": 4) using this API request: https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page… (doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv ).
To get the image for a given Page: page, use this API request: https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20Eins… It retrieves the URL of a file from its title. A Page: page is titled "Page:NAME_OF_THE_FILE", sometimes followed by "/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives you the name of the image to use.
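The three steps above could be sketched roughly as follows in Python. This is only an illustration, not a tested harvesting tool: the function names are hypothetical, the parameters `gapnamespace`, `gaplimit`, `prop=proofread` (assumed here to be the ProofreadPage extension's property that reports the "quality" value) and `prop=imageinfo` with `iiprop=url` are my reading of the MediaWiki API and are not taken from the truncated URLs above, and the namespace id 104 only applies to en.wikisource.

```python
from urllib.parse import urlencode

API = "https://en.wikisource.org/w/api.php"
PAGE_NS = 104   # Page: namespace id on en.wikisource (differs on other Wikisources)
VALIDATED = 4   # "green" pages carry "quality": 4


def list_pages_url(gapcontinue=None, limit=50):
    """Build a request URL listing Page: pages with their proofread quality.

    Assumption: prop=proofread is provided by the ProofreadPage extension.
    """
    params = {
        "action": "query",
        "format": "json",
        "generator": "allpages",
        "gapnamespace": PAGE_NS,
        "gaplimit": limit,
        "prop": "proofread",
    }
    if gapcontinue:
        # Continuation token copied from the previous response, if any.
        params["gapcontinue"] = gapcontinue
    return API + "?" + urlencode(params)


def validated_titles(response):
    """Return the sorted titles of pages whose quality is 4 in a decoded JSON response."""
    pages = response.get("query", {}).get("pages", {})
    return sorted(
        page["title"]
        for page in pages.values()
        if page.get("proofread", {}).get("quality") == VALIDATED
    )


def split_page_title(title):
    """Split 'Page:NAME_OF_THE_FILE/PAGE_NUMBER' into (file name, page number or None)."""
    name = title.split(":", 1)[1]
    base, sep, tail = name.rpartition("/")
    if sep and tail.isdigit():
        return base, int(tail)
    return name, None


def image_info_url(file_name):
    """Build a request URL asking for the URL of the scan file itself."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "File:" + file_name,
        "prop": "imageinfo",
        "iiprop": "url",
    }
    return API + "?" + urlencode(params)
```

A caller would fetch `list_pages_url()` with any HTTP client, decode the JSON, pass it to `validated_titles`, and then use `split_page_title` on each title to find the file to look up with `image_info_url`. The network calls are left out on purpose so that the sketch stays independent of any particular HTTP library.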
Thanks again,
Thomas
> From: Nick White <nick.white(a)durham.ac.uk>
> Date: Tue, Aug 12, 2014 at 6:25 PM
> Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
> To: tesseract-ocr(a)googlegroups.com
> Cc: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>, David Cuenca <dacuetu(a)gmail.com>
>
>
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
>
> > One
> > of the most serious issues was raised by the Belarusian community which uses 2
> > different scripts with no commercial OCR support. This means that the
> > volunteers have to type each word manually. We wondered if it would be possible
> > to train Tesseract to recognize these old texts using the text that has been
> > already typed.
>
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
>
> > We would like to know if you would be interested in exploring collaboration
> > possibilities. I imagine that with your guidance we could prepare training data
>
> The first thing to do would be to take a look at the results you get
> from Tesseract with the rus and bel training sets already available,
> and let us know if they aren't appropriate.
>
> > not only in different languages, but also from different time
> > periods, scripts, etc.
>
> As to training for specific scripts, time periods, etc., in theory
> that is super cool, in practice probably one training set should be
> able to cover more-or-less everything (except very different
> scripts, like fraktur). That has been my experience with training
> Ancient Greek (for which I have been interested in recognising
> printing from a variety of time periods).
>
> So give Tesseract a whirl, and if it isn't appropriate, or doesn't
> work for specific scripts, let us know and we can try to figure out
> a plan.
>
> > At the moment it is not very clear how to achieve this.
>
> My plan is to rewrite the training documentation very soon, so
> things should hopefully become clearer on that front.
>
> One thing that wikisource could potentially do for us would be
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with. Are there programmatic ways of getting at the
> data, for example downloading all page images and corresponding text
> that is marked as green, for a specific language / script?
>
> Thanks for getting in touch!
>
> Nick
>
>
>
> --
> Etiamsi omnes, ego non
Thanks to everyone who came to the Wikisource meetup at Wikimania 2014. We
had a very nice talk and it was wonderful to get to know each other in
person!
Some of the topics that were discussed at the first Wikisource Community
User Group meet up:
- lack of OCR for Belarusian and other languages -> I just sent an email to
the Tesseract list to see what we could do about it
- what to do with "wikisource.org" (oldwikisource) -> probably it is worth
creating a mul.* domain so we can connect it with Wikidata. It needs
broader discussion / promotion
- Wikidata -> not clear what to do with "Index:" pages, needs some thought
about how to move that metadata to Wikidata. Consider that the relationship
between the "index" (supporting/containing media, closer to Commons) and
the edition item (version of a work) is not always 1:1. Perhaps "part of"
or "published in" could be used here, but as said it needs more thought and
work on some use cases (let's be brave creating properties and examples on
test.wikidata.org)
- tabular data (csv) -> wikisource could host such sources if there is
enough interest and a technical solution to manage this kind of
information. Needs more discussion, and gather tech support.
- user engagement -> proposed to gamify proofreading. Identified potentially
interested partners; if we don't manage to get support, it could be proposed
as a GSoC project next year.
- advocacy -> wikisourcerors are encouraged to contact their nearest
wikimedia chapter to look for support and raise awareness about the issues
that need attention. This kind of trust relationship is necessary to
organize the next proofreading contest. Chapters should be aware that with
just 100eur in prizes and help reaching out (for instance banners on
Wikipedia) a lot of things can be accomplished by volunteers like us.
- outreach -> tell GLAM members about wikisource so that when they contact
libraries they can promote wikisource as well (and not only
wikipedia/commons)
- open access -> identified as a high potential project for wikisource. It
needs some automatic text verification so we don't waste volunteers' time
proofreading born-digital documents.
- social -> more boldness needed! Let's not be afraid of changing things to
make the project more attractive for readers (for instance cover galleries
on the main page, or try collaborative or experimental tools). More contact
wished with the German community. Promote Wikisource on social platforms
and on GLAM newsletters.
- adminship -> it was proposed to promote Tpt to global steward so he can
better help new communities. It was also proposed that experienced members
could mentor wikisourcerors from new communities so they get to know all
the tools/hidden gems that we have.
Besides, this year there were several Wikisource-related submissions:
- Wikisource technical infrastructure-> Tpt presented his advances on the
visual editor integration and on the new beta feature "sister projects" to
be deployed in the next weeks [1]. Xelgen presented the advances of his
grant project for new Wikisource tools [2].
- Crowdsourcing the Digitization of Ben-Yehuda's Dictionary -> Asaf
presented his Hebrew crowd-sourced dictionary project [3]
- Wikisource panel -> Charles Matthews, Magnus Manske et al. explained what
Wikisource is about, possible options to gamify it, and its possible role
as a host of tabular data [4]
- Reform citation structure -> where several panelists explained the new
landscape of citations that is coming with Wikidata. I presented Wikisource
very briefly, to raise awareness of our project and the role it plays
supporting Wikipedia [5]
- From paper to digital book on Wikisource -> where Xelgen described the
best techniques to digitize a book and convert it into a digital one [6]
I hope I didn't forget anything... there was *a lot* of Wikisource this
year at Wikimania :)
Thanks
Micru
[1] https://www.mediawiki.org/wiki/Beta_Features/Other_projects_sidebar
[2]
https://meta.wikimedia.org/wiki/Grants:IEG/Tools_for_Armenian_Wikisource_an…
[3] https://wikimania2014.wikimedia.org/wiki/File:The_Old_New_Thing.pdf
[4]
https://wikimania2014.wikimedia.org/wiki/Submissions/Panel:_Wikisource,_fro…
[5]
https://wikimania2014.wikimedia.org/wiki/Submissions/Reform_of_citation_str…
[6]
https://wikimania2014.wikimedia.org/wiki/Submissions/From_paper_book_to_a_d…
Dear Tesseracters,
At Wikisource, the free digital library and sister project of Wikipedia, we
have founded a user group [1] to promote international coordination and
partnerships with fellow organizations. We have thousands of high quality
volunteer proofread pages [2] matched by scans in ca. 50 different
languages [3]. Our editing interface of one single page looks like this
[4], which has another view as "index" [5] or as text with all pages
together [6]. There are several verification levels, the most important are
"yellow" which means that one contributor proofread the page, and "green"
which means that a second person verified the proofread text.
This past weekend at Wikimania '14 in London we had a meeting where we
discussed technical and social issues from several Wikisource language
communities. One of the most serious issues was raised by the Belarusian
community which uses 2 different scripts with no commercial OCR support.
This means that the volunteers have to type each word manually. We wondered
if it would be possible to train Tesseract to recognize these old texts
using the text that has been already typed.
We would like to know if you would be interested in exploring collaboration
possibilities. I imagine that with your guidance we could prepare training
data not only in different languages, but also from different time periods,
scripts, etc. At the moment it is not very clear how to achieve this.
Please let us know if you would like to have a hangout/skype conversation
any day next week.
Cheers,
Micru
[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4]
https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Ca…
[5]
https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Car…
[6]
https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivoro…