Hi Seb, I'm answering personally since I'm the fellow most engaged in DjVu exploration in the it.source group.
2011/2/20 Seb35 seb35wikipedia@gmail.com
Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we spoke about that. Jean-Fred and I are a bit skeptical about the effective implementation of such a system; here are some questions that I (or we) were asking, listed in order of importance:
- how many books have such coordinates? I know the BnF-partnership books have them because they were originally in the OCR files (1057 books), but on WS a lot of books have non-valid coordinates (word 0 0 1 1 "") because Wikisourcians didn't know the meaning of these figures (the DjVu format is quite difficult to understand anyway); I don't know whether classical OCR programs have a function to output the coordinates of future OCRed books
Coordinates come from OCR interpretation. All Internet Archive books have them, both in the djvu file text layer and in the djvu.xml file. You can verify the presence of coordinates simply with DjView: open the file, go to View, select Display -> Hidden text and, if coordinates exist, you'll see the word text superimposed on the word images.
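If it helps, here's a minimal python sketch of reading those word coordinates from an IA-style djvu.xml (the file name is an assumption; IA files usually store coords as "left,bottom,right,top" plus an optional baseline):

    # Minimal sketch: list word coordinates from an Internet Archive djvu.xml.
    # Assumes the usual IA structure (one OBJECT per page, nested LINE/WORD
    # elements, "coords" attribute = "left,bottom,right,top[,baseline]").
    import xml.etree.ElementTree as ET

    def iter_words(djvu_xml_path):
        tree = ET.parse(djvu_xml_path)
        for page in tree.iter("OBJECT"):          # one OBJECT per page
            for word in page.iter("WORD"):
                left, bottom, right, top = \
                    [int(c) for c in word.get("coords").split(",")[:4]]
                yield word.text, (left, bottom, right, top)

    for text, box in iter_words("book_djvu.xml"):
        print(text, box)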
You can't get coordinates from an end-user OCR program such as FineReader 10; you have to use professional versions, i.e. OCR engines written for mass, automated batch OCR routines.
- what is the confidence in the coordinates? if you serve a half-word, it will be difficult to recognize the entire word
The confidence of the coordinates is extremely high. Coordinate calculation is the first step of any OCR interpretation, so if you get a decent OCR interpretation, it means the coordinate calculation is absolutely perfect. Obviously you'll find wrong coordinates in any case where you find a wrong OCR interpretation.
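About the half-word worry: since the boxes are that reliable, serving the crop with a small safety margin is usually enough. A hedged sketch with Pillow (the padding value and file names are illustrative, and I'm assuming DjVu-style boxes with the origin at the bottom-left):

    # Crop one word image from a page scan, padding the OCR box a little
    # so a slightly-off coordinate doesn't serve a half-word.
    from PIL import Image

    def crop_word(page_png, box, pad=5):
        left, bottom, right, top = box        # DjVu box, origin bottom-left
        img = Image.open(page_png)
        h = img.height
        # Convert to PIL's top-left origin and add the safety margin.
        return img.crop((max(left - pad, 0),
                         max(h - top - pad, 0),
                         min(right + pad, img.width),
                         min(h - bottom + pad, h)))

    crop_word("page_0042.png", (120, 880, 310, 920)).save("word.png")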
- I am asking how you can validate the correctness of a given word for a given person: a person (e.g.) creates an account on WS, a captcha is shown with a word; how do you know if his/her answer is correct? I agree this step disappears if you ask a pool of volunteers to answer different captcha-words, but in this case it comes down to the classical check by Wikisourcians, in a specialized way to treat particular cases
There are different strategies, all based on a complete automation of user interpretation.
# classical: submit two words, one known as a control, the other unknown. Exact interpretation of the known word validates the interpretation of the unknown one.
# alternative: ask for more than one interpretation of the unknown word from different users/sessions/days. Validate the interpretation when they match.
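The alternative strategy is easy to automate; a minimal python sketch (the agreement threshold and the in-memory store are assumptions, not existing wikicaptcha code):

    # Accept a word once enough independent answers agree.
    from collections import Counter

    AGREEMENT = 3                 # matching answers needed to validate

    answers = {}                  # word_id -> Counter of submitted answers

    def submit(word_id, answer):
        tally = answers.setdefault(word_id, Counter())
        tally[answer.strip()] += 1
        best, count = tally.most_common(1)[0]
        return best if count >= AGREEMENT else None  # validated text or None

    # Example: three sessions agree on "Lorem", one typo is outvoted.
    for a in ("Lorem", "Lorern", "Lorem", "Lorem"):
        print(submit("page12_word7", a))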
- you give the example of a ^ in a word, but how do you select the OCR mistakes? although this is not really an issue, since you can already make a list of common mistakes and it will be sufficient at first. I know French Wikisourcians (at least, probably others too) already keep a list of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a given book (the Trévoux in French, of 1771 it seems to me).
FineReader OCR applications use the character ^ for uninterpretable characters. Other tricks to find "probably wrong" words can be imagined, e.g. matching words against a dictionary. Usual "scannos" are better managed with different routines, in javascript or python; i.e. you can wrap them into a Regex Menu Framework clean-up routine (see the Clean up routine used by [[en:User:Inductiveload]], or the postOCR routine in the RegexMenuFramework gadget of it.source, built from Inductiveload's Clean up routine). Wikicaptcha would manage unusual OCR mistakes, not usual ones.
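To make the scanno idea concrete, a tiny python sketch of such a clean-up pass (the real routines are javascript gadgets; this substitution list is only illustrative):

    # Post-OCR scanno cleanup: apply a list of known substitutions.
    import re

    SCANNOS = [
        (r"\bII\b", "Il"),        # II -> Il
        (r"\b1l\b", "Il"),        # 1l -> Il
    ]

    def clean(text):
        for pattern, repl in SCANNOS:
            text = re.sub(pattern, repl, text)
        return text

    print(clean("II était une fois... 1l pleuvait."))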
But I know Google had a similar system for their digitization, though I don't know the details exactly. For me there are a lot of details which make the global idea difficult to carry out (although I would prefer to think the contrary), but perhaps you have some answers.
Unluckily, Google doesn't share the OCR mappings of its OCRs; it shares only the "pure text". This is one of the sound reasons to upload Google PDFs into the Internet Archive, so getting their "derivation", i.e. the publication of a derived djvu file whose text layer comes from another (usually good) OCR interpretation.
Sébastien
PS: I had another idea in a slightly different application field (roughly speaking, automated validation of texts) but close to this one; I'll write an email about it next week (there are already some notes at http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
I'll take a look with great interest.
Alex Brollo