Hi Seb, I'm answering personally, since I'm the person most involved in
djvu exploration in the it.source group.
2011/2/20 Seb35 <seb35wikipedia(a)gmail.com>
I saw VIGNERON and Jean-Frédéric today and we spoke about that. Jean-Fred
and I are a bit skeptical about the effective implementation of such a
system; here are some questions that I (or we) were asking (the questions
are listed in order of importance):
- how many books have such coordinates? I know the BnF-partnership books
have such coordinates because they were originally in the OCR files (1057
books), but on WS a lot of books have invalid coordinates (word 0 0 1 1 "")
because Wikisourcians didn't know what these figures meant (the DjVu
format is quite difficult to understand anyway); I don't know if
classical OCR have a function to indicate the coordinates of future
Coordinates come from OCR interpretation. All Internet Archive books have
them, both in the djvu file's text layer and in the djvu.xml file. You can
verify the presence of coordinates simply with DjView: open the file, go to
View, select Display->Hidden text and, if coordinates exist, you'll see the
word text superimposed on the word images.
You can't get coordinates from an end-user OCR program such as FineReader
10; you have to use professional versions, such as OCR engines written for
mass, automated batch OCR routines.
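If you prefer a scripted check to a visual one, here is a minimal Python sketch. It assumes you already have a dump of the hidden text layer in the parenthesized form (word xmin ymin xmax ymax "text"), as tools such as djvused print it; the function name summarize_words, the sample string, and the rule "empty text or a box no larger than 1x1 is a placeholder" are my own assumptions for illustration.

```python
import re

# One word entry of the text dump: (word xmin ymin xmax ymax "text")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\s*\)'
)

def summarize_words(dump):
    """Count word entries with usable boxes versus placeholder ones."""
    valid, placeholder = 0, 0
    for m in WORD_RE.finditer(dump):
        xmin, ymin, xmax, ymax = (int(m.group(i)) for i in range(1, 5))
        text = m.group(5)
        # Assumption: empty text or a box no larger than 1x1 carries no
        # real position, like the (word 0 0 1 1 "") entries seen on WS.
        if not text or (xmax - xmin <= 1 and ymax - ymin <= 1):
            placeholder += 1
        else:
            valid += 1
    return valid, placeholder

# Hypothetical sample, not taken from a real book.
sample = '''(page 0 0 2550 3300
 (line 300 2900 900 2960
  (word 300 2900 520 2960 "Hello")
  (word 540 2900 900 2960 "world"))
 (word 0 0 1 1 ""))'''

print(summarize_words(sample))
```

A book whose dump gives mostly placeholder entries has a text layer but no usable word coordinates.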
- what is the confidence in the coordinates? if you serve a half-word, it
will be difficult to recognize the entire word
The confidence in the coordinates is extremely high. Coordinate calculation
is the first step of any OCR interpretation, so if you get a decent OCR
interpretation, the coordinate calculation must have worked correctly.
Obviously, you'll find wrong coordinates wherever you find a wrong OCR
interpretation.
- I am asking how you can validate the correctness of a given word for a
given person: a person (e.g.) creates an account on WS, a Captcha is asked
with a word, how do you know if his/her answer is correct? I agree this
step disappears if you ask a pool of volunteers to answer different
captcha-words, but in this case it comes down to the classical check by
Wikisourcians in a specialized way to treat particular cases
There are different strategies, all based on a complete automation of user
# classical: submit two words, one known as control, the other unknown.
Exact interpretation of known word validates the interpretation of the
# alternative: ask for more than one interpretation of the unknown word
from different users/sessions/days. Validate the interpretation when they
match.
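To fix ideas, the two strategies above can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names, the whitespace-only normalization, and the min_votes threshold are hypothetical and not part of any existing tool.

```python
from collections import Counter

def accept_by_control(control_truth, control_answer, unknown_answer):
    """Classical strategy: trust the unknown word only if the control
    word was typed exactly (ignoring surrounding whitespace)."""
    if control_answer.strip() == control_truth.strip():
        return unknown_answer.strip()
    return None

def accept_by_agreement(answers, min_votes=2):
    """Alternative strategy: accept a reading once at least min_votes
    independent answers give it; otherwise keep waiting (None).
    On a tie, the reading that was submitted first wins."""
    counts = Counter(a.strip() for a in answers if a.strip())
    if not counts:
        return None
    reading, votes = counts.most_common(1)[0]
    return reading if votes >= min_votes else None

# Hypothetical answers collected from different users/sessions.
print(accept_by_control("maison", "maison", "jardin"))
print(accept_by_agreement(["jardin", "jardim", "jardin"]))
```

Both functions return None when the answer cannot be validated yet, so the word simply stays in the pool and is served again.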
- you give the example of a ^ in a word, but how do you select the
OCR mistakes? although this is not really an issue since you can already
make a list of common mistakes and it will be sufficient at first. I
know French Wikisourcians (at least, probably others also) already make a
list of frequent mistakes ( II->Il, 1l->Il, c->e ...), sometimes for a
given book (Trévoux in French of 1771 it seems to me).
FineReader OCR applications use the character ^ for uninterpretable
characters. Other tricks to find "probably wrong" words can be imagined,
such as matching words against a dictionary. Usual "scannos" are better
handled by a Regex Menu Framework clean-up routine (see the clean-up
routine used by [[en:User:Inductiveload]], or the postOCR routine in the
RegexMenuFramework gadget of it.source, built directly from Inductiveload's
clean-up routine). Wikicaptcha would handle unusual OCR mistakes, not the
usual ones.
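As a rough sketch of how "probably wrong" words could be picked out, the Python function below flags any word that contains ^ or is missing from a word list. The function name suspect_words, the punctuation set, and the tiny dictionary are assumptions for illustration; a real run would need a proper word list for the language of the book.

```python
def suspect_words(text, dictionary):
    """Return words that contain '^' or are not in the dictionary."""
    suspects = []
    for token in text.split():
        # Strip surrounding punctuation, but keep '^' inside the word.
        word = token.strip('.,;:!?()"\'')
        if not word:
            continue
        if "^" in word or word.lower() not in dictionary:
            suspects.append(word)
    return suspects

# Hypothetical word list and OCR line.
known = {"the", "quick", "brown", "fox", "jumps"}
print(suspect_words("The qu^ck brown fax jumps.", known))
```

Words flagged this way would be candidates for Wikicaptcha, after the usual scannos have already been removed by the clean-up routine.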
But I know Google had a similar system for their digitization, though I
don't know the exact details. For me there are a lot of details which make
the global idea difficult to carry out (although I would prefer to think
the contrary), but perhaps you have some answers.
Unfortunately, Google doesn't share the OCR mappings of its OCRs; it shares
only the "pure text". This is one of the sound reasons to upload Google
PDFs to the Internet Archive and so get their "derivation", i.e.
publication of a derived djvu file with a text layer from another (usually
good) OCR
PS: I had another idea in a slightly different application field (roughly
speaking, automated validation of texts) but close to this one; I'll write
an email next week about that (already some notes in
I'll take a look with great interest.