Hi Seb, I am answering personally since I'm the person most involved in DjVu exploration in the it.source group.
Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we spoke about that. Jean-Fred
and I are a bit skeptical about how such a system could effectively be
implemented. Here are some questions that I (or we) were asking, listed
in order of importance:
- how many books have such coordinates? I know the BnF-partnership books
have them because they were originally in the OCR files (1057 books),
but on WS a lot of books have invalid coordinates (word 0 0 1 1 "")
because Wikisourcians didn't know what these figures meant
(the DjVu format is quite difficult to understand anyway); I don't know
if standard OCR software can output the coordinates for books that will
be OCRed in the future
- how reliable are the coordinates? If you serve half a word, it
will be difficult to recognize the entire word
- I wonder how you can validate the answer given by a single person:
a person (e.g.) creates an account on WS, a Captcha is shown with a
word; how do you know if his/her answer is correct? I agree this
step disappears if you ask a pool of volunteers to answer different
captcha words, but in that case it comes down to the usual checking by
Wikisourcians, specialized to treat particular cases
- you give the example of a ^ in a word, but how do you select the
OCR mistakes? This is not really an issue, though, since you can already
make a list of common mistakes and that will be sufficient at first. I
know French Wikisourcians (at least, probably others too) already keep
lists of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a
given book (the Trévoux of 1771 in French, I believe).
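
To make the first and the last questions more concrete, here is a minimal
Python sketch, not an existing tool. It assumes the hidden text has been
dumped in the s-expression form shown above, (word xmin ymin xmax ymax
"text"). The names WORD_ZONE, count_word_zones, FREQUENT_MISTAKES and
suspect_words are hypothetical, treating any box at most 1 unit wide and
high as a placeholder is my own assumption, and the pattern list only
reuses the II->Il and 1l->Il examples.

```python
import re

# One word zone of the hidden text: (word xmin ymin xmax ymax "text").
# The quoted text may contain backslash-escaped characters.
WORD_ZONE = re.compile(
    r'\(word\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+"((?:[^"\\]|\\.)*)"\)'
)


def count_word_zones(hidden_text):
    """Return (total, placeholders) for the word zones found in a dump.

    Assumption: a box at most 1 unit wide and 1 unit high, such as
    (word 0 0 1 1 ""), carries no usable position.
    """
    total = 0
    placeholders = 0
    for match in WORD_ZONE.finditer(hidden_text):
        x0, y0, x1, y1 = (int(g) for g in match.groups()[:4])
        total += 1
        if x1 - x0 <= 1 and y1 - y0 <= 1:
            placeholders += 1
    return total, placeholders


# Hypothetical list built from the frequent mistakes quoted above.
# c->e is left out: it cannot be detected without a dictionary.
FREQUENT_MISTAKES = [
    (re.compile(r"^II"), "Il"),
    (re.compile(r"^1l"), "Il"),
]


def suspect_words(words):
    """Return (word, suggestion) pairs worth showing to a human.

    The suggestion is None when only a stray ^ was spotted.
    """
    suspects = []
    for word in words:
        if "^" in word:
            suspects.append((word, None))
            continue
        for pattern, replacement in FREQUENT_MISTAKES:
            if pattern.search(word):
                fixed = pattern.sub(replacement, word, count=1)
                suspects.append((word, fixed))
                break
    return suspects
```

The ratio of placeholders to total would give a rough answer to "how many
books have usable coordinates". The suspects are only candidates: "II" is
also a valid Roman numeral, so a human still has to decide.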
That said, I know Google had a similar system for their digitization, but
I don't know the exact details. For me there are a lot of details which
make the overall idea difficult to carry out (although I would prefer to
think the contrary), but perhaps you have some answers.
Sébastien
PS: I had another idea in a slightly different field of application
(roughly speaking, automated validation of texts) but close to this one;
I will write an email about it next week (there are already some notes at
<http://wikisource.org/wiki/User:Seb35/Reverse_OCR>).