Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we talked about this. Jean-Fred and I are a bit skeptical about how such a system could actually be implemented. Here are some questions that I (or we) had, listed in order of importance:
- How many books have such coordinates? I know the books from the BnF partnership have them, because they were in the original OCR files (1057 books), but on WS a lot of books have invalid coordinates (word 0 0 1 1 "") because Wikisourcians didn't know what these figures meant (the DjVu format is quite difficult to understand anyway). I don't know whether standard OCR software can output coordinates for books that will be OCRed in the future.
- How reliable are the coordinates? If you serve half a word, it will be difficult to recognize the entire word.
- I wonder how you can validate a given person's answer for a given word: say a person creates an account on WS and is shown a captcha with a word; how do you know whether his/her answer is correct? I agree this step disappears if you ask a pool of volunteers to answer different captcha words, but in that case it comes down to the usual proofreading by Wikisourcians, specialized to handle particular cases.
- You give the example of a ^ in a word, but how do you select the OCR mistakes? This is not really an issue, though, since you can already make a list of common mistakes, and that will be sufficient at first. I know French Wikisourcians (at least, and probably others too) already keep lists of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a given book (the 1771 Trévoux in French, I think).
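To make the first and last questions concrete, here is a minimal Python sketch of two filters: one that spots the placeholder box (0 0 1 1) and one that proposes corrections from a list of frequent mistakes. The function names and the SUSPECT_SUBSTRINGS table are hypothetical; the table is only seeded with the examples quoted above and is not an actual Wikisource list.

```python
# Hypothetical confusion table, seeded with the examples from the list above.
# Only multi-character confusions are used: a single-letter confusion such
# as c->e would match almost every word and needs a dictionary check instead.
SUSPECT_SUBSTRINGS = {"II": "Il", "1l": "Il"}


def has_placeholder_box(xmin, ymin, xmax, ymax):
    """True for the dummy box (0 0 1 1) found in hand-made text layers."""
    return (xmin, ymin, xmax, ymax) == (0, 0, 1, 1)


def suggestions(word):
    """Return one candidate correction per confusion found in the word."""
    return [
        word.replace(bad, good)
        for bad, good in SUSPECT_SUBSTRINGS.items()
        if bad in word
    ]


print(has_placeholder_box(0, 0, 1, 1))  # True
print(suggestions("II"))                # ['Il']
print(suggestions("Persians"))          # []
```

A word with a placeholder box cannot be cut out of the page image, so it would have to be skipped; a word with suggestions could be shown to a human together with its image.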
I know Google had a similar system for their digitization, but I don't know the details. To me there are a lot of details that make the overall idea difficult to carry out (although I would prefer to think otherwise), but perhaps you have some answers.
Sébastien
PS: I had another idea in a slightly different but closely related field (roughly speaking, automated validation of texts). I will write an email about it next week (there are already some notes at http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni zanni.andrea84@gmail.com wrote:
Dear wikisourcers, while exploring the DjVu text layer, the it.source community found some interesting features that are worth sharing (SPOILER ALERT: Wikisource reCAPTCHA). (I put the technicalities in the footnotes; please look at them if you're interested.)
We discovered that when the text layer is extracted with the DjVuLibre djvused.exe tool [1], a text file is obtained that contains the words and their absolute coordinates within the page image.
Here are some example rows of such a txt file from a running test:
(line 402 2686 2424 2757 (word 402 2699 576 2756 "State.") (word 679 2698 892 2757 "Effects") (word 919 2698 991 2756 "of") (word 1007 2697 1467 2755 "Domestication") (word 1493 2698 1607 2755 "and") (word 1637 2697 1910 2757 "Climate.") (word 2000 2698 2132 2756 "The") (word 2155 2686 2424 2754 "Persians^"))
As you can see, the last word contains a ^ character, which marks a doubtful character that the OCR software could not recognize.
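As an illustration of how such rows can be filtered, here is a minimal Python sketch that pulls the doubtful words and their boxes out of the extracted text. It is not the it.source test script: the regular expression and the function name are assumptions, and it assumes the word text contains no escaped double quotes.

```python
import re

# Matches one word entry of the form: (word xmin ymin xmax ymax "text")
# Assumption: the text itself contains no escaped double quotes.
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "([^"]*)"\)')


def doubtful_words(txt, marker="^"):
    """Return (xmin, ymin, xmax, ymax, text) for each word containing marker."""
    found = []
    for m in WORD_RE.finditer(txt):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        text = m.group(5)
        if marker in text:
            found.append((xmin, ymin, xmax, ymax, text))
    return found


sample = (
    '(line 402 2686 2424 2757 (word 402 2699 576 2756 "State.") '
    '(word 2155 2686 2424 2754 "Persians^"))'
)
print(doubtful_words(sample))  # [(2155, 2686, 2424, 2754, 'Persians^')]
```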
What's really interesting is that a Python script can select these words using the ^ character and automatically produce a file with the image of the word, since all the parameters needed for a ddjvu.exe call can be obtained (please consider that this code comes from a rough, but *running* test script [2]).
So, in our it.source test script, a tiff image has been produced automatically, containing exactly the image of the doubtful OCR output "Persians^". Its name is built as name-of-djvu-file + page number + coordinates within the page, which is all that is needed to link the image unambiguously to the specific word on a specific page of a djvu file.
The image has been uploaded into Commons as http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.t...
As you can easily imagine, this could be the core of a "wikicaptcha" project (as John Vandenberg called it), enabling us to produce our own Wikisource reCaptcha.
A djvu file could be uploaded to a server (into an "incubator"); a database of doubtful word images could be built; images could be presented to wiki users (either as a voluntary task or as a formal reCAPTCHA to confirm edits by unlogged contributors); the resulting human interpretations could be validated somehow (e.g. by n repetitions of matching, different interpretations) and then used to edit the text layer of the djvu file. Finally the edited djvu file could be uploaded to Commons for formal source management.
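The validation step could, for instance, accept a reading once n users have typed the same thing. The following Python sketch shows that rule under stated assumptions: the function name and the default n=3 are hypothetical, and nothing like it exists in the test scripts yet.

```python
from collections import Counter


def accepted_reading(answers, n=3):
    """Return the most frequent answer if it was given at least n times, else None."""
    counts = Counter(a.strip() for a in answers)
    if not counts:
        return None
    reading, votes = counts.most_common(1)[0]
    return reading if votes >= n else None


answers = ["Persians,", "Persians,", "Persians.", "Persians,"]
print(accepted_reading(answers))          # Persians,
print(accepted_reading(["Persians,"]))    # None
```

Until a reading reaches n matching answers, the word image would simply stay in the pool and be shown to further users.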
Please contact us if you would like a copy of the running test scripts. There is also a shared Dropbox folder with the complete environment where we are testing the scripts.
Opinions, feedback or thoughts are more than welcome.
Aubrey
Admin it.source
WMI Board
[1]
import os
# Select one page of the djvu file and dump its text layer to a text file.
command = 'djvused name-of.file.djvu -e "select page-number; print-txt" > text-file.txt'
os.system(command)
[2]
if "^" in word:
    coord = key.split()  # coord[1:5] hold xmin, ymin, xmax, ymax
    w = str(int(coord[3]) - int(coord[1]))  # width of the word box
    h = str(int(coord[4]) - int(coord[2]))  # height of the word box
    x = coord[1]
    y = coord[2]
    # Image name: djvu file name + page number + coordinates
    filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
    segment = "-segment " + w + "x" + h + "+" + x + "+" + y
    command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
    print(command)
    os.system(command)