Dear wikisourcers,<br>while exploring the DjVu text layer, the it.source community found some interesting features that are worth sharing (SPOILER ALERT: Wikisource reCAPTCHA). <br>(I added the technicalities in the footnotes; please look at them if you're interested.)<br>
<br>We discovered that when the text layer is extracted with the DjVuLibre djvused.exe tool [1],<br>a text file is obtained that contains the words and their absolute coordinates within the page image.<br><br>Here are some example rows of such a txt file from a running test:<br>
<br> (line 402 2686 2424 2757<br> (word 402 2699 576 2756 "State.")<br> (word 679 2698 892 2757 "Effects")<br> (word 919 2698 991 2756 "of")<br> (word 1007 2697 1467 2755 "Domestication")<br>
 (word 1493 2698 1607 2755 "and")<br> (word 1637 2697 1910 2757 "Climate.")<br> (word 2000 2698 2132 2756 "The")<br> (word 2155 2686 2424 2754 "Persians^"))<br><br>As you can see, the last word contains a ^ character, which marks a doubtful character that the OCR software could not recognize.<br>
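<br>As an illustration only (this is not our test script), picking the doubtful words out of such a dump could be sketched as follows. The names WORD_RE and doubtful_words are hypothetical, and the sketch assumes every word entry has the form shown in the rows above; escape sequences inside the quoted text are matched but not decoded.<br>

```python
import re

# One word entry of the dump: (word xmin ymin xmax ymax "text")
# Backslash escapes inside the quoted text are matched but left undecoded.
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"')

def doubtful_words(dump, marker="^"):
    """Return (xmin, ymin, xmax, ymax, text) for each word containing marker."""
    found = []
    for m in WORD_RE.finditer(dump):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        text = m.group(5)
        if marker in text:
            found.append((xmin, ymin, xmax, ymax, text))
    return found

sample = '''(line 402 2686 2424 2757
 (word 402 2699 576 2756 "State.")
 (word 2000 2698 2132 2756 "The")
 (word 2155 2686 2424 2754 "Persians^"))'''

print(doubtful_words(sample))
# [(2155, 2686, 2424, 2754, 'Persians^')]
```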
<br>What's really interesting is that a Python script can select these words by looking for the ^ character and automatically produce an image file of each word, since all the parameters needed for a ddjvu.exe call can be derived from the coordinates (please consider that this code comes from a rough, but *running* test script [2]).<br>
<br>So, in our it.source test script, a TIFF image has been automatically produced, containing exactly the image of the doubtful OCR output "Persians^". <br>Its name is built as name-of-djvu-file + page number + coordinates within the page, which is all that is needed to link the image unambiguously to the specific word on a specific page of a DjVu file. <br>
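<br>To show why such a name is enough, here is a minimal sketch of the round trip between a word and its image name. The exact layout used here (base-page-xmin_ymin_xmax_ymax.tiff) and the helper names image_name, parse_image_name and segment are assumptions made for the example and may differ from what the test script produces; segment builds the WxH+X+Y geometry string that footnote [2] passes to ddjvu.<br>

```python
import os

def image_name(djvu_path, page, box):
    """Build base-page-xmin_ymin_xmax_ymax.tiff (assumed layout)."""
    base, _ext = os.path.splitext(djvu_path)
    return "%s-%d-%s.tiff" % (base, page, "_".join(str(v) for v in box))

def parse_image_name(name):
    """Recover (djvu file, page, box) from a name built by image_name."""
    stem, _ext = os.path.splitext(name)
    # Split from the right, so hyphens inside the book name are preserved.
    base, page, coords = stem.rsplit("-", 2)
    return base + ".djvu", int(page), tuple(int(v) for v in coords.split("_"))

def segment(box):
    """Geometry string WxH+X+Y for a box given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return "%dx%d+%d+%d" % (xmax - xmin, ymax - ymin, xmin, ymin)

box = (2155, 2686, 2424, 2754)
name = image_name("my-book.djvu", 12, box)
print(name)                    # my-book-12-2155_2686_2424_2754.tiff
print(parse_image_name(name))  # ('my-book.djvu', 12, (2155, 2686, 2424, 2754))
print(segment(box))            # 269x68+2155+2686
```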
<br>The image has been uploaded into Commons as <a href="http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff">http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff</a><br>
<br>As you can easily imagine, this could be the core of a "wikicaptcha" project (as John Vandenberg called it), enabling us to produce our own Wikisource reCAPTCHA. <br><br>A DjVu file could be uploaded to a server (into an "incubator"); a database of doubtful word images could be built; the images could be presented to wiki users (either as a voluntary task or as a formal reCAPTCHA to confirm edits by unlogged contributors); the resulting human interpretations could be validated somehow (e.g. by requiring n matching interpretations from different users) and then used to edit the text layer of the DjVu file. Finally, the edited DjVu file could be uploaded to Commons for formal source management.<br>
<br>Please contact us if you would like a copy of the running test scripts. There is also a shared Dropbox folder with the complete environment in which we are testing the scripts.<br><br>Opinions, feedback and thoughts are more than welcome.<br>
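<br>The validation step is still only an idea, but one possible reading of "n matching interpretations" could be sketched like this. The function name consensus and the default threshold of 3 are hypothetical choices for the example, not something we have implemented.<br>

```python
from collections import Counter

def consensus(answers, min_matches=3):
    """Return the agreed reading of a word image, or None if there is none.

    A reading is accepted only if at least min_matches answers are identical
    and no other reading has the same number of votes.
    """
    counts = Counter(a.strip() for a in answers if a.strip())
    ranked = counts.most_common(2)
    if not ranked:
        return None
    best, votes = ranked[0]
    if votes < min_matches:
        return None
    if len(ranked) > 1 and ranked[1][1] == votes:
        return None  # tie between two readings: keep asking users
    return best

print(consensus(["Persians,", "Persians,", "Persian", "Persians,"]))  # Persians,
print(consensus(["Persians,", "Persian"]))                            # None
```

Only a word with an accepted reading would then be written back into the text layer; the others would stay in the queue.<br>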
<br><br>Aubrey <br>Admin it.source<br>WMI Board<br><br><br> [1] import os<br> # dump the text layer of one page, with word coordinates, into a text file<br> command='djvused name-of.file.djvu -e "select page-number; print-txt" >text-file.txt'<br> os.system(command)<br><br> [2] if "^" in word:<br>
<br>     # coord[1] to coord[4] are xmin, ymin, xmax, ymax of the word<br>     coord=key.split()<br>     #print coord<br>     w=str(int(coord[3])-int(coord[1]))<br>     h=str(int(coord[4])-int(coord[2]))<br>
     x=coord[1]<br>     y=coord[2]<br>     # image name = djvu name + page + coordinates<br>     filetiff=fileDjvu.replace(".djvu","")+"-"+pag+"-"+"_".join(coord)+".tiff"<br>
     segment="-segment WxH+X+Y".replace("W",w).replace("H",h).replace("X",x).replace("Y",y)<br>     command="ddjvu "+fileDjvu+" -page="+pag+" -format=tiff "+segment+" "+filetiff<br>
     print(command)<br>     os.system(command)<br>