Dear wikisourcers,
while exploring the djvu text layer, the it.source community found some interesting features that are worth sharing (SPOILER ALERT: Wikisource reCAPTCHA).
(I have put the technical details in the footnotes; please look at them if you're interested.)
We discovered that when the text layer is extracted with the DjVuLibre djvused.exe tool [1],
the result is a text file containing the words and their absolute coordinates within the image of the page.
Here are some example rows of such a text file, from a running test:
(line 402 2686 2424 2757
(word 402 2699 576 2756 "State.")
(word 679 2698 892 2757 "Effects")
(word 919 2698 991 2756 "of")
(word 1007 2697 1467 2755 "Domestication")
(word 1493 2698 1607 2755 "and")
(word 1637 2697 1910 2757 "Climate.")
(word 2000 2698 2132 2756 "The")
(word 2155 2686 2424 2754 "Persians^"))
As you can see, the last word contains a ^ character, which marks a doubtful character that the OCR software could not recognize.
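To make the idea concrete, here is a minimal sketch of how such rows could be scanned for doubtful words. It is not the it.source test script: the regular expression, the function name doubtful_words and the shortened sample are assumptions made for illustration, and only the "(word xmin ymin xmax ymax "text")" layout shown above is taken from the real output.

```python
import re

# Matches one word entry as shown above: (word xmin ymin xmax ymax "text")
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"\)')

def doubtful_words(text_layer, marker="^"):
    """Return (text, (xmin, ymin, xmax, ymax)) for every word containing the marker."""
    hits = []
    for m in WORD_RE.finditer(text_layer):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        text = m.group(5)
        if marker in text:
            hits.append((text, (xmin, ymin, xmax, ymax)))
    return hits

# Shortened copy of the example rows above
sample = '''(line 402 2686 2424 2757
 (word 2000 2698 2132 2756 "The")
 (word 2155 2686 2424 2754 "Persians^"))'''

print(doubtful_words(sample))
```

With the sample above, only the "Persians^" entry is returned, together with its four coordinates.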
What's really interesting is that a Python script can select these words by looking for the ^ character and automatically produce a file with the image of each word, since all the parameters needed for a ddjvu.exe call can be obtained (please bear in mind that this code comes from a rough, but *running*, test script [2]).
So, in our it.source test script, a TIFF image was produced automatically, containing exactly the image of the doubtful OCR output "Persians^".
Its name is built as name-of-djvu-file + page number + coordinates within the page, which is all that is needed to link the image unambiguously to a specific word on a specific page of a djvu file.
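The naming scheme can be sketched as a round trip: build the name from file, page and coordinates, then recover the same three pieces from the name alone. The helper names image_name and parse_image_name, the exact separators and the file name "book.djvu" are assumptions for illustration; the test script may lay the name out slightly differently.

```python
def image_name(djvu_file, page, coords):
    """Build '<djvu name>-<page>-<xmin>_<ymin>_<xmax>_<ymax>.tiff'."""
    stem = djvu_file[:-len(".djvu")] if djvu_file.endswith(".djvu") else djvu_file
    return "{}-{}-{}.tiff".format(stem, page, "_".join(str(c) for c in coords))

def parse_image_name(name):
    """Recover (djvu_file, page, coords) from a name built by image_name."""
    body = name[:-len(".tiff")]
    # Split from the right so that hyphens inside the djvu file name survive
    stem, page, coord_part = body.rsplit("-", 2)
    coords = tuple(int(c) for c in coord_part.split("_"))
    return stem + ".djvu", int(page), coords

# "book.djvu" and page 12 are hypothetical values
name = image_name("book.djvu", 12, (2155, 2686, 2424, 2754))
print(name)
print(parse_image_name(name))
```

Splitting from the right is a deliberate choice here, because djvu file names often contain hyphens while the page number and the coordinates never do.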
The image has been uploaded into Commons as http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff
As you can easily imagine, this could be the core of a "wikicaptcha" project (as John Vandenberg called it), enabling us to produce our own Wikisource reCAPTCHA.
A djvu file could be uploaded to a server (to an "incubator"); a database of doubtful word images could be built; the images could be presented to wiki users (either as a voluntary task or as a formal reCAPTCHA to confirm edits by non-logged-in contributors); the resulting human interpretations could be validated somehow (e.g. by requiring n matching interpretations from different users) and then used to edit the text layer of the djvu file. Finally, the edited djvu file could be uploaded to Commons for formal source management.
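One possible reading of the validation step is a simple agreement threshold: a transcription is accepted once n users have typed the same thing. The sketch below only illustrates that idea; the function name consensus, the default n=3 and the sample answers are assumptions, and nothing like this exists yet in the test scripts.

```python
from collections import Counter

def consensus(answers, n=3):
    """Return the transcription given by at least n users, or None if there is none yet."""
    if not answers:
        return None
    # Ignore surrounding whitespace so that "Persians " and "Persians" agree
    counts = Counter(a.strip() for a in answers)
    text, votes = counts.most_common(1)[0]
    return text if votes >= n else None

# Hypothetical answers collected for one word image
print(consensus(["Persians,", "Persians,", "Persian", "Persians,"]))
print(consensus(["Persians,", "Persian"]))
```

Until the threshold is reached the function returns None, so the word image would simply stay in the queue and be shown to more users.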
Please contact us if you would like a copy of the running test scripts. There is also a shared Dropbox folder with the complete environment in which we are testing the scripts.
Opinions, feedback and thoughts are more than welcome.
Aubrey
Admin it.source
WMI Board
[1] import os
    command = 'djvused name-of.file.djvu -e "select page-number; print-txt" > text-file.txt'
    os.system(command)
[2] # word, key, fileDjvu and pag are defined elsewhere in the test script
    if "^" in word:
        coord = key.split()
        # coord[1]..coord[4] hold xmin, ymin, xmax, ymax; ddjvu needs width, height and origin
        w = str(int(coord[3]) - int(coord[1]))
        h = str(int(coord[4]) - int(coord[2]))
        x = coord[1]
        y = coord[2]
        filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
        segment = "-segment {}x{}+{}+{}".format(w, h, x, y)
        command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
        print(command)
        os.system(command)