[Wikisource-l] Wikicaptcha

Andrea Zanni zanni.andrea84 at gmail.com
Sat Feb 5 14:14:57 UTC 2011


Dear wikisourcers,
while exploring the DjVu text layer, the it.source community found an
interesting feature that is worth sharing (SPOILER ALERT: Wikisource
reCAPTCHA). (I put the technical details in the footnotes; please look at
them if you're interested.)

We discovered that when the text layer is extracted with the DjVuLibre
djvused.exe tool [1], a text file is obtained that contains the words and
their absolute coordinates within the image of the page.

Here are some example rows of such a text file from a running test:

    (line 402 2686 2424 2757
    (word 402 2699 576 2756 "State.")
    (word 679 2698 892 2757 "Effects")
    (word 919 2698 991 2756 "of")
    (word 1007 2697 1467 2755 "Domestication")
    (word 1493 2698 1607 2755 "and")
    (word 1637 2697 1910 2757 "Climate.")
    (word 2000 2698 2132 2756 "The")
    (word 2155 2686 2424 2754 "Persians^"))

As you can see, the last word contains a ^ character, which marks a
doubtful character that the OCR software could not recognize.
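
As a rough illustration of how such rows could be filtered, here is a minimal
sketch (not the it.source test script). It assumes every word row has the
shape shown above, with the coordinates in the order xmin ymin xmax ymax; the
names WORD_RE and doubtful_words are made up for this example.

```python
import re

# Matches rows such as: (word 2155 2686 2424 2754 "Persians^")
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"')


def doubtful_words(text, marker="^"):
    """Return (xmin, ymin, xmax, ymax, word) for each word containing marker."""
    found = []
    for match in WORD_RE.finditer(text):
        xmin, ymin, xmax, ymax = (int(g) for g in match.groups()[:4])
        word = match.group(5)
        if marker in word:
            found.append((xmin, ymin, xmax, ymax, word))
    return found


# A few of the rows quoted above.
sample = '''(line 402 2686 2424 2757
(word 402 2699 576 2756 "State.")
(word 2000 2698 2132 2756 "The")
(word 2155 2686 2424 2754 "Persians^"))'''

print(doubtful_words(sample))
# → [(2155, 2686, 2424, 2754, 'Persians^')]
```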

What's really interesting is that a Python script can select these words by
looking for the ^ character and automatically produce a file with the image
of the word, since all the parameters needed for a ddjvu.exe call can be
obtained (please note that this code comes from a rough, but *running*, test
script [2]).

So, in our it.source test script, a TIFF image was automatically produced,
containing exactly the image of the doubtful OCR output "Persians^". Its
name is built as name-of-djvu-file + page number + coordinates within the
page, which is all that is needed to link the image unambiguously to the
specific word on a specific page of a DjVu file.
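
To make the arithmetic concrete, here is a small sketch of how the file name
and the ddjvu arguments could be derived from one word's coordinates. It is an
illustration only: the function name crop_command, the file name "book.djvu"
and the page number 12 are made up, and the naming pattern is a simplified
variant of the one used in footnote [2].

```python
import os


def crop_command(djvu_file, page, xmin, ymin, xmax, ymax):
    """Build the output name and the ddjvu argument list for one word box."""
    width = xmax - xmin
    height = ymax - ymin
    stem = os.path.splitext(djvu_file)[0]
    # The name encodes file, page and coordinates, so the image can be
    # traced back to the exact word it was cut from.
    tiff = "%s-%s-%d_%d_%d_%d.tiff" % (stem, page, xmin, ymin, xmax, ymax)
    # ddjvu expects the segment geometry as WxH+X+Y.
    segment = "%dx%d+%d+%d" % (width, height, xmin, ymin)
    args = ["ddjvu", "-format=tiff", "-page=%s" % page,
            "-segment=%s" % segment, djvu_file, tiff]
    return tiff, args


# Hypothetical file and page; the coordinates are those of "Persians^".
tiff_name, args = crop_command("book.djvu", 12, 2155, 2686, 2424, 2754)
print(tiff_name)  # → book-12-2155_2686_2424_2754.tiff
print(args[3])    # → -segment=269x68+2155+2686
```

The argument list can then be passed to subprocess.run(args, check=True),
which avoids quoting problems with file names that contain spaces.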

The image has been uploaded to Commons as
http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff

As you can easily imagine, this could be the core of a "wikicaptcha" project
(as John Vandenberg called it), enabling us to produce our own Wikisource
reCAPTCHA.

A DjVu file could be uploaded to a server (into an "incubator"); a
database of doubtful word images could be built; the images could be
presented to wiki users (either as a voluntary task or as a formal reCAPTCHA
to confirm edits by anonymous contributors); the resulting human
interpretations could be validated somehow (e.g. by requiring n matching
interpretations from different users) and then used to edit the text layer
of the DjVu file. Finally, the edited DjVu file could be uploaded to Commons
for formal source management.
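
The validation step is still only an idea, but one possible reading of
"n matching interpretations" can be sketched as follows. The function name
consensus, the threshold min_matches and the sample answers are all
hypothetical; nothing like this exists in our test scripts yet.

```python
from collections import Counter


def consensus(answers, min_matches=3):
    """Return the reading given at least min_matches times, else None."""
    if not answers:
        return None
    counts = Counter(answer.strip() for answer in answers)
    best, times = counts.most_common(1)[0]
    return best if times >= min_matches else None


# Hypothetical user transcriptions of one doubtful word image.
print(consensus(["Persians,", "Persians,", "Persians.", "Persians,"]))
# → Persians,
print(consensus(["Persians,", "Persians."]))
# → None
```

A word would only be written back into the text layer once consensus()
returns a value; until then its image stays in the queue.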

Please contact us if you would like a copy of the running test scripts.
There is also a shared Dropbox folder with the complete environment in which
we are testing the scripts.

Opinions, feedback and thoughts are more than welcome.


Aubrey
Admin it.source
WMI Board


    [1] import os

        # Select one page and dump its text layer (words plus coordinates).
        command = 'djvused name-of.file.djvu -e "select page-number; print-txt" > text-file.txt'
        os.system(command)

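    [2] import os

        # Fragment of the test script: word, key, fileDjvu and pag are set
        # earlier. key presumably holds the row prefix, e.g.
        # "word 2155 2686 2424 2754", so coord[1:5] are xmin ymin xmax ymax.
        if "^" in word:
            coord = key.split()
            w = str(int(coord[3]) - int(coord[1]))  # width = xmax - xmin
            h = str(int(coord[4]) - int(coord[2]))  # height = ymax - ymin
            x = coord[1]
            y = coord[2]
            filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
            segment = "-segment " + w + "x" + h + "+" + x + "+" + y
            command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
            print(command)
            os.system(command)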