Re: [Wikisource-l] Wikicaptcha

20 Feb 2011

Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we talked about this. Jean-Fred
and I are a bit skeptical about whether such a system can be implemented
effectively. Here are some questions that I (or we) had, listed in order
of importance:
- How many books have such coordinates? I know the BnF partnership books
have them because they were in the original OCR files (1057 books),
but on WS a lot of books have invalid coordinates (word 0 0 1 1 "")
because Wikisourcians didn't know what these figures meant
(the DjVu format is quite difficult to understand anyway). I don't know
whether standard OCR software can output the coordinates for books that
will be OCRed in the future.
- How reliable are the coordinates? If you serve half a word, it
will be difficult to recognize the entire word.
- I wonder how you can validate the correctness of a given word for a
given person: a person (e.g.) creates an account on WS and is shown a
Captcha with a word; how do you know whether his/her answer is correct? I
agree this step disappears if you ask a pool of volunteers to answer
different captcha words, but in that case it comes down to the usual
checking by Wikisourcians, specialized to handle particular cases.
- You give the example of a ^ in a word, but how do you select the
OCR mistakes? This is not really an issue, though, since you can already
make a list of common mistakes and that will be sufficient at first. I
know French Wikisourcians (at least, probably others too) already keep
lists of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a
given book (the French Trévoux of 1771, I believe).
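The first question could be answered roughly by counting, for each book, how many words in a djvused text dump carry placeholder coordinates such as (word 0 0 1 1 ""). A minimal sketch, assuming the print-txt format quoted in Andrea's message below; the function names and the "more than one pixel wide and high" heuristic are assumptions made for illustration:

```python
import re

# Matches one word entry of a djvused dump: (word x1 y1 x2 y2 "text")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"'
)

def has_valid_box(x1, y1, x2, y2, text):
    """Heuristic (assumption): usable words have text and a box larger than 1x1."""
    return bool(text) and (x2 - x1) > 1 and (y2 - y1) > 1

def coordinate_stats(dump):
    """Return (valid, total) word counts for a djvused text dump."""
    valid = total = 0
    for m in WORD_RE.finditer(dump):
        x1, y1, x2, y2 = (int(g) for g in m.groups()[:4])
        total += 1
        if has_valid_box(x1, y1, x2, y2, m.group(5)):
            valid += 1
    return valid, total

sample = '(line 0 0 1 1\n (word 0 0 1 1 "")\n (word 402 2699 576 2756 "State."))'
print(coordinate_stats(sample))  # (1, 2)
```

A book where the valid count is far below the total would probably not be usable for word images without a new OCR pass.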
That said, I know Google had a similar system for their digitization, but
I don't know the exact details. For me there are a lot of details that
make the overall idea difficult to carry out (although I would prefer to
think the contrary), but perhaps you have some answers.
Sébastien
PS: I had another idea in a slightly different field of application
(roughly speaking, automated validation of texts) but close to this one.
I will write an email about it next week (some notes are already at
http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni zanni.andrea84@gmail.com  
wrote:
...
Dear wikisourcers,
while exploring the djvu text layer, the it.source community found some
interesting features that are worth sharing (SPOILER ALERT: Wikisource
reCAPTCHA).
(I added the technicalities in the footnotes, please look at them if you're
interested.)
We discovered that when the text layer is extracted with the DjVuLibre
djvused.exe tool [1],
a text file is obtained, containing words and their absolute coordinates
within the image of the page.
Here are some example rows of such a txt file from a running test:
(line 402 2686 2424 2757
(word 402 2699 576 2756 "State.")
(word 679 2698 892 2757 "Effects")
(word 919 2698 991 2756 "of")
(word 1007 2697 1467 2755 "Domestication")
(word 1493 2698 1607 2755 "and")
(word 1637 2697 1910 2757 "Climate.")
(word 2000 2698 2132 2756 "The")
(word 2155 2686 2424 2754 "Persians^"))

As you can see, the last word has a ^ character inside, which indicates a
doubtful character that the OCR software could not recognize.
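The selection of doubtful words can be sketched as follows. This is only an illustration, assuming the dump format shown in the rows above; the function name and the regular expression are not taken from the it.source test script:

```python
import re

# Matches one word entry of a djvused dump: (word x1 y1 x2 y2 "text")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"'
)

def doubtful_words(dump, marker="^"):
    """Return [(text, (x1, y1, x2, y2)), ...] for words containing the marker."""
    found = []
    for m in WORD_RE.finditer(dump):
        text = m.group(5)
        if marker in text:
            box = tuple(int(g) for g in m.groups()[:4])
            found.append((text, box))
    return found

sample = '''(line 402 2686 2424 2757
 (word 2000 2698 2132 2756 "The")
 (word 2155 2686 2424 2754 "Persians^"))'''
print(doubtful_words(sample))  # [('Persians^', (2155, 2686, 2424, 2754))]
```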
What's really interesting is that a python script can select these words
using the ^ character and automatically produce a file with the image of
the word, since all the parameters needed for a ddjvu.exe call can be
obtained (please consider that this code comes from a rough, but *running*
test script [2]).
So, in our it.source test script, a tiff image has been automatically
produced, containing exactly the image of the doubtful OCR output
"Persians^".
Its name is built as name-of-djvu-file+page number+coordinates within the
page, which is all that is needed to link the image unambiguously to the
specific word on a specific page of a djvu file.
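To illustrate why such a name is enough to find the word again, here is a small round-trip sketch. The exact layout is an assumption: it joins only the four coordinates with "_" and the three parts with "-", whereas the test script in footnote [2] joins every token of its key; both helper names are hypothetical:

```python
def image_name(djvu_file, page, box):
    """Build '<stem>-<page>-<x1>_<y1>_<x2>_<y2>.tiff' (assumed layout)."""
    stem = djvu_file[:-5] if djvu_file.endswith(".djvu") else djvu_file
    return "{}-{}-{}.tiff".format(stem, page, "_".join(str(v) for v in box))

def parse_image_name(name):
    """Recover (djvu file, page number, box) from a name built by image_name."""
    # Split from the right so that hyphens inside the book name are preserved.
    stem, page, coords = name[:-len(".tiff")].rsplit("-", 2)
    return stem + ".djvu", int(page), tuple(int(v) for v in coords.split("_"))

name = image_name("my-book.djvu", 12, (2155, 2686, 2424, 2754))
print(name)
print(parse_image_name(name))
```

Splitting from the right works because page numbers and coordinates never contain a hyphen, while the book name may.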
The image has been uploaded into Commons as
http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.t...
As you can easily imagine, this could be the core of a "wikicaptcha"
project (as John Vandenberg called it), enabling us to produce our own
Wikisource reCaptcha.
A djvu file could be uploaded to a server (into an "incubator"); a
database of doubtful word images could be built; images could be presented
to wiki users (either as a voluntary task or as a formal reCAPTCHA to
confirm edits by logged-out contributors); the resulting human
interpretations could be validated somehow (e.g. by n matching, independent
interpretations), then used to edit the text layer of the djvu file.
Finally the edited djvu file could be uploaded to Commons for formal source
management.
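The validation step is left open here; one possible reading of "n matching interpretations" is a simple agreement rule. A minimal sketch, where the function name and both thresholds are hypothetical and would need tuning on real answers:

```python
from collections import Counter

def consensus(answers, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is insufficient.

    min_votes: minimum number of non-empty answers required (assumed value).
    min_share: fraction of answers that must agree (assumed value).
    """
    cleaned = [a.strip() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None
    best, count = Counter(cleaned).most_common(1)[0]
    return best if count / len(cleaned) >= min_share else None

print(consensus(["Persians,", "Persians,", "Persians,", "Persians."]))
print(consensus(["Persians,", "Persians."]))
```

With these thresholds the first call accepts "Persians," (three of four answers agree) and the second returns None because there are too few answers.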
Please contact us if you would like a copy of the running test scripts.
There is also a shared Dropbox folder with the complete environment where
we are testing the scripts.
Opinions, feedback or thoughts are more than welcome.
Aubrey
Admin it.source
WMI Board
[1] command='djvused name-of.file.djvu -e "select page-number; print-txt" >text-file.txt'
    os.system(command)
[2] if "^" in word:
        # key: presumably the "word x1 y1 x2 y2" part of the djvused line
        coord = key.split()
        # width and height of the word box (int() instead of eval())
        w = str(int(coord[3]) - int(coord[1]))
        h = str(int(coord[4]) - int(coord[2]))
        x = coord[1]
        y = coord[2]
        filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
        segment = "-segment " + w + "x" + h + "+" + x + "+" + y
        command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
        print(command)
        os.system(command)
