Re: [Wikisource-l] Wikicaptcha

20 Feb 2011

A quick link I just received:

   http://www.digitalkoot.fi (in English also)

It seems there are two Facebook games whose aim is precisely to
correct OCR output.

Sébastien

Sun, 20 Feb 2011 22:16:15 +0100, Seb35 <seb35wikipedia(a)gmail.com> wrote:
...
  Hi Andrea,

 I saw VIGNERON and Jean-Frédéric today and we talked about this.
 Jean-Fred and I are a bit skeptical about how such a system could
 actually be implemented. Here are some questions that I (or we) had,
 listed in order of importance:

 - How many books have such coordinates? I know the books from the BnF
 partnership have them because they were already in the original OCR
 files (1057 books), but on WS a lot of books have invalid coordinates
 (word 0 0 1 1 "") because Wikisourcians didn't know what these figures
 meant (the DjVu format is quite difficult to understand anyway). I don't
 know whether standard OCR software can output the coordinates for books
 that will be OCRed in the future.

 - How reliable are the coordinates? If you serve half a word, it will
 be difficult to recognize the entire word.

 - I wonder how you can validate the correctness of a given word for a
 given person: a person (e.g.) creates an account on WS and is shown a
 Captcha with a word; how do you know whether his/her answer is correct?
 I agree this step disappears if you ask a pool of volunteers to answer
 different captcha words, but in that case it comes down to the usual
 checking done by Wikisourcians, specialized to handle particular cases.

 - You give the example of a ^ in a word, but how do you select the
 OCR mistakes? This is not really an issue, though, since you can already
 make a list of common mistakes and that will be sufficient at first.
 I know French Wikisourcians (at least, probably others too) already
 keep lists of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes
 for a given book (the French Trévoux of 1771, I believe).
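
A minimal sketch of how such a list of frequent mistakes could be used to flag suspicious words. The table below only contains the three examples quoted above; the function names and the lexicon argument are hypothetical, and a real list would be compiled per language or per book.

```python
# Hypothetical confusion table built from the examples in the message
# (II->Il, 1l->Il, c->e); a real list would be much longer.
CONFUSIONS = [("II", "Il"), ("1l", "Il"), ("c", "e")]


def candidate_corrections(word, confusions=CONFUSIONS):
    """Return the variants of word obtained by one substitution at one position."""
    candidates = set()
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidates.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return sorted(candidates)


def suspicious_words(words, lexicon):
    """Map each out-of-lexicon word to its candidate corrections found in the lexicon."""
    report = {}
    for word in words:
        if word in lexicon:
            continue
        fixes = [c for c in candidate_corrections(word) if c in lexicon]
        if fixes:
            report[word] = fixes
    return report
```

Only words that are absent from the lexicon and have at least one in-lexicon correction are reported, so a legitimate word containing "c" is left alone.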

 But I know Google had a similar system for their digitization, although
 I don't know the exact details. To me there are a lot of details that
 make the overall idea difficult to carry out (although I would prefer
 to think otherwise), but perhaps you have some answers.

 Sébastien

 PS: I had another idea in a slightly different field of application
 (roughly speaking, automated validation of texts) but close to this one.
 I will write an email about it next week (some notes are already at
 <http://wikisource.org/wiki/User:Seb35/Reverse_OCR>).

 Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni <zanni.andrea84(a)gmail.com>
 wrote:
> Dear wikisourcers,
> while exploring the djvu text layer, the it.source community found some
> interesting features that are worth sharing (SPOILER ALERT: Wikisource
> reCAPTCHA).
> (I added the technicalities in the footnotes, please look at them if
> you're interested.)
>
> We discovered that when the text layer is extracted with the DjVuLibre
> djvused.exe tool [1],
> a text file is obtained, containing the words and their absolute
> coordinates within the image of the page.
>
> Here are some example rows of such a txt file from a running test:
>
>     (line 402 2686 2424 2757
>     (word 402 2699 576 2756 "State.")
>     (word 679 2698 892 2757 "Effects")
>     (word 919 2698 991 2756 "of")
>     (word 1007 2697 1467 2755 "Domestication")
>     (word 1493 2698 1607 2755 "and")
>     (word 1637 2697 1910 2757 "Climate.")
>     (word 2000 2698 2132 2756 "The")
>     (word 2155 2686 2424 2754 "Persians^"))
>
> As you can see, the last word has a ^ character inside, which indicates a
> doubtful character that the OCR software could not recognize.
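
A rough sketch of how rows like the ones above could be read back in Python, assuming (as the message does) that ^ marks a doubtful character and that every word row has the form (word xmin ymin xmax ymax "text"). The names WORD_ROW, parse_words and doubtful_words are illustrative and are not taken from the it.source script.

```python
import re

# One '(word xmin ymin xmax ymax "text")' row of the print-txt output.
# The quoted text is captured as-is; backslash escapes are not decoded.
WORD_ROW = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"')


def parse_words(txt):
    """Yield (xmin, ymin, xmax, ymax, text) for every word row in txt."""
    for match in WORD_ROW.finditer(txt):
        xmin, ymin, xmax, ymax = (int(g) for g in match.groups()[:4])
        yield xmin, ymin, xmax, ymax, match.group(5)


def doubtful_words(txt, marker="^"):
    """Return the word rows whose text contains the doubtful-character marker."""
    return [row for row in parse_words(txt) if marker in row[4]]


sample = '''(line 402 2686 2424 2757
 (word 2000 2698 2132 2756 "The")
 (word 2155 2686 2424 2754 "Persians^"))'''

print(doubtful_words(sample))
```

The (line ...) row is skipped because only rows starting with (word are matched.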
>
> What's really interesting is that a Python script can select these words
> using the ^ character and automatically produce a file with the image of
> the word, since all the parameters needed for a ddjvu.exe call can be
> obtained (please consider that this code comes from a rough, but
> *running* test script [2]).
>
> So, in our it.source test script, a tiff image has been automatically
> produced, containing exactly the image of the doubtful OCR output
> "Persians^".
> Its name is built as name-of-djvu-file + page number + coordinates within
> the page, which is all that is needed to unambiguously link the image to
> a specific word on a specific page of a djvu file.
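
To illustrate why such a name is enough to find the word again, here is a small sketch that builds a name following the same pattern as footnote [2] and parses it back. Both function names are hypothetical, and the row is assumed to be the list of whitespace-separated fields used by the script.

```python
def image_name(djvu_file, page, row):
    """Build the image name: djvu name + page number + row fields."""
    stem = djvu_file.replace(".djvu", "")
    return stem + "-" + str(page) + "-" + "_".join(str(f) for f in row) + ".tiff"


def parse_image_name(name):
    """Recover (djvu_file, page, fields) from a name built by image_name."""
    # Split from the right so that hyphens inside the djvu name are preserved.
    stem, page, fields = name[:-len(".tiff")].rsplit("-", 2)
    return stem + ".djvu", int(page), fields.split("_")


name = image_name("my-book.djvu", 12, ["word", "2155", "2686", "2424", "2754"])
print(name)
print(parse_image_name(name))
```

Splitting from the right works here because the page number and the coordinates never contain a hyphen.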
>
> The image has been uploaded into Commons as
> http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.…
>
> As you can easily imagine, this could be the core of a "wikicaptcha"
> project (as John Vandenberg called it), enabling us to produce our own
> Wikisource reCaptcha.
>
> A djvu file could be uploaded to a server (into an "incubator"); a
> database of doubtful word images could be built; images could be
> presented to wiki users (either as a voluntary task or as a formal
> reCAPTCHA to confirm edits by logged-out contributors); the resulting
> human interpretation could be validated somehow (e.g. by n matching
> interpretations from different users) and then used to edit the text
> layer of the djvu file. Finally, the edited djvu file could be uploaded
> to Commons for formal source management.
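
One possible reading of "validated by n repetitions" is a simple vote over the answers collected for the same word image. The sketch below is only an assumption about how that could work; the function name and the thresholds min_votes and min_share are made up for illustration.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None while agreement is insufficient."""
    cleaned = [a.strip() for a in answers if a.strip()]
    if not cleaned:
        return None
    text, votes = Counter(cleaned).most_common(1)[0]
    # Accept only if enough users gave this answer and they form a clear majority.
    if votes >= min_votes and votes / len(cleaned) >= min_share:
        return text
    return None


print(consensus(["Persians,", "Persians,", "Persians,", "Persians."]))
print(consensus(["Persians,", "Persians."]))
```

A word image would stay in the pool until consensus returns something other than None, and only then would the text layer be edited.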
>
> Please contact us if you would like a copy of the running test scripts.
> There is also a shared Dropbox folder with the complete environment where
> we are testing the scripts.
>
> Opinions, feedback or thoughts are more than welcome.
>
>
> Aubrey
> Admin it.source
> WMI Board
>
>
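>     [1] import os
>         # name-of.file.djvu, page-number and text-file.txt are placeholders
>         command = ('djvused name-of.file.djvu -e "select page-number; '
>                    'print-txt" >text-file.txt')
>         os.system(command)
>
>     [2] # Fragment of the test script: word, key, pag and fileDjvu are set
>         # by the surrounding loop (not shown). key is assumed to hold the
>         # row fields, with xmin ymin xmax ymax at positions 1 to 4.
>         if "^" in word:
>             coord = key.split()
>             # width and height of the word box
>             w = str(int(coord[3]) - int(coord[1]))
>             h = str(int(coord[4]) - int(coord[2]))
>             x = coord[1]
>             y = coord[2]
>             # image name: djvu name + page number + coordinates
>             filetiff = (fileDjvu.replace(".djvu", "") + "-" + pag + "-"
>                         + "_".join(coord) + ".tiff")
>             segment = "-segment {}x{}+{}+{}".format(w, h, x, y)
>             command = ("ddjvu " + fileDjvu + " -page=" + pag
>                        + " -format=tiff " + segment + " " + filetiff)
>             print(command)
>             os.system(command)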
