2011/2/20 Seb35 <seb35wikipedia@gmail.com>

A quick link I just received:

http://www.digitalkoot.fi (in English also)

It seems there are two Facebook games whose aim is precisely to
correct OCR errors.

Sébastien

Sun, 20 Feb 2011 22:16:15 +0100, Seb35 <seb35wikipedia@gmail.com> wrote:
> Hi Andrea,
>
> I saw VIGNERON and Jean-Frédéric today and we spoke about that.
> Jean-Fred and I are a bit skeptical about the practical implementation
> of such a system. Here are some questions that I (or we) had
> (listed in order of importance):
>
> - how many books have such coordinates? I know the BnF partnership books
> have them because they were already in the original OCR files (1057 books),
> but on WS a lot of books have invalid coordinates (word 0 0 1 1 "")
> because Wikisourcians didn't know what these figures meant
> (the DjVu format is quite difficult to understand anyway); I don't know
> whether standard OCR software can output the coordinates for books
> OCRed in the future
>
> - how reliable are the coordinates? If you serve half a word,
> it will be difficult to recognize the entire word
>
> - I wonder how you can validate the correctness of a given word for a
> given person: a person (e.g.) creates an account on WS and is shown a
> Captcha with a word; how do you know whether his/her answer is correct? I
> agree this step disappears if you ask a pool of volunteers to answer
> different captcha words, but in this case it comes down to the classical
> check by Wikisourcians, in a specialized form for particular cases
>
> - you give the example of a ^ in a word, but how do you select the
> OCR mistakes? Although this is not really an issue, since you can already
> make a list of common mistakes and that will be sufficient at
> first. I know French Wikisourcians (at least, probably others too)
> already keep a list of frequent mistakes ( II->Il, 1l->Il, c->e ...),
> sometimes for a given book (the Trévoux in French of 1771, I believe).
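
A minimal sketch of how a list of frequent mistakes like the one quoted above could be used to propose corrections. The confusion pairs are the three given in the message; the names CONFUSIONS and candidate_corrections are hypothetical, and a real tool would presumably check the candidates against a dictionary, since a pair like c->e matches a great many correct words.

```python
# Confusion pairs (OCR output -> intended text) from the list quoted above.
CONFUSIONS = [("II", "Il"), ("1l", "Il"), ("c", "e")]


def candidate_corrections(word, confusions=CONFUSIONS):
    """Return every variant of `word` obtained by applying exactly one
    substitution at exactly one position."""
    candidates = set()
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidates.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return candidates
```

For example, candidate_corrections("IIs") yields the single candidate "Ils", which a proofreader (or a word list) could then confirm or reject.
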
>
> But I know Google had a similar system for their digitization, although I
> don't know the exact details. For me there are a lot of details which
> make the overall idea difficult to carry out (although I would prefer
> to think the contrary), but perhaps you have some answers.
>
> Sébastien
>
> PS: I had another idea in a slightly different field of application
> (roughly speaking, automated validation of texts) but close to this one;
> I will write an email about it next week (some notes are already at
> <http://wikisource.org/wiki/User:Seb35/Reverse_OCR>).
>
> Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni <zanni.andrea84@gmail.com>
> wrote:
>> Dear wikisourcers,
>> while exploring the djvu text layer, the it.source community found some
>> interesting features that are worth sharing (SPOILER ALERT: Wikisource
>> reCAPTCHA).
>> (I added the technicalities in the footnotes; please look at them if
>> you're interested.)
>>
>> We discovered that when the text layer is extracted with the DjVuLibre
>> djvused.exe tool [1],
>> a text file is obtained, containing words and their absolute coordinates
>> within the image of the page.
>>
>> Here are some example rows of such a txt file from a running test:
>>
>> (line 402 2686 2424 2757
>> (word 402 2699 576 2756 "State.")
>> (word 679 2698 892 2757 "Effects")
>> (word 919 2698 991 2756 "of")
>> (word 1007 2697 1467 2755 "Domestication")
>> (word 1493 2698 1607 2755 "and")
>> (word 1637 2697 1910 2757 "Climate.")
>> (word 2000 2698 2132 2756 "The")
>> (word 2155 2686 2424 2754 "Persians^"))
>>
>> As you can see, the last word has a ^ character inside, which indicates a
>> character that the OCR software was unsure of or could not recognize.
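
A minimal sketch of how rows like the ones above could be parsed to pick out the doubtful words. It assumes each word row has the form (word xmin ymin xmax ymax "text"), as in the sample; the names WORD_ROW and doubtful_words are hypothetical and are not taken from the it.source test script.

```python
import re

# Matches rows such as: (word 2155 2686 2424 2754 "Persians^")
WORD_ROW = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"')


def doubtful_words(text_layer):
    """Yield (xmin, ymin, xmax, ymax, text) for each word containing '^'."""
    for match in WORD_ROW.finditer(text_layer):
        xmin, ymin, xmax, ymax = (int(g) for g in match.groups()[:4])
        text = match.group(5)
        if "^" in text:
            yield xmin, ymin, xmax, ymax, text


sample = '''(line 402 2686 2424 2757
 (word 402 2699 576 2756 "State.")
 (word 2155 2686 2424 2754 "Persians^"))'''

hits = list(doubtful_words(sample))  # [(2155, 2686, 2424, 2754, 'Persians^')]
```

Rows with an empty text, such as (word 0 0 1 1 ""), contain no ^ and are therefore skipped by this filter.
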
>>
>> What's really interesting is that a Python script can select these words
>> using the ^ character and automatically produce a file with the image of
>> the word, since all the parameters needed for a ddjvu.exe call can be
>> obtained (please consider that this code comes from a rough, but
>> *running* test script [2]).
>>
>> So, in our it.source test script, a tiff image has been automatically
>> produced, containing exactly the image of the doubtful OCR output
>> "Persians^".
>> Its name is built as name-of-djvu-file+page number+coordinates within the
>> page, which is all that is needed to link the image unambiguously to the
>> specific word on a specific page of a djvu file.
>>
>> The image has been uploaded into Commons as
>> http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff
>>
>> As you can easily imagine, this could be the core of a "wikicaptcha"
>> project (as John Vandenberg called it), enabling us to produce our own
>> Wikisource reCaptcha.
>>
>> A djvu file could be uploaded to a server (into an "incubator"); a
>> database of doubtful word images could be built; images could be
>> presented to wiki users (either as a voluntary task or as a formal
>> reCAPTCHA to confirm edits by unlogged contributors); the resulting human
>> interpretations could be validated somehow (e.g. by requiring n matching
>> answers among the different interpretations), then used to edit the text
>> layer of the djvu file. Finally the edited djvu file could be uploaded to
>> Commons for formal source management.
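
The validation step mentioned above (accepting a reading once enough independent answers agree) could be sketched roughly as follows. This is only an illustration: the function name accepted_reading and the default threshold of 3 matching answers are assumptions, not part of the it.source scripts.

```python
from collections import Counter


def accepted_reading(answers, min_matches=3):
    """Return the transcription that was given at least `min_matches`
    times, or None if no reading has reached that threshold yet."""
    normalized = [a.strip() for a in answers if a.strip()]
    if not normalized:
        return None
    reading, count = Counter(normalized).most_common(1)[0]
    return reading if count >= min_matches else None
```

With the answers "Persians", "Persians", "Persian", "Persians" the function returns "Persians"; with only two answers it returns None and the word image would stay in the pool. Ties between readings are not treated specially in this sketch.
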
>>
>> Please contact us if you would like a copy of the running test scripts.
>> There is also a shared Dropbox folder with the complete environment in
>> which we are testing the scripts.
>>
>> Opinions, feedback or thoughts are more than welcome.
>>
>>
>> Aubrey
>> Admin it.source
>> WMI Board
>>
>>
>> [1] # extract the text layer of one page with djvused
>> import os
>> command = 'djvused name-of.file.djvu -e "select page-number; print-txt" > text-file.txt'
>> os.system(command)
>>
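>> [2] # word: OCR text of one row; key: presumably the row's
>> # "word xmin ymin xmax ymax" part; fileDjvu: name of the djvu file;
>> # pag: page number as a string
>> if "^" in word:
>>     coord = key.split()
>>     w = str(int(coord[3]) - int(coord[1]))  # width  = xmax - xmin
>>     h = str(int(coord[4]) - int(coord[2]))  # height = ymax - ymin
>>     x = coord[1]
>>     y = coord[2]
>>     # image name: djvu file name + page number + coordinates
>>     filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
>>     segment = "-segment WxH+X+Y".replace("W", w).replace("H", h).replace("X", x).replace("Y", y)
>>     command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
>>     print(command)
>>     os.system(command)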

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l