Dear wikisourcers, while exploring the djvu text layer, the it.source community found some interesting features that are worth sharing (SPOILER ALERT: Wikisource reCAPTCHA). (I added the technicalities in the footnotes; please look at them if you're interested.)
We discovered that when the text layer is extracted with the DjVuLibre djvused tool [1], a text file is obtained containing the words and their absolute coordinates within the page image.
Here are some example rows of such a text file from a running test:
(line 402 2686 2424 2757 (word 402 2699 576 2756 "State.") (word 679 2698 892 2757 "Effects") (word 919 2698 991 2756 "of") (word 1007 2697 1467 2755 "Domestication") (word 1493 2698 1607 2755 "and") (word 1637 2697 1910 2757 "Climate.") (word 2000 2698 2132 2756 "The") (word 2155 2686 2424 2754 "Persians^"))
As you can see, the last word contains a ^ character, which the OCR software uses to mark a doubtful, unrecognized character.
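For anyone who wants to play with this, here is a minimal sketch (not part of the it.source scripts; the function name and regex are my own assumptions about the row format shown above) that parses such djvused output rows and picks out the doubtful words with their bounding boxes:

```python
import re

# Matches one "(word xmin ymin xmax ymax "text")" group from a
# djvused print-txt row, as in the example above.
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "([^"]*)"\)')

def doubtful_words(line):
    """Return (xmin, ymin, xmax, ymax, text) tuples for words whose
    OCR text contains the doubt marker ^."""
    words = []
    for m in WORD_RE.finditer(line):
        x0, y0, x1, y1 = (int(n) for n in m.group(1, 2, 3, 4))
        text = m.group(5)
        if "^" in text:
            words.append((x0, y0, x1, y1, text))
    return words
```

On the sample row above, this would return only the "Persians^" entry with its box, which is exactly what the extraction script in footnote [2] needs.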
What's really interesting is that a Python script can select these words by looking for the ^ character and automatically produce a file with the image of the word, since all the parameters needed for a ddjvu call can be obtained (please consider that this code comes from a rough, but *running*, test script [2]).
So, in our it.source test script, a TIFF image has been produced automatically, containing exactly the image of the doubtful "Persians^" OCR output. Its name is built as name-of-djvu-file + page number + coordinates within the page, which is all that is needed to link unambiguously the image and the specific word on a specific page of a djvu file.
The image has been uploaded into Commons as http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.t...
As you can easily imagine, this could be the core of a "wikicaptcha" project (as John Vandenberg called it), enabling us to produce our own Wikisource reCAPTCHA.
A djvu file could be uploaded to a server (an "incubator"); a database of doubtful word images could be built; the images could be presented to wiki users (either as a voluntary task or as a formal reCAPTCHA to confirm edits by logged-out contributors); the resulting human interpretations could be validated somehow (e.g. by requiring n matching interpretations from different users), then used to edit the text layer of the djvu file. Finally, the edited djvu file could be uploaded to Commons for formal source management.
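The validation step described above ("n repetitions of matching interpretations") could be sketched roughly as follows. This is not code from the test scripts; the threshold and the case/whitespace normalization are my own assumptions:

```python
from collections import Counter

def validate(answers, n=3):
    """Accept a human transcription once the same answer (ignoring
    case and surrounding spaces) has been given n times.
    Returns the normalized winning answer, or None if no answer
    has reached the agreement threshold yet."""
    counts = Counter(a.strip().lower() for a in answers)
    best, votes = counts.most_common(1)[0] if counts else ("", 0)
    return best if votes >= n else None
```

In a real incubator, each doubtful word image would accumulate answers across users/sessions/days until this kind of agreement check accepts one, which is then written back to the djvu text layer.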
Please contact us if you would like a copy of the running test scripts. There is also a shared Dropbox folder with the complete environment where we are testing them.
Opinions, feedback or thoughts are more than welcome.
Aubrey Admin it.source WMI Board
[1] command = 'djvused name-of.file.djvu -e "select page-number; print-txt" > text-file.txt'
    os.system(command)
[2] if "^" in word:
        coord = key.split()
        # print coord
        w = str(int(coord[3]) - int(coord[1]))
        h = str(int(coord[4]) - int(coord[2]))
        x = coord[1]
        y = coord[2]
        filetiff = fileDjvu.replace(".djvu", "") + "-" + pag + "-" + "_".join(coord) + ".tiff"
        segment = "-segment WxH+X+Y".replace("W", w).replace("H", h).replace("X", x).replace("Y", y)
        command = "ddjvu " + fileDjvu + " -page=" + pag + " -format=tiff " + segment + " " + filetiff
        print command
        os.system(command)
Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we spoke about this. Jean-Fred and I are a bit skeptical about the effective implementation of such a system; here are some questions that I (or we) were asking (listed in order of importance):
- how many books have such coordinates? I know the BnF-partnership books have them because they were originally in the OCR files (1057 books), but on WS a lot of books have non-valid coordinates (word 0 0 1 1 "") because Wikisourcians didn't know the meaning of these figures (the DjVu format is quite difficult to understand anyway); I don't know whether classical OCR programs have a function to output the coordinates of future OCRed books
- what is the confidence in the coordinates? if you serve half a word, it will be difficult to recognize the entire word
- I wonder how you can validate the correctness of a given word for a given person: a person (e.g.) creates an account on WS and a CAPTCHA with a word is asked; how do you know if his/her answer is correct? I agree this step disappears if you ask a pool of volunteers to answer different captcha-words, but in that case it reduces to the classical check by Wikisourcians, specialized to treat particular cases
- you give the example of a ^ in a word, but how do you select the OCR mistakes? although this is not really an issue, since you can already make a list of common mistakes and it will be sufficient at first. I know French Wikisourcians (at least, probably others also) already keep a list of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a given book (the Trévoux of 1771 in French, it seems to me).
I know Google had a similar system for their digitization, but I don't know the details exactly. For me there are a lot of details that make the global idea difficult to carry out (although I would prefer to think the contrary), but perhaps you have some answers.
Sébastien
PS: I had another idea in a slightly different application field (roughly speaking, automated validation of texts) but close to this one; I will write an email next week about that (there are already some notes at http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni zanni.andrea84@gmail.com wrote:
A quick link I just received:
http://www.digitalkoot.fi (in English also)
It seems there are two Facebook games whose aim is precisely to correct OCR output.
Sébastien
Sun, 20 Feb 2011 22:16:15 +0100, Seb35 seb35wikipedia@gmail.com wrote:
Hi Seb, I'll answer personally, since I'm the fellow most engaged in djvu exploration within the it.source group.
2011/2/20 Seb35 seb35wikipedia@gmail.com
Hi Andrea,
I saw VIGNERON and Jean-Frédéric today and we spoke about this. Jean-Fred and I are a bit skeptical about the effective implementation of such a system; here are some questions that I (or we) were asking (listed in order of importance):
- how many books have such coordinates? I know the BnF-partnership books have them because they were originally in the OCR files (1057 books), but on WS a lot of books have non-valid coordinates (word 0 0 1 1 "") because Wikisourcians didn't know the meaning of these figures (the DjVu format is quite difficult to understand anyway); I don't know whether classical OCR programs have a function to output the coordinates of future OCRed books
Coordinates come from the OCR interpretation. All Internet Archive books have them, both in the djvu file text layer and in the djvu.xml file. You can verify the presence of coordinates simply with DjView: open the file, go to the View menu, select Display -> Hidden text and, if coordinates exist, you'll see the word text superimposed on the word images.
You can't get coordinates from an end-user OCR program such as FineReader 10; you have to use professional versions, such as the OCR engines designed for mass, automated batch OCR routines.
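Seb's question about non-valid coordinates could also be answered mechanically. A rough heuristic, not from the original scripts (names and regex are mine): a text layer whose word boxes are all the dummy "0 0 1 1" rectangle carries no usable coordinates for word-image extraction.

```python
import re

WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "[^"]*"\)')

def has_real_coordinates(txt):
    """True if at least one word box in a djvused print-txt dump is
    something other than the dummy (word 0 0 1 1 "...") rectangle."""
    boxes = WORD_RE.findall(txt)
    return any(box != ("0", "0", "1", "1") for box in boxes)
```

A bot could run this over a wiki's djvu text layers to estimate how many books actually have usable coordinates.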
- what is the confidence in the coordinates? if you serve half a word, it will be difficult to recognize the entire word
The confidence in the coordinates is extremely high. Coordinate calculation is the first step of any OCR interpretation, so if you get a decent OCR interpretation, it means the coordinate calculation is essentially perfect. Obviously you'll find wrong coordinates wherever you find a wrong OCR interpretation.
- I wonder how you can validate the correctness of a given word for a given person: a person (e.g.) creates an account on WS and a CAPTCHA with a word is asked; how do you know if his/her answer is correct? I agree this step disappears if you ask a pool of volunteers to answer different captcha-words, but in that case it reduces to the classical check by Wikisourcians, specialized to treat particular cases
There are different strategies, all based on a complete automation of the user-interpretation step:
# classical: submit two words, one known as a control, the other unknown. Exact interpretation of the known word validates the interpretation of the unknown one.
# alternative: ask for more than one interpretation of the unknown word from different users/sessions/days, and validate the interpretation when they match.
- you give the example of a ^ in a word, but how do you select the OCR mistakes? although this is not really an issue, since you can already make a list of common mistakes and it will be sufficient at first. I know French Wikisourcians (at least, probably others also) already keep a list of frequent mistakes (II->Il, 1l->Il, c->e ...), sometimes for a given book (the Trévoux of 1771 in French, it seems to me).
FineReader OCR applications use the ^ character for uninterpretable characters. Other tricks to find "probably wrong" words can be imagined, such as matching words against a dictionary. Usual "scannos" are better managed with different routines, in JavaScript or Python; e.g. you can wrap them into a Regex Menu Framework clean-up routine (see the clean-up routine used by [[en:User:Inductiveload]], or the postOCR routine in the RegexMenuFramework gadget on it.source, built from Inductiveload's clean-up routine). Wikicaptcha would manage unusual OCR mistakes, not the usual ones.
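A scanno clean-up pass of the kind mentioned here is easy to sketch in Python. The table entries below are purely illustrative, in the spirit of the II->Il, 1l->Il lists Seb mentions; they are not the actual it.source or en.source rules:

```python
import re

# Illustrative scanno substitution table (pattern -> replacement).
SCANNOS = [
    (r'\bII\b', 'Il'),   # roman "II" misread for "Il"
    (r'\b1l\b', 'Il'),   # digit one + lowercase L
    (r'\bl1\b', 'Il'),   # lowercase L + digit one
]

def fix_scannos(text):
    """Apply each scanno substitution in order over the whole text."""
    for pattern, repl in SCANNOS:
        text = re.sub(pattern, repl, text)
    return text
```

A per-book table (like the one for the Trévoux) would just be another such list loaded for that book.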
I know Google had a similar system for their digitization, but I don't know the details exactly. For me there are a lot of details that make the global idea difficult to carry out (although I would prefer to think the contrary), but perhaps you have some answers.
Unluckily, Google doesn't share the OCR mappings of its OCRs; it shares only the "pure text". This is one of the sound reasons to upload Google PDFs to the Internet Archive, thus getting their "derivation", i.e. the publication of a derived djvu file with a text layer from another (usually good) OCR interpretation.
Sébastien
PS: I had another idea in a slightly different application field (roughly speaking, automated validation of texts) but close to this one; I will write an email next week about that (there are already some notes at http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
I'll take a look with great interest.
Alex brollo
2011/2/20 Seb35 seb35wikipedia@gmail.com
Hi Andrea,
PS: I had another idea in a slightly different application field (roughly speaking, automated validation of texts) but close to this one; I will write an email next week about that (there are already some notes at http://wikisource.org/wiki/User:Seb35/Reverse_OCR).
I'll take a look with great interest.
I took a look at that most interesting page and added some preliminary comments. The field is a large and promising one! Perhaps a specific, dedicated space is needed to share ideas and scripts. Some users are working on this here and there, but perhaps a meeting point is needed.
PS: in our it.wiki talks, we call "Wikisource djvu" the same idea that you call "Reverse_OCR". :-)
Alex brollo
Mon, 21 Feb 2011 11:23:18 +0100, Alex Brollo alex.brollo@gmail.com wrote:
I took a look at that most interesting page and added some preliminary comments. The field is a large and promising one! Perhaps a specific, dedicated space is needed to share ideas and scripts. Some users are working on this here and there, but perhaps a meeting point is needed.
Perhaps we could open a page/space on Meta or wikisource.org about research and tools around Wikisource and OCR (or perhaps one already exists).
http://wikisource.org/wiki/Wikisource:Tools ? (not created)
PS: in our it.wiki talks, we call "Wikisource djvu" the same idea that you call "Reverse_OCR". :-)
I worked on a Python implementation 3-4 months ago, but image processing in Python is not really advanced (particularly the creation of images of words; I began to write a wrapper for FreeType (more complete than the existing one), but it was quite long and I'm not a professional developer), and I had to create a particle filter in Python (not really complicated for me, since it's my thesis research topic, but...)
I then switched to a C++ implementation to use FreeType directly, and a particle filter is available via the English WP links. But I have had no more time for about 1-2 months; I should share my code on the toolserver SVN to show what I've done.
Sébastien
2011/2/21 Seb35 seb35wikipedia@gmail.com
I then switched to a C++ implementation to use FreeType directly, and a particle filter is available via the English WP links. But I have had no more time for about 1-2 months; I should share my code on the toolserver SVN to show what I've done.
My suggestion is to open a page with a more specific name, something like "Wikisource:Djvu text layer management" or similar. The first step could be a list of links to existing scripts and tools (e.g. en.source contains interesting tips & tricks).
PS: If I were not really bold, I'd be confused and intimidated by your software competence... I hardly know what C++ and the toolserver SVN are. :-) I don't use Python imaging libraries such as PIL at all; I only use the DjVuLibre routines, calling them from Python to manage their input/output and to handle the plain text they produce!
Alex brollo
I took a fast look at the Help:Djvu pages (with different names in different projects) on Commons, en.source and fr.source; I found interesting suggestions about djvu files, but I didn't find detailed help about the djvu text layer, nor about its manipulation. So I imagine such a page could really be written from scratch to share the needed details. Where? What are your suggestions for its name? I found that wikisource.org has a poor set of help pages, and I presume that not many users browse it; is Commons perhaps the best project to collect scripts, tricks and ideas?
Alex
Mon, 21 Feb 2011 15:42:57 +0100, Alex Brollo alex.brollo@gmail.com wrote:
I took a fast look at the Help:Djvu pages (with different names in different projects) on Commons, en.source and fr.source; I found interesting suggestions about djvu files, but I didn't find detailed help about the djvu text layer, nor about its manipulation. So I imagine such a page could really be written from scratch to share the needed details. Where? What are your suggestions for its name? I found that wikisource.org has a poor set of help pages, and I presume that not many users browse it; is Commons perhaps the best project to collect scripts, tricks and ideas?
I would say wikisource.org, since DjVus are mainly used on WS (there are a few DjVus outside WS, but not many, perhaps inside institutions). If you are interested in DjVu, you can browse http://www.djvu.org/resources/, but the DjVu specifications are quite unreadable. Briefly, there are many layers, of which the text is one; there are also annotation layers, many layers for the image, etc. DjVu is really powerful but really badly documented. So if you write some documentation from scratch, it will be just about the first readable documentation (I thought some time ago of writing a wikibook about DjVu but never did it).
Sébastien
2011/2/21 Seb35 seb35wikipedia@gmail.com
I would say wikisource.org, since DjVus are mainly used on WS (there are a few DjVus outside WS, but not many, perhaps inside institutions). If you are interested in DjVu, you can browse http://www.djvu.org/resources/, but the DjVu specifications are quite unreadable. Briefly, there are many layers, of which the text is one; there are also annotation layers, many layers for the image, etc. DjVu is really powerful but really badly documented. So if you write some documentation from scratch, it will be just about the first readable documentation (I thought some time ago of writing a wikibook about DjVu but never did it).
Sébastien
OK for wikisource.org. I'll post a message in their Village Pump just to explain the idea to the community, then I'll open the page, taking into account any suggestions from them.
Alex
2011/2/21 Seb35 seb35wikipedia@gmail.com
I would say wikisource.org, since DjVus are mainly used on WS (there are a few DjVus outside WS, but not many, perhaps inside institutions).
All is ready to start: http://wikisource.org/wiki/Wikisource:Scriptorium#Dragging_into_djvu_text_la...
See you there if you like!
Alex