On Mon, 06 Aug 2012 16:38:20 +0200, Andrea Zanni zanni.andrea84@gmail.com wrote:
In case anyone is interested: Alex Brollo is digging into the DjVu text-layer issue, and we have a Dropbox folder with all the files. If you would like to work on that, please drop me a mail.
What we can show you right now is this: https://www.dropbox.com/s/lu6re2a02xp0nyc/Dialogo%20della%20salute%20djvu%20...
As you can see, the text is not mapped back onto the page layout: it is "stored" all together in a single region of the DjVu page (in this case, the bottom-left corner). Re-mapping the text is very difficult, for example because when we use the <ref> tag for footnotes we destroy the pattern :-(
The cool thing is that the text inside is already formatted in wikitext! https://www.dropbox.com/s/s2c0op5e9jeu47o/Dialogo%20della%20salute%20WS%20ss... Alex assures me this is easy and just uses a few scripts from DjVuLibre (which is already installed on the Toolserver). The same could be done by uploading wiki-rendered HTML into the text layer.
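For the curious, here is a minimal sketch of how such a single-region text layer can be written back with DjVuLibre's djvused. The helper name and the page dimensions are my own illustration, not Alex's actual script; the hidden-text s-expression format is the one djvused's set-txt command reads:

```python
def make_hidden_text_sexpr(text: str, width: int, height: int) -> str:
    """Build a minimal hidden-text s-expression for djvused's set-txt:
    the whole page text in one zone, much like in Alex's files.
    width/height are the page dimensions in pixels."""
    # djvused strings must have backslashes and double quotes escaped
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'(page 0 0 {width} {height} "{escaped}")'

# Save the s-expression to a file (e.g. layer.txt), then apply it with:
#   djvused book.djvu -e 'select 1; set-txt layer.txt' -s
```

This deliberately skips per-word coordinates, which is why the whole text ends up in one corner of the page when a viewer renders the layer.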
This could be very interesting for other websites: they could just copy-and-paste the text, or extract it with a simple Python script calling DjVuLibre routines, and then use the Commons file as a reference copy. We could, maybe, give some of our books back to Project Gutenberg. Or give them back to GLAMs.
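The extraction side is a one-liner around DjVuLibre's djvutxt tool; here is a hedged sketch (the function names are mine, and it assumes djvutxt is on the PATH):

```python
import subprocess
from typing import Optional

def djvutxt_command(djvu_path: str, page: Optional[int] = None) -> list:
    """Assemble the djvutxt invocation (djvutxt ships with DjVuLibre)."""
    cmd = ["djvutxt"]
    if page is not None:
        cmd.append(f"--page={page}")  # restrict extraction to one page
    cmd.append(djvu_path)
    return cmd

def extract_text_layer(djvu_path: str, page: Optional[int] = None) -> str:
    """Return the hidden text layer of a DjVu file (or of one page)."""
    result = subprocess.run(djvutxt_command(djvu_path, page),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A downstream site could run this over a Commons file and get back exactly the wikitext (or HTML) that was stored in the layer.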
What do you think?
Aubrey and Alex
I'll share here some things from my experience with DjVu when we worked on the Gallica project (@Andrea: yes, I completely agree there were too many books; on the other hand, it taught the community something about overly large partnerships :).
For Gallica, we had to do the reverse operation: translate a specific XML format (ALTO, from the LoC [1]) into something usable for Wikisource in the text layer of the DjVu. The source XML was very rich: it had coordinates for each word, plus some semantics about paragraphs, fonts, and hyphens. We wondered for some time what exactly to put in the text layer and searched for a standard, but (apart from the very bad state of the DjVu documentation) it seems to be completely free-form. So we chose to keep the semantics about paragraphs, page headers and footers (very useful), and hyphens.
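To give an idea of what such an ALTO-to-text-layer conversion involves, here is a minimal sketch of the flattening step. The element and attribute names (TextLine, String, CONTENT, SUBS_TYPE, SUBS_CONTENT) come from the ALTO schema; the rest, including the namespace handling, is my own simplification of what the Gallica scripts had to do:

```python
import xml.etree.ElementTree as ET

def _local(tag: str) -> str:
    """Drop the XML namespace that real ALTO files declare."""
    return tag.rsplit("}", 1)[-1]

def alto_to_text(alto_xml: str) -> str:
    """Flatten an ALTO document to plain text: one output line per
    TextLine, re-joining hyphenated words through SUBS_CONTENT."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if _local(elem.tag) != "TextLine":
            continue
        words = []
        for s in elem:
            if _local(s.tag) != "String":
                continue
            subs = s.get("SUBS_TYPE")
            if subs == "HypPart2":
                continue  # second half: full word already emitted
            words.append(s.get("SUBS_CONTENT") if subs == "HypPart1"
                         else s.get("CONTENT", ""))
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Keeping the hyphen information instead (as we did) means emitting both halves with the hyphen, rather than substituting the full word.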
Doing this we lose part of the semantics, but WS doesn't handle coordinates, so there was no practical way to keep them. And references were not recognized in ALTO, so there was no way to format them correctly (they stayed at the bottom of the page).
The BnF asked during the partnership whether there was some way to retrieve the proofread text, but there was no easy way to reconstruct the ALTO format afterwards, particularly the coordinates. I had a project to retrieve it, but I didn't have the time [2].
I think that for exports of WS texts (including the DjVu text layer), we should try to convert our syntax into one or more fixed syntaxes:
* raw wikitext — I don't like it as a standard export format for the text layer, because the syntax is not fixed (there is no Wikitext 1.0), not closed (I mean there are external templates), and it carries bad semantics, e.g. for headers in noinclude sections;
* raw text, without any semantics; some constructs, like tables, cannot be handled correctly;
* TEI [3] — I like it because I have the feeling it handles the semantics correctly, but I guess there is not enough information in WS texts to create such XML without significant effort;
* ALTO, for raw text, although we don't have coordinates (or we could also create ALTO without coordinates);
* others?
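The last option (ALTO without coordinates) could be sketched like this; a toy skeleton under my own assumptions, not the Gallica tooling, and using the ALTO v2 namespace from the LoC schema:

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

def text_to_alto(page_text: str) -> str:
    """Wrap plain proofread text in a bare-bones ALTO skeleton:
    one TextLine per input line, one String per word, no coordinates."""
    ET.register_namespace("", ALTO_NS)  # serialize as the default namespace
    alto = ET.Element(f"{{{ALTO_NS}}}alto")
    layout = ET.SubElement(alto, f"{{{ALTO_NS}}}Layout")
    page = ET.SubElement(layout, f"{{{ALTO_NS}}}Page")
    space = ET.SubElement(page, f"{{{ALTO_NS}}}PrintSpace")
    block = ET.SubElement(space, f"{{{ALTO_NS}}}TextBlock")
    for line in page_text.splitlines():
        text_line = ET.SubElement(block, f"{{{ALTO_NS}}}TextLine")
        for word in line.split():
            ET.SubElement(text_line, f"{{{ALTO_NS}}}String", CONTENT=word)
    return ET.tostring(alto, encoding="unicode")
```

Of course a real export would also need the mandatory Description section and page attributes to validate against the schema; this only shows that the structure is cheap to produce even without coordinates.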
Sébastien
[1] http://www.loc.gov/standards/alto/
[2] https://wikisource.org/wiki/User:Seb35/Reverse_OCR
[3] https://en.wikipedia.org/wiki/Text_Encoding_Initiative