On Mon, 06 Aug 2012 16:38:20 +0200, Andrea Zanni
<zanni.andrea84(a)gmail.com> wrote:
If someone is interested,
Alex Brollo is digging into the djvu layer issue,
we have a Dropbox folder with all the files.
If you are interested in working on that, please drop me a mail.
What we can show you right now is this:
https://www.dropbox.com/s/lu6re2a02xp0nyc/Dialogo%20della%20salute%20djvu%2…
As you can see, the text is no longer mapped word by word onto the djvu
page; instead it is "stored" all together in a single region of the page
(in this case, the bottom-left corner).
It is very difficult to re-map the text, for example because when we use
the <ref> tag for footnotes we destroy the original pattern :-(
The cool thing is that the text inside is already formatted in wikitext!
https://www.dropbox.com/s/s2c0op5e9jeu47o/Dialogo%20della%20salute%20WS%20s…
Alex assures me this is easy and just uses a few scripts from djvulibre
(which is already installed on the Toolserver).
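For anyone who wants to try this at home, here is a minimal, untested
sketch of how the djvulibre tool djvused can store a whole page's wikitext
in a single text-layer region, as in the screenshot above. The file names,
page number and page size are placeholders; only djvused itself and its
-e/-s options and set-txt command are real djvulibre features:

```python
import subprocess

def page_sexpr(width, height, text):
    """Build a djvused hidden-text s-expression that stores all of
    `text` in one region covering the whole page (no per-word mapping)."""
    # djvused expects backslashes and double quotes escaped inside strings
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '(page 0 0 %d %d "%s")' % (width, height, escaped)

def set_text_layer(djvu_path, page, sexpr_path):
    """Replace the hidden text layer of one page and save in place."""
    script = "select %d; set-txt %s" % (page, sexpr_path)
    subprocess.run(["djvused", djvu_path, "-e", script, "-s"], check=True)
```

The s-expression file passed to set-txt would simply contain the output of
page_sexpr() for the page in question.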
The same could be done by uploading wiki-rendered HTML into the text layer.
This could be very interesting for other websites: they could just
copy-and-paste the HTML file, or extract it with a simple python script
calling for djvuLibre routines, and then use the Commons file as a
benchmark.
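As a rough illustration (not the actual script, which is an assumption on
my part), extracting the stored layer with djvulibre from Python could
look like this; djvutxt and its --page option are real djvulibre tools,
while the function names and file name are invented:

```python
import subprocess

def djvutxt_command(djvu_path, page=None):
    """Build the djvutxt invocation for one page (or the whole file)."""
    cmd = ["djvutxt"]
    if page is not None:
        cmd.append("--page=%d" % page)
    cmd.append(djvu_path)
    return cmd

def extract_text_layer(djvu_path, page=None):
    """Return the hidden text layer, e.g. the wikitext or HTML stored in it."""
    out = subprocess.run(djvutxt_command(djvu_path, page),
                         capture_output=True, text=True, check=True)
    return out.stdout
```

A third party could then call extract_text_layer("Dialogo.djvu", page=5)
and get back the formatted text directly.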
We could, maybe, give back some of our books to the Gutenberg project.
Or, maybe, give it back to GLAMs.
What do you think?
Aubrey and Alex
Let me share some things from my experience with DjVu when we worked on
the Gallica project (@Andrea: yes, I completely agree there were too many
books; on the other hand, it taught the community something about overly
big partnerships :).
For Gallica, we had to do the reverse operation: translate the specific
XML (ALTO, by the LoC [1]) into some format usable for Wikisource in the
text layer of the DjVu. The source XML was very rich: there were
coordinates for each word, plus some semantics about paragraphs, fonts,
and hyphens. We wondered for some time what exactly to put in the text
layer and searched for a standard, but (apart from the very poor state of
the DjVu documentation) it seems the format is completely free. So we
chose to keep the semantics about paragraphs, page headers and footers
(very useful), and hyphens.
Doing this we lost part of the semantics, but WS doesn't handle
coordinates, so there was no practical way to keep them. And references
were not recognized in ALTO, so there was no way to format them correctly
(they stayed at the bottom of the page).
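To make the ALTO-to-text-layer conversion concrete, here is a small
self-contained sketch of flattening ALTO into plain text while resolving
hyphens. The sample XML is invented and the ALTO namespace is omitted for
brevity (real ALTO files declare one), but String/CONTENT and HYP are the
actual ALTO element and attribute names:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock>
  <TextLine><String CONTENT="exam"/><HYP CONTENT="-"/></TextLine>
  <TextLine><String CONTENT="ple"/><String CONTENT="text"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

def block_text(block):
    """Join the words of a TextBlock, merging words split by a HYP
    (end-of-line hyphen) back together."""
    words, hyphenated = [], False
    for line in block.iter("TextLine"):
        for el in line:
            if el.tag == "String":
                if hyphenated and words:
                    words[-1] += el.get("CONTENT")
                    hyphenated = False
                else:
                    words.append(el.get("CONTENT"))
            elif el.tag == "HYP":
                hyphenated = True
    return " ".join(words)

root = ET.fromstring(SAMPLE)
for block in root.iter("TextBlock"):
    print(block_text(block))  # prints "example text"
```

The real converter kept more than this (paragraph breaks, headers,
footers), but the hyphen-merging step worked along these lines.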
The BnF asked during the partnership whether there was some way to
retrieve the proofread text, but there was no easy way to reconstruct the
ALTO format afterwards, particularly the coordinates. I had a project to
do this but I didn't have the time [2].
I think that, for exports of WS texts (including the DjVu text layer), we
should try to convert our syntax into some fixed syntax(es):
* raw wikitext; I don't like it for standard export in the text layer
because the syntax is not fixed (there is no Wikitext 1.0), not closed
(I mean there are external templates), and there are bad semantics, e.g.
headers in noinclude sections;
* raw text, without any semantics; some structures, like tables, cannot be
correctly handled;
* TEI [3]; I like it because I have the feeling it correctly handles the
semantics, but I guess there is not enough information in WS texts to
create such XML without significant effort;
* ALTO, for raw texts, though we don't have coordinates (or we could also
create ALTO without coordinates);
* others?
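On the coordinate-less ALTO idea: a minimal sketch of what generating such
a skeleton could look like. This is only an illustration of the shape of
the document; a real ALTO file would also need the ALTO namespace, a
Description section, and the HPOS/VPOS/WIDTH/HEIGHT attributes we would be
leaving out:

```python
import xml.etree.ElementTree as ET

def words_to_alto(words):
    """Wrap a flat list of words in a bare ALTO-like skeleton,
    with no positional attributes at all."""
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page")
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock")
    line = ET.SubElement(block, "TextLine")
    for w in words:
        ET.SubElement(line, "String", CONTENT=w)
    return ET.tostring(alto, encoding="unicode")

xml = words_to_alto(["Dialogo", "della", "salute"])
```

Whether downstream consumers would accept ALTO without coordinates is an
open question, which is partly why I list it only as one option among
several.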
Sébastien
[1] http://www.loc.gov/standards/alto/
[2] https://wikisource.org/wiki/User:Seb35/Reverse_OCR
[3] https://en.wikipedia.org/wiki/Text_Encoding_Initiative