Message: 5
Date: Thu, 3 May 2012 08:33:45 +0200
From: Alex Brollo alex.brollo@gmail.com
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] Full support for djvu files
Message-ID: CAH_M_mPXxD9LeMjHCm65CRAvoqN5W45O5dGO+TeH1C0f_hc4rg@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
DjVu files are the Wikisource standard for proofreading. They have very
interesting features: they are fully "open" in structure and layering, and
they can be shared quickly and effectively on the web when they are stored
in "indirect" mode. Most interestingly, their text layer, which can be
easily extracted, contains both the mapped text from OCR and metadata. A
free library, DjVuLibre, gives full command-line access to any file
content.
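As a rough illustration of that command-line access, here is a minimal Python sketch that wraps DjVuLibre's djvused tool. It assumes djvused is installed and on the PATH; the helper names and the example file name are hypothetical, and what the commands print depends on which component djvused has selected.

```python
import subprocess
from typing import List

# djvused sub-commands used here: print-txt dumps the hidden text layer
# together with its coordinates, print-meta dumps metadata key/value pairs.
COMMANDS = {"text": "print-txt", "metadata": "print-meta"}


def djvused_argv(djvu_path: str, what: str) -> List[str]:
    """Build the argument list for a djvused call (hypothetical helper)."""
    return ["djvused", "-e", COMMANDS[what], djvu_path]


def dump(djvu_path: str, what: str) -> str:
    """Run djvused and return its output; needs DjVuLibre installed."""
    result = subprocess.run(
        djvused_argv(djvu_path, what),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, dump("book.djvu", "text") would return the mapped text layer of a hypothetical book.djvu as printed by djvused.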
Presently, the structure and features of DjVu files are barely used.
Indirect mode is IMHO not supported at all, there is no way to access the
mapped text layer or the metadata, and the "full text" can be accessed only
once, when creating a new page in the Page namespace.
It would be great IMHO:
to support indirect mode as the standard;
to allow free, easy access to the full text layer content from the
Wikisource user interface.
Alex
The text layer is stored in img_metadata, which means it can be retrieved through the API (using ?action=query&prop=imageinfo&iiprop=metadata). However, when I tried to test this, it didn't seem to work. Maybe trying to return the entire text layer hit some maximum API result size limit or something. (It'd be really nice if we had a better place to store information about files, especially for huge things like the text layer, which we generally don't want to load in full every time. There's a bug about that somewhere in Bugzilla land.)
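For reference, a small Python sketch of how that query could be assembled. The query parameters action, prop and iiprop are the ones quoted above; the endpoint URL, the titles and format parameters, and the file title are assumptions added for illustration. As noted, the real request may fail or be truncated for files with a large text layer.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import json

# Assumption: the wiki's api.php endpoint; substitute the wiki you are querying.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def imageinfo_metadata_url(file_title: str) -> str:
    """Build the imageinfo query URL for one file page."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": file_title,  # hypothetical title, e.g. "File:Example.djvu"
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_metadata(file_title: str) -> dict:
    """Perform the request and decode the JSON reply (needs network access)."""
    with urlopen(imageinfo_metadata_url(file_title)) as response:
        return json.load(response)
```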
Indirect mode (from what I can find out from Google) is when you have an index DjVu file that links to all the pages making up the document, so you can start viewing immediately and pages are only downloaded as needed. I'm not sure how such a format would work in terms of uploading it. Unless we convert it on the server side, how would we upload all the constituent files? (I suppose we could tell people to upload tarballs. Then we would have to validate the contents, and communicate to people that tarballs are only for uploading DjVu files.) [Of course, until 5 minutes ago I'd never heard of an indirect DjVu file, so I could be misunderstanding.]
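To make the "validate the contents" step concrete, here is a hedged Python sketch of what a tarball check might look like. Nothing like this exists in the thread: the function name, the flat-layout rule, and the list of allowed extensions are all assumptions, and a real check would also need to verify that each member actually is a DjVu component.

```python
import io
import tarfile
from pathlib import PurePosixPath
from typing import List

# Assumption: extensions an indirect DjVu upload might contain; adjust as needed.
ALLOWED_SUFFIXES = {".djvu", ".djbz", ".iff"}


def invalid_members(tar_bytes: bytes) -> List[str]:
    """Return the names of tar members that should cause a rejection."""
    bad = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            # Require a flat archive: no directories, no "../", no absolute paths.
            flat = len(path.parts) == 1 and not path.is_absolute()
            allowed = path.suffix.lower() in ALLOWED_SUFFIXES
            # isfile() rejects symlinks, devices and directories.
            if not (member.isfile() and flat and allowed):
                bad.append(member.name)
    return bad
```

An upload would be accepted only if invalid_members returns an empty list. The alternative mentioned above, converting on the server side, would avoid tarballs entirely by accepting an ordinary bundled file and splitting it after upload.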
-bawolff