> Message: 5
>
> Date: Thu, 3 May 2012 08:33:45 +0200
>
> From: Alex Brollo <alex.brollo(a)gmail.com>
>
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
>
> Subject: [Wikitech-l] Full support for djvu files
>
> Message-ID:
>
> <CAH_M_mPXxD9LeMjHCm65CRAvoqN5W45O5dGO+TeH1C0f_hc4rg(a)mail.gmail.com>
>
> Content-Type: text/plain; charset=ISO-8859-1
>
>
> Djvu files are the wikisource standard supporting proofreading. They have
>
> very interesting features, being fully "open" in structure and layering,
>
> and allowing a fast and effective sharing into the web, when they are
>
> stored in their "indirect" mode. Most interesting, their text layer - which
>
> can be easily extracted - contains both the mapped text from OCR and
>
> metadata. A free library - DjVuLibre - allows full command line access to
>
> any file content.
>
>
> Presently, djvu files structure and features are minimally used. Indirect
>
> mode is IMHO not supported at all, there's no means to access the mapped text
>
> layer nor to metadata, and only the "full text" can be accessed once, when
>
> creating a new page into Page namespace.
>
>
> It would be great IMHO:
>
> * to support indirect mode as the standard;
>
> * to allow free, easy access to the full text layer content from wikisource
>
> user interface.
>
>
> Alex
>

The text layer is stored in img_metadata, which means it can be
retrieved through the API (using
?action=query&prop=imageinfo&iiprop=metadata). However, when I tried
to test this, it didn't seem to work. Maybe returning the entire text
layer hit a maximum API result size limit or something similar. (It
would be really nice if we had a better place to store information
about files, especially for huge things like the text layer, which we
generally don't want to load in full every time. There's a bug about
that somewhere in Bugzilla.)
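
For reference, here is a rough Python sketch of the query described
above. It is only an illustration, not a tested recipe: the endpoint
(Commons), the User-Agent string and the function names are my own
assumptions, and the response shape (query -> pages -> imageinfo ->
metadata) is what I would expect from format=json. As noted above, the
real call may fail or come back truncated for large text layers.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: querying Commons; any MediaWiki api.php endpoint should work.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_metadata_query(file_title, endpoint=API_ENDPOINT):
    """Build the ?action=query&prop=imageinfo&iiprop=metadata URL."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "metadata",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


def extract_metadata(response):
    """Map each page title in a parsed JSON response to its metadata.

    Pages without an imageinfo entry are skipped.
    """
    results = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            results[page.get("title")] = info.get("metadata")
    return results


def fetch_metadata(file_title):
    """Perform the request (needs network access)."""
    # Hypothetical User-Agent value, included only as a courtesy header.
    request = Request(
        build_metadata_query(file_title),
        headers={"User-Agent": "djvu-metadata-sketch/0.1"},
    )
    with urlopen(request) as resp:
        return extract_metadata(json.load(resp))
```

Calling fetch_metadata("File:Some_book.djvu") (a placeholder title)
would return a dict keyed by page title, or an empty dict if the API
returned no imageinfo for it.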

Indirect mode (from what I can find out from Google) is when you have
an index djvu file that links to all the pages making up the document,
so you can start viewing immediately and pages are only downloaded as
needed. I'm not sure how such a format would work in terms of
uploading it. Unless we convert it on the server side, how would we
upload all the constituent files? (I suppose we could tell people to
upload tarballs. Then we would have to validate the contents, and
communicate to people that tarballs are only for uploading djvu
files.) [Of course, until 5 minutes ago I'd never heard of an indirect
djvu file, so I could be misunderstanding.]
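
To make the tarball idea a bit more concrete, here is a minimal
Python sketch of the kind of content validation it would need. This is
purely hypothetical and not how MediaWiki handles uploads: the
function name and the list of allowed extensions are my own guesses,
and a real check would also have to inspect the files themselves
rather than trust their names.

```python
import io
import tarfile
from pathlib import PurePosixPath

# Assumption: extensions that might appear in an indirect djvu document.
ALLOWED_SUFFIXES = {".djvu", ".djv", ".iff", ".djbz"}


def validate_djvu_tarball(fileobj):
    """Return a list of problems found; an empty list means it passed."""
    problems = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            if member.isdir():
                # Plain directories are tolerated; their contents are
                # checked as separate members.
                continue
            if not member.isfile():
                # Rejects symlinks, hard links, devices and so on.
                problems.append(f"{member.name}: not a regular file")
            elif path.is_absolute() or ".." in path.parts:
                problems.append(f"{member.name}: unsafe path")
            elif path.suffix.lower() not in ALLOWED_SUFFIXES:
                problems.append(f"{member.name}: unexpected file type")
    return problems
```

An upload handler would reject the tarball whenever the returned list
is non-empty, and could show the list to the uploader to explain why.
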
-bawolff