[Wikisource-l] On linking Wikisource with page images

Mon Jan 21 16:39:18 UTC 2008

Greetings.

Off and on for many months I have been working on a project to import a
large collection of public domain historic scientific documents into
Wikimedia's collection.

My standing plan has been to pre-organize and catalog the collection,
then upload the document images as DjVu files (which are utterly tiny
compared to TIFFs or PDFs) to Commons, including an OCRed text layer
(for search and copy and paste).
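
To make the "text layer" idea concrete, here is a minimal sketch of how
OCR word boxes might be serialized into the parenthesized hidden-text
format that, as far as I understand it, djvused accepts for a page's
text. The Word class and the helper names are hypothetical and made up
for this illustration, the coordinates are invented, and the escaping
only covers backslashes and double quotes (non-ASCII handling is left
out), so treat it as an assumption-laden outline rather than a working
importer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word and its bounding box in pixels (hypothetical type).

    Assumes the origin is the bottom-left corner of the page.
    """
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def quote(text: str) -> str:
    # Escape backslashes and double quotes so the quoted string stays well-formed.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def bounding_box(words: List[Word]) -> Tuple[int, int, int, int]:
    # Smallest rectangle that contains every word on the line.
    return (
        min(w.xmin for w in words),
        min(w.ymin for w in words),
        max(w.xmax for w in words),
        max(w.ymax for w in words),
    )


def line_sexpr(words: List[Word]) -> str:
    # One "(line ...)" element wrapping a "(word ...)" element per word.
    x0, y0, x1, y1 = bounding_box(words)
    inner = " ".join(
        f"(word {w.xmin} {w.ymin} {w.xmax} {w.ymax} {quote(w.text)})"
        for w in words
    )
    return f"(line {x0} {y0} {x1} {y1} {inner})"


def page_sexpr(width: int, height: int, lines: List[List[Word]]) -> str:
    # The page element spans the whole image and contains one element per line.
    body = "\n  ".join(line_sexpr(line) for line in lines)
    return f"(page 0 0 {width} {height}\n  {body})"


if __name__ == "__main__":
    first_line = [
        Word("The", 100, 900, 160, 930),
        Word("first", 170, 900, 250, 930),
    ]
    print(page_sexpr(1000, 1500, [first_line]))
```

Keeping the boxes next to the words is what would let corrected text be
matched back to its position on the page image later on.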

I would then begin importing documents into Wikisource, starting with
the raw OCR text but eventually having fully marked-up output.  From
there the documents could be extensively linked and referenced from the
other Wikimedia projects.

Most of the delay in my work has come from waiting for free software
OCR technology to be able to handle documents from the 18th century.
With the recent beta releases of OCRopus and Tesseract from Google I
feel the results are finally good enough to move forward.

I do have some open questions though.

I'd really like it if the corrected text in Wikisource could be
imported back into the DjVu document images.  What I'd like to do is
leave invisible markup generated by the OCR software in the page text,
like this:

The first
experiments were made on the absorption of carbonic acid gas by water:
and here a singular disagreement was observed in the first trials
made under exactly the same circumstances. It