Greetings.
Off and on for many months I have been working on a project to import a large collection of public domain historic scientific documents into Wikimedia's collection.
My standing plan has been to pre-organize and catalog the collection, then upload the document images to Commons as DjVu files (which are utterly tiny compared to TIFFs or PDFs), including an OCRed text layer (for search and copy and paste).
I would then begin importing documents into Wikisource, starting with the raw OCR text and eventually producing fully marked-up output. From there the documents could be extensively linked and referenced from the other Wikimedia projects.
Most of the delays in my work have come from waiting for free software OCR technology to be able to handle documents from the 18th century. With the recent beta releases of OCRopus and Tesseract from Google, I feel the results are finally good enough to move forward.
I do have some open questions though.
I'd really like it if the corrected text in Wikisource could be imported back into the DjVu document images. What I'd like to do is leave invisible markup generated by the OCR software in the page text, like this:
<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>The first experiments were made on the absorption of carbonic</span> <span class='ocr_line' title='bbox 474 4281 2668 4355 1'>acid gas by water: and here a singular disagreement was observed</span> <span class='ocr_line' title='bbox 471 4360 2668 4433 1'>in the first trials made under exactly the same circumstances. It</span>
From this the OCRed text could be corrected, and markup could be added, but I could still take the output and apply it back to the original document. If people feel this would frustrate editing too much, we could make some JavaScript hacks to the edit box to reduce the span tags to nothing more than an immutable <S marker.
Would this be acceptable?
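
To make the round trip a bit more concrete, here is a rough Python sketch of how the corrected page text might be read back into (bounding box, text) pairs. It is only an illustration under some assumptions: the function name extract_lines is made up for this example, the four bbox numbers are taken to be left, top, right and bottom pixel coordinates, the trailing fifth number is ignored, and spans are assumed not to be nested. Actually writing the result into the DjVu text layer is a separate step that is not shown.

```python
import re
from html import unescape

# Matches one ocr_line span in the form shown in the example above.
# The optional fifth number after the bbox coordinates is skipped.
LINE_RE = re.compile(
    r"<span class='ocr_line' title='bbox (\d+) (\d+) (\d+) (\d+)(?: \d+)?'>"
    r"(.*?)</span>",
    re.DOTALL,
)
# Any HTML-style tag an editor may have added inside a line.
TAG_RE = re.compile(r"<[^>]+>")


def extract_lines(page_text):
    """Return a list of ((x0, y0, x1, y1), text), one per ocr_line span.

    Hypothetical helper: assumes spans are not nested inside each other.
    """
    lines = []
    for match in LINE_RE.finditer(page_text):
        bbox = tuple(int(match.group(i)) for i in range(1, 5))
        # Strip tags added during editing and decode entities,
        # keeping only the plain corrected text for this line.
        text = unescape(TAG_RE.sub("", match.group(5))).strip()
        lines.append((bbox, text))
    return lines


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>"
        "The first experiments were made on the absorption of carbonic</span> "
        "<span class='ocr_line' title='bbox 474 4281 2668 4355 1'>"
        "acid gas by water: and here a singular disagreement was observed</span>"
    )
    for bbox, text in extract_lines(sample):
        print(bbox, text)
```

Because each line keeps its original coordinates, corrections made by editors stay attached to the right place on the page image, as long as the span tags themselves survive editing.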