Alexander Klauer wrote:
on how to upload scanned texts:
it would be great if the MediaWiki DjVu inline renderer and the ProofreadPage extension could be made to work together. Then one could upload texts as DjVu with all its benefits (plain text/image mixing, efficient storage, only one single file upload), but one would still be able to extract single pages into Wikisource's Page: namespace.
Ultimately, upload and download should be possible as DjVu, PDF, TIFF, or a ZIP archive. All of these formats can store many pages in one file. As far as I know, DjVu and PDF can mix image and (OCR) text in one file, including the mapping of individual words to positions in the image. In a ZIP archive, you could store the scanned image in 0001.jpg (or .png or .tif) together with the OCR text in 0001.txt, and so on.
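To make the ZIP idea concrete, here is a rough sketch of how the image/text pairs could be read back out of such an archive. It uses only Python's standard zipfile module; the function name pair_pages and the placeholder file contents are made up for illustration, and nothing here is existing MediaWiki or ProofreadPage code.

```python
import io
import zipfile

# Image extensions suggested above; anything else is ignored.
IMAGE_EXTENSIONS = (".jpg", ".png", ".tif")

def pair_pages(archive):
    """Map each page stem (e.g. '0001') to (image name, OCR text or None)."""
    names = archive.namelist()
    pages = {}
    for name in sorted(names):
        stem, dot, ext = name.rpartition(".")
        if dot and ("." + ext.lower()) in IMAGE_EXTENSIONS:
            text_name = stem + ".txt"
            if text_name in names:
                text = archive.read(text_name).decode("utf-8")
            else:
                text = None  # page scanned but not yet OCRed
            pages[stem] = (name, text)
    return pages

# Build a tiny in-memory archive with placeholder content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("0001.jpg", b"placeholder image bytes")
    archive.writestr("0001.txt", "OCR text of page 1")
    archive.writestr("0002.png", b"placeholder image bytes")

buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    pages = pair_pages(archive)
```

Pages without a matching .txt file simply come back with no text, so a volume could be uploaded before OCR has been run on every page.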
A download (e.g. in PDF format, for facsimile printing) should be possible for all pages in a volume or for all pages belonging to a chapter.
Currently, pages in fr.wikisource have names such as [[Page:Fermat - Livre 1-000008.jpg]], so "Fermat - Livre 1" could be the ZIP filename, and 000008.jpg would be the image contained within the ZIP archive. Instead of the final dash, one might consider "/" for subpages here.
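As a small illustration of that naming scheme, the following sketch splits such a page title at its last dash into a volume name and an image name, and builds the "/" subpage variant. The function names are hypothetical, and the rule "split at the last dash" is only my assumption about how the existing titles are formed.

```python
def split_page_title(title):
    """Split 'Page:<volume>-<image>' at the last dash into (volume, image)."""
    prefix = "Page:"
    name = title[len(prefix):] if title.startswith(prefix) else title
    volume, dash, image = name.rpartition("-")
    if not dash:
        raise ValueError("no dash found in page title: " + title)
    return volume, image

def subpage_title(title):
    """Rewrite the title with '/' in place of the last dash."""
    volume, image = split_page_title(title)
    return "Page:" + volume + "/" + image

volume, image = split_page_title("Page:Fermat - Livre 1-000008.jpg")
as_subpage = subpage_title("Page:Fermat - Livre 1-000008.jpg")
```

Here volume would name the ZIP file and image the member inside it.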
Next challenge: if the OCR text holds the position of each word in the image, can you combine this with JavaScript (AJAX?) to highlight (in yellow) in the image the word you are currently wiki-editing? And how do you update that position when you move text around?
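I don't know how existing software solves this, but one possible approach can be sketched. Assume the OCR gives one bounding box (x, y, width, height) per whitespace-separated word. Looking up the box under the edit cursor is then a simple search, and after an edit the boxes could be carried over to unchanged words by aligning the old and new word lists, here with difflib from the Python standard library. All names below are made up, and in a browser the same logic would have to be written in JavaScript.

```python
import difflib
import re

def word_spans(text):
    """Return (start, end) character offsets of each whitespace-separated word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def box_at_cursor(text, boxes, cursor):
    """Return the box of the word containing the cursor offset, or None."""
    for index, (start, end) in enumerate(word_spans(text)):
        if start <= cursor <= end:
            return boxes[index] if index < len(boxes) else None
    return None

def carry_boxes(old_text, old_boxes, new_text):
    """Give each word of new_text the box of its matching old word, else None."""
    old_words = old_text.split()
    new_words = new_text.split()
    new_boxes = [None] * len(new_words)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if block.a + k < len(old_boxes):
                new_boxes[block.b + k] = old_boxes[block.a + k]
    return new_boxes

# Invented example data: three words with made-up pixel boxes.
old_text = "the quick fox"
old_boxes = [(0, 0, 10, 5), (12, 0, 20, 5), (34, 0, 12, 5)]

highlight = box_at_cursor(old_text, old_boxes, 5)   # cursor inside "quick"
new_boxes = carry_boxes(old_text, old_boxes, "the quick brown fox")
```

Words inserted by the editor (such as "brown" here) end up without a box, so nothing would be highlighted for them. Words that are corrected rather than inserted would also lose their box under this scheme, which is probably too crude for real proofreading.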
How does commercial PDF/DjVu proofreading software handle this?
There is still a lot of programming to be done for this.