Re: [Wikisource-l] Wikisource & DJVu at Commons

16 Nov 2006


Alexander Klauer wrote:
> ...
> on how to upload scanned texts:
> it would be great if the MediaWiki DjVu inline renderer and the 
> ProofreadPage extension could be made to work together. Then one 
> could upload texts as DjVu with all its benefits (plain 
> text/image mixing, efficient storage, only one single file 
> upload), but one would still be able to extract single pages 
> into Wikisource's Page: namespace.

Ultimately, upload and download should be possible in DjVu, PDF, 
TIFF, and ZIP archive.  All of those formats are capable of 
storing many pages in one file.  As far as I know, DjVu and PDF 
are capable of mixing image and (OCR) text in one file, including 
the mapping of individual words to positions in the image.  In a 
ZIP archive, you could store the scanned image in 0001.jpg (or 
.png or .tif) together with OCR text in 0001.txt, etc.
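
To make the ZIP layout concrete, here is a minimal Python sketch of 
the pairing described above: each page is stored as NNNN.jpg next to 
NNNN.txt.  The function names write_volume and read_volume are 
hypothetical, and the four-digit numbering and UTF-8 text encoding 
are assumptions for illustration, not anything MediaWiki provides.

```python
import io
import zipfile

IMAGE_EXTENSIONS = ("jpg", "png", "tif")


def write_volume(pages):
    """Pack (image_bytes, ocr_text) pairs into an in-memory ZIP archive.

    Pages are numbered from 1 and stored as 0001.jpg / 0001.txt, etc.
    (hypothetical helper; the naming scheme is an assumption).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for number, (image_bytes, ocr_text) in enumerate(pages, start=1):
            stem = f"{number:04d}"
            zf.writestr(stem + ".jpg", image_bytes)
            zf.writestr(stem + ".txt", ocr_text.encode("utf-8"))
    return buf.getvalue()


def read_volume(data):
    """Return (stem, image_bytes, ocr_text) tuples, sorted by file name.

    An image without a matching .txt member gets an empty OCR text.
    """
    result = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        for name in sorted(names):
            stem, _, ext = name.rpartition(".")
            if ext.lower() not in IMAGE_EXTENSIONS:
                continue
            text_name = stem + ".txt"
            if text_name in names:
                text = zf.read(text_name).decode("utf-8")
            else:
                text = ""
            result.append((stem, zf.read(name), text))
    return result
```

Keeping image and text under the same stem means a page can be 
extracted on its own, which is what the Page: namespace would need.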

A download (e.g. in PDF format, for facsimile printing) should be 
possible for all pages in a volume or for all pages belonging to a 
chapter.

Currently, pages in fr.wikisource have names such as
[[Page:Fermat - Livre 1-000008.jpg]]
so "Fermat - Livre 1" could be the ZIP filename, and 000008.jpg
would be the image contained within the ZIP archive.  Instead of 
the dash, one might consider "/" for subpages here.
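
As a rough sketch of that naming idea, the Python snippet below 
splits a page title into a ZIP filename, the image inside the 
archive, and the "/" subpage form.  The function split_page_title 
and its regular expression are hypothetical; they assume the title 
ends in a dash followed by a numeric image name.

```python
import re

# Assumed pattern: "Page:<volume>-<digits>.<image extension>"
PAGE_RE = re.compile(
    r"^Page:(?P<volume>.+)-(?P<member>\d+\.(?:jpg|png|tif))$",
    re.IGNORECASE,
)


def split_page_title(title):
    """Split a Page: title into (zip_name, member_name, subpage_title).

    Surrounding wiki link brackets are ignored.  Raises ValueError if
    the title does not follow the assumed pattern.
    """
    match = PAGE_RE.match(title.strip("[]"))
    if match is None:
        raise ValueError(f"unrecognised page title: {title!r}")
    volume = match.group("volume")
    member = match.group("member")
    return volume + ".zip", member, f"Page:{volume}/{member}"


if __name__ == "__main__":
    print(split_page_title("[[Page:Fermat - Livre 1-000008.jpg]]"))
```

Because the volume part is matched greedily, the split happens at 
the last dash, so the dash inside "Fermat - Livre 1" is left alone.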

Next challenge: If the OCR text holds the position of each word in 
the image, can you mix this with JavaScript (AJAX?) to highlight 
(in yellow) in the image the word you are currently wiki-editing?
And how do you update that position when you move text around?
How does commercial PDF/DjVu proofreading software handle this?
There is still a lot of programming to be done for this.
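
The lookup half of the highlighting question can at least be 
sketched.  Assuming the OCR layer yields a list of words with 
bounding boxes (the Word type, build_index and word_at below are 
all hypothetical names), this Python fragment maps a cursor offset 
in the plain text to the box that would be highlighted.  It does 
not address the harder question of keeping positions in sync once 
the text has been edited.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    box: tuple  # assumed (left, top, width, height) in image pixels


def build_index(words):
    """Join the words with single spaces.

    Returns (text, starts), where starts[i] is the character offset
    of words[i] in text.
    """
    starts = []
    pos = 0
    for word in words:
        starts.append(pos)
        pos += len(word.text) + 1  # +1 for the separating space
    text = " ".join(word.text for word in words)
    return text, starts


def word_at(words, starts, cursor):
    """Return the Word under the cursor offset, or None if the cursor
    is on a separator or outside the text."""
    i = bisect_right(starts, cursor) - 1
    if i < 0:
        return None
    word = words[i]
    if cursor < starts[i] + len(word.text):
        return word
    return None
```

An editor script would call word_at on every cursor move and draw 
the returned box over the scanned image.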
-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se
