Hi Gregory and everyone,
On Jan 22, 2008 11:22 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Jan 21, 2008 7:16 PM, Jesse Martin (Pathoschild) pathoschild@gmail.com wrote:
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
Oh that gets it most of the way there.. but could I still smuggle in the coords? ;) like:
{{ocr line|551-4202-2666-4278-1|The first experiments were made on the absorption of carbonic}}
I suppose I could also make the coords base 60 or so.. so they would be shorter.
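On the base 60 idea, here is a rough Python sketch of how the shortening might work, just to get a feel for the savings. The 60-character alphabet and the function names are my own assumptions, not anything that exists in the current tooling; the only requirement is that the alphabet avoids "-", "|" and "}" so the result stays safe inside the template call.

```python
import string

# Hypothetical alphabet: digits, uppercase letters, then lowercase a-x (60 symbols).
ALPHABET = (string.digits + string.ascii_uppercase + string.ascii_lowercase)[:60]


def to_base60(n):
    """Encode a non-negative integer using the 60-symbol alphabet above."""
    if n < 0:
        raise ValueError("coordinates are assumed to be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 60)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def from_base60(s):
    """Decode a string produced by to_base60 back into an integer."""
    n = 0
    for ch in s:
        n = n * 60 + ALPHABET.index(ch)
    return n


def encode_coords(coords):
    """Re-encode each hyphen-separated decimal field in base 60."""
    return "-".join(to_base60(int(part)) for part in coords.split("-"))


def decode_coords(encoded):
    """Inverse of encode_coords."""
    return "-".join(str(from_base60(part)) for part in encoded.split("-"))


print(encode_coords("551-4202-2666-4278-1"))  # 9B-1A2-iQ-1BI-1
```

For the example line this turns 20 characters into 15, so the saving is real but modest, and the result is no longer readable as plain numbers.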
I don't understand why the HTML output needs to have the DJVU markers; they could live in the raw text only. Would it be acceptable to have one line per printed line, with hidden comments as required? For example:
---
The first experiments were made on the absorption of carbonic <!-- DJVU position: 551-4202-2666-4278-1 -->
acid gas by water: and here a singular disagreement was observed <!-- DJVU position: ... -->
in the first trials made under exactly the same circumstances. It <!-- DJVU position: ... -->
---
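To show that nothing is lost by keeping the markers in comments, here is a minimal Python sketch of how a tool might read the position back out of one such line. The regular expression and the name split_line are my own invention for illustration, and I am only assuming the position is a hyphen-separated list of integers, as in the example; I am not assuming anything about what each number means.

```python
import re

# Assumed layout: printed text, then one trailing comment of the form
# <!-- DJVU position: N-N-N-N-N --> at the end of the line.
LINE_RE = re.compile(
    r"^(?P<text>.*?)\s*<!--\s*DJVU position:\s*(?P<pos>\d+(?:-\d+)*)\s*-->\s*$"
)


def split_line(line):
    """Return (text, position) for one wikitext line.

    position is a tuple of ints, or None when the line carries no
    recognisable position comment.
    """
    match = LINE_RE.match(line)
    if match is None:
        return line.rstrip(), None
    position = tuple(int(part) for part in match.group("pos").split("-"))
    return match.group("text"), position


example = (
    "The first experiments were made on the absorption of carbonic "
    "<!-- DJVU position: 551-4202-2666-4278-1 -->"
)
text, position = split_line(example)
print(text)
print(position)
```

Lines whose comment is missing or malformed come back with None as the position, so a proofreader who accidentally deletes a comment does not break the rest of the page.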
How will words that are broken across two lines be handled?
I understand that these DJVU files will probably need a lot of corrections initially. Are you planning on updating the DJVU file on Commons incrementally, or only after the entire DJVU has been proofread?
-- John