Hi Gregory and everyone,
On Jan 22, 2008 11:22 AM, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:
On Jan 21, 2008 7:16 PM, Jesse Martin (Pathoschild)
<pathoschild(a)gmail.com> wrote:
That's a good point. How about a much cleaner syntax that can be used
to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and
it's immediately clear even to someone stumbling across the text that
we're specifically keeping track of lines (so they don't helpfully
remove unneeded line breaks). Since single line breaks are ignored by
MediaWiki, we can just use the same line width so the template syntax
lines up for easier ignoring.
Oh, that gets it most of the way there... but could I still smuggle in
the coords? ;) Like:
{{ocr line|551-4202-2666-4278-1|The first experiments were made on the absorption of carbonic}}
I suppose I could also encode the coords in base 60 or so, so they would be shorter.
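To make the base-60 idea concrete, here is a rough Python sketch of how the
coordinate string could be packed and unpacked. The 60-character alphabet
(digits, uppercase letters, then lowercase a-x) and the function names are
assumptions made up for this illustration; nothing in the thread fixes an
actual encoding.

```python
import string

# Hypothetical alphabet: 10 digits + 26 uppercase + 24 lowercase = 60 symbols.
# It avoids "-", "|" and "}" so the result stays safe inside template syntax.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase[:24]
BASE = len(ALPHABET)


def encode_number(n):
    """Return the base-60 representation of a non-negative integer."""
    if n < 0:
        raise ValueError("coordinates must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_number(s):
    """Inverse of encode_number."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


def encode_coords(coords):
    """Pack a hyphen-separated decimal coordinate string, field by field."""
    return "-".join(encode_number(int(p)) for p in coords.split("-"))


def decode_coords(packed):
    """Unpack a string produced by encode_coords back to decimal fields."""
    return "-".join(str(decode_number(p)) for p in packed.split("-"))
```

With this alphabet, encode_coords("551-4202-2666-4278-1") gives
"9B-1A2-iQ-1BI-1", which is 15 characters instead of 20.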
I don't understand why the HTML output needs to have the DJVU markers;
they could stay in the raw wikitext. Would it be acceptable to have one
line per printed line, with hidden comments as required? For example:
---
The first experiments were made on the absorption of carbonic <!-- DJVU position: 551-4202-2666-4278-1 -->
acid gas by water: and here a singular disagreement was observed <!-- DJVU position: ... -->
in the first trials made under exactly the same circumstances. It <!-- DJVU position: ... -->
---
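As a rough illustration that the comment form stays machine-readable, the
Python sketch below splits one such line into its visible text and its
coordinates. It assumes each printed line and its comment sit on a single
line of wikitext and that the comment reads exactly "DJVU position:"
followed by hyphen-separated integers; the function name is hypothetical.

```python
import re

# Matches "<visible text> <!-- DJVU position: n-n-n-n-n -->" on one line.
LINE_RE = re.compile(
    r"^(?P<text>.*?)\s*<!--\s*DJVU position:\s*"
    r"(?P<pos>\d+(?:-\d+)*)\s*-->\s*$"
)


def parse_line(line):
    """Return (text, coords); coords is None if no position comment is found."""
    match = LINE_RE.match(line)
    if match is None:
        return line.rstrip(), None
    coords = tuple(int(part) for part in match.group("pos").split("-"))
    return match.group("text"), coords
```

A line without a well-formed position comment comes back unchanged with
None for the coordinates, so hand-edited text would not break a tool
reading the page.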
How will words that are broken across two lines be handled?
I understand that these DJVU files will probably need a lot of
corrections initially. Are you planning on updating the DJVU file on
Commons incrementally, or only after the entire DJVU has been proofread?
--
John