Jesse Martin (Pathoschild) wrote:
That's a good point. How about a much cleaner syntax that can be used
to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and
it's immediately clear even to someone stumbling across the text that
we're specifically keeping track of lines (so they don't helpfully
remove unneeded line breaks). Since single line breaks are ignored by
MediaWiki, we can just use the same line width so the template syntax
lines up for easier ignoring.
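
As a rough illustration of how such markup might be generated, here is a
minimal Python sketch. It assumes the OCR output is plain text with one
printed line per text line. The function name to_ocr_line_markup is
hypothetical, and padding the lines to a common width is only one way to
make the template syntax line up.

```python
def to_ocr_line_markup(ocr_text: str) -> str:
    """Wrap each non-empty OCR line in an {{ocr line|...}} template call."""
    # Drop blank lines and surrounding whitespace from each printed line.
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    # Pad every line to the longest one so the closing braces line up.
    width = max((len(line) for line in lines), default=0)
    return "\n".join(
        "{{ocr line| " + line.ljust(width) + " }}" for line in lines
    )


if __name__ == "__main__":
    sample = (
        "The first experiments were made on the absorption of carbonic\n"
        "acid gas by water: and here a singular disagreement was observed\n"
        "in the first trials made under exactly the same circumstances. It\n"
    )
    print(to_ocr_line_markup(sample))
```
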
I'm still skeptical about what this will accomplish, but will address
that later. The above does not address the treatment of hyphens. When
MediaWiki wraps single line breaks it ignores the hyphens that break up
a word at the end of the line, and treats the word as though it were two.
Ec
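
The hyphen problem described above can be illustrated with a small Python
sketch. It is not MediaWiki's code: join_with_spaces only imitates the
effect of treating each single line break as a space, and
join_dehyphenated is a hypothetical helper showing one naive repair. The
sample lines are invented for illustration.

```python
def join_with_spaces(lines):
    """Imitate unwrapping: every single line break becomes a space."""
    return " ".join(line.strip() for line in lines)


def join_dehyphenated(lines):
    """Naive repair: glue a line ending in '-' directly onto the next line."""
    parts = []
    for line in lines:
        line = line.strip()
        if parts and parts[-1].endswith("-"):
            # Drop the trailing hyphen and append the next line without a space.
            parts[-1] = parts[-1][:-1] + line
        else:
            parts.append(line)
    return " ".join(parts)


if __name__ == "__main__":
    printed = ["and here a singular disagree-", "ment was observed"]
    print(join_with_spaces(printed))   # the word is split in two
    print(join_dehyphenated(printed))  # the word is rejoined
```

The naive repair also removes hyphens that belong in the word (a compound
that happens to break at its hyphen), so its output would still need
proofreading.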
I agree with EC here. I think you are trying to
do far too much with the same piece of text.
Perfectly readable/editable wikimarkup and exactly
matching OCR text are not possible with the same text.
I suggest you find a way to have the text exist
twice in the proofreading page. Something
like below:
<!-- Here is text with OCR breaks and hyphens which
matches the printed page-->
Here is the wikimarkup text that is transcluded to the
WS page
Of course this means both sets of text need to be
proofread, but I think a script should be able to
highlight all the differences between them, making it
simple to proofread one from the other. If you really
want to have only one version of the text, the
exactness of the OCR will have to be sacrificed. People
will always go through the markup "fixing" the
hyphens.
Birgitte SB
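
A script of the kind suggested above could be sketched as follows, using
Python's standard difflib module. This is only an assumption about how
such a comparison might work: it compares the two versions word by word,
so that differing line breaks alone are not reported. The function name
word_differences and the sample texts are hypothetical.

```python
import difflib


def word_differences(ocr_text: str, wiki_text: str):
    """Return (tag, ocr_words, wiki_words) for each span where the texts differ."""
    # Splitting on whitespace makes line breaks irrelevant to the comparison.
    ocr_words = ocr_text.split()
    wiki_words = wiki_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=wiki_words, autojunk=False)
    return [
        (tag, ocr_words[i1:i2], wiki_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    ocr = "and here a singular disagree-\nment was observed"
    wiki = "and here a singular disagreement was observed"
    for tag, ocr_words, wiki_words in word_differences(ocr, wiki):
        print(tag, ocr_words, "->", wiki_words)
```

With these sample texts the only reported difference is the hyphenated
word, which is the kind of spot a proofreader would want highlighted.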