On Jan 23, 2008 2:26 AM, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:
On Jan 22, 2008 9:26 AM, Birgitte SB
<birgitte_sb(a)yahoo.com> wrote:
I am agreeing with EC here. I think you are trying to
do far too much with the same piece of text.
Perfectly readable/editable wikimarkup and exactly
matching OCR text are not possible with the same text.
I suggest you find a way to hack having the text
existing twice in the proofreading page. Something
like below:
Okay. That's a bit outside of the realm of the work I'm interested in
doing. I'll just focus on the document images and leave the rest to
whomever else is interested.
The hyphen is a difficult problem, but it doesn't need to be a deal breaker.
Bidirectional djvu/pdf <-> wiki should be our goal, if we are to
migrate to having all content backed by images (which is a strict
policy on the German Wikisource project), but there are a few hurdles.
The two big ones are:
1. hyphens
This may be easily solved by replacing hyphens that appear at the end
of a line with a soft-hyphen in the initial OCR output; any places
where a hard hyphen or non-breaking hyphen is required will be
fixed in proofreading. Most browsers handle this correctly by simply
discarding the soft-hyphen, and after years of waiting Firefox 3
should render this correctly <
https://bugzilla.mozilla.org/show_bug.cgi?id=9101 >.
Unless there are some surprises in mediawiki's handling of
soft-hyphens, the wikitext would look like this (in this example, &shy;
could be the Unicode equivalent, U+00AD):
<!-- any OCR sync information -->line one with a extra&shy;<!-- any OCR sync information -->ordinary size words that flow onto line two
<!-- any OCR sync information --> and here is line three.
i.e. original lines one and two would need to be on a single line of
wikitext in order to prevent a newline from being emitted in the HTML,
which would invalidate the soft-hyphen.
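To make the idea concrete, here is a rough Python sketch of the OCR -> wikitext step. It is only an illustration under my own assumptions: the function name and the "<!-- line N -->" sync comment format are made up, and it treats every end-of-line hyphen as a soft-hyphen, leaving genuine hard hyphens to be restored in proofreading.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, the character behind &shy;


def ocr_lines_to_wikitext(lines):
    """Join OCR lines, turning end-of-line hyphens into soft hyphens.

    Each OCR line is prefixed with a sync comment (hypothetical format).
    A line that ends in a hyphen is NOT followed by a newline, so the
    next OCR line continues on the same wikitext line.
    """
    out = []
    for number, line in enumerate(lines, start=1):
        marker = f"<!-- line {number} -->"  # hypothetical sync information
        text = line.rstrip()
        if text.endswith("-"):
            # Hyphenated break: swap the hyphen for a soft hyphen and
            # keep the following OCR line on the same wikitext line.
            out.append(marker + text[:-1] + SOFT_HYPHEN)
        else:
            out.append(marker + text + "\n")
    return "".join(out)


if __name__ == "__main__":
    sample = ["line one with a extra-", "ordinary size words", "and line three"]
    print(repr(ocr_lines_to_wikitext(sample)))
```

The sync comments carry the original line boundaries, so nothing about the page layout is lost even though two OCR lines share one line of wikitext.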
If we wanted to get fancy, the devs could enhance mediawiki so that a
line ending with a Unicode soft-hyphen is not followed by a newline
character in the HTML. I can't see any drawback in doing that, except
for the overhead; if this is significant, it could be done in a
Wikisource only extension.
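The rule itself is tiny. As a sketch only (this is Python, not MediaWiki code, and the function name is my own invention), dropping the newline that follows a trailing soft-hyphen could look like this:

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN

# A soft hyphen, optional trailing spaces or tabs, then a line break.
_SHY_NEWLINE = re.compile(SOFT_HYPHEN + r"[ \t]*\r?\n")


def join_soft_hyphen_lines(text):
    """Remove the line break after any line that ends in a soft hyphen."""
    return _SHY_NEWLINE.sub(SOFT_HYPHEN, text)
```

With that in place, editors could keep one OCR line per wikitext line, and the joining would happen at render time.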
2. wiki markup that isn't in the original
This could be simply ignored by mandating that we don't add additional
markup until after the text has been proof-read, and the changes have
been fed back into the DJVU file. Any improvement on that position
depends on improvements in the wikitext -> DJVU process.
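For markup-free text, the wikitext -> DJVU direction can be sketched in a few lines of Python. The assumptions here are mine: any HTML comment is treated as a sync marker that starts a new original line, and a soft-hyphen at the end of a segment is turned back into a visible hyphen. Writing the lines into the DJVU text layer is left out entirely.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN

# Any HTML comment is treated as a sync marker (an assumption).
SYNC_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def wikitext_to_lines(wikitext):
    """Split proofread wikitext back into the original page lines."""
    lines = []
    for chunk in SYNC_COMMENT.split(wikitext):
        text = chunk.strip(" \t\r\n")
        if not text:
            continue  # nothing between two markers
        if text.endswith(SOFT_HYPHEN):
            # The line was broken mid-word on the page: restore the hyphen.
            text = text[:-1] + "-"
        lines.append(text)
    return lines
```

This only works while the text between markers is plain; once templates or other wiki markup appear, the split no longer maps cleanly onto the page image, which is why the markup would have to wait.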
We keep a very close eye on Recentchanges and have revision patrolling
for changes by non-admins, so any changes that may affect the ability
to slurp improvements back into the DJVU can be managed.
--
John