On Jan 23, 2008 2:26 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Jan 22, 2008 9:26 AM, Birgitte SB birgitte_sb@yahoo.com wrote:
I agree with EC here. I think you are trying to do far too much with the same piece of text. Perfectly readable/editable wikimarkup and exactly matching OCR text are not possible with the same text. I suggest you find a way to hack having the text exist twice in the proofreading page. Something like below:
Okay. That's a bit outside the realm of the work I'm interested in doing. I'll just focus on the document images and leave the rest to whoever else is interested.
The hyphen is a difficult problem, but it doesn't need to be a deal breaker.
Bidirectional djvu/pdf <-> wiki should be our goal, if we are to migrate to having all content backed by images (which is a strict policy on the German Wikisource project), but there are a few hurdles.
The two big ones are:
1. hyphens
This may be easily solved by replacing hyphens that appear at the end of a line with a soft hyphen in the initial OCR output; any places where a hard hyphen or non-breaking hyphen is required can be fixed during proofreading. Most browsers handle this correctly by simply discarding the soft hyphen when no line break is needed, and after years of waiting Firefox 3 should render this correctly < https://bugzilla.mozilla.org/show_bug.cgi?id=9101 >.
Unless there are some surprises in MediaWiki's handling of soft hyphens, the wikitext would look like this (in this example, &shy; could be the Unicode equivalent):
<!-- any OCR sync information -->line one with a extra&shy;<!-- any OCR sync information -->ordinary size words that flow onto line two <!-- any OCR sync information --> and here is line three.
i.e. original lines one and two would need to be on a single line of wikitext in order to prevent a newline from being emitted in the HTML, which would invalidate the soft hyphen.
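As a rough sketch of the conversion described above (not an existing tool; the function name, the placeholder sync comment and the input format of one string per OCR line are all assumptions), the initial OCR output could be turned into wikitext like this:

```python
SOFT_HYPHEN = "\u00ad"  # Unicode equivalent of &shy;
SYNC = "<!-- any OCR sync information -->"  # placeholder, real sync data is undefined here


def ocr_lines_to_wikitext(lines, sync=SYNC):
    """Hypothetical helper: turn line-final hyphens into soft hyphens.

    A line that ends with "-" is glued to the following line so that no
    newline separates the soft hyphen from the rest of the word.
    """
    out = []
    glue = False  # True when the previous OCR line ended with a hyphen
    for line in lines:
        line = line.strip()
        ends_with_hyphen = line.endswith("-")
        if ends_with_hyphen:
            line = line[:-1] + SOFT_HYPHEN
        piece = sync + line
        if glue:
            out[-1] += piece  # continue on the same wikitext line
        else:
            out.append(piece)
        glue = ends_with_hyphen
    return "\n".join(out)


wikitext = ocr_lines_to_wikitext([
    "line one with extra-",
    "ordinary size words that flow onto line two",
    "and here is line three.",
])
```

Every line-final hyphen is treated as soft here; genuine hard hyphens would still have to be restored by hand in proofreading, as noted above.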
If we wanted to get fancy, the devs could enhance MediaWiki so that a line ending with a Unicode soft hyphen is not followed by a newline character in the HTML. I can't see any drawback in doing that, except for the overhead; if this is significant, it could be done in a Wikisource-only extension.
2. wiki markup that isn't in the original
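To illustrate the idea only (MediaWiki itself is PHP and this is not its API; the function name is made up), the rule amounts to a single substitution on the text before it is rendered:

```python
import re

SOFT_HYPHEN = "\u00ad"

# Matches a soft hyphen, optional trailing blanks, then the line break.
_SHY_NEWLINE = re.compile(SOFT_HYPHEN + r"[ \t]*\r?\n")


def drop_newline_after_soft_hyphen(text):
    """Hypothetical filter: join a line ending in a soft hyphen to the next line."""
    return _SHY_NEWLINE.sub(SOFT_HYPHEN, text)


joined = drop_newline_after_soft_hyphen("extra\u00ad\nordinary\nnext line")
```

With such a filter in place, editors could keep one wikitext line per original line and still get an unbroken word in the HTML.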
This could simply be sidestepped by mandating that we don't add additional markup until after the text has been proofread and the changes have been fed back into the DjVu file. Any improvement on that position depends on improvements in the wikitext -> DjVu process.
We keep a very close eye on Recentchanges and have revision patrolling for changes by non-admins, so any changes that may affect the ability to slurp improvements back into the DjVu file can be managed.
-- John