Birgitte SB wrote:
I think you are trying to
do far too much with the same piece of text.
Perfectly readable/editable wikimarkup and exactly
matching OCR text are not possible with the same text.
I suggest you find a way to have the text exist
twice in the proofreading page. Something
like below:
<!-- Here is text with OCR breaks and hyphens which
matches the printed page-->
Here is the wikimarkup text that is transcluded to the
WS page
Of course this means both sets of text need to be
proofread, but I think a script should be able to
highlight all the differences between them making it
simple to proofread one from the other. If you really
want to have only one version of the text, it will
have to have the exactness of OCR sacrificed. People
will always go through the markup "fixing" the
hyphens.
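
As a rough illustration of the difference-highlighting script suggested
above, here is a minimal Python sketch. It is only an assumption about how
such a script might work, not anything that exists for the proofreading
page: the function names normalize_ocr and word_differences are
hypothetical, and it relies only on the standard re and difflib modules.
It rejoins words hyphenated at line ends in the OCR copy, then reports the
word spans where the two copies disagree.

```python
import difflib
import re


def normalize_ocr(text):
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # Assumption: a hyphen at the end of a line is a typesetting break,
    # not part of a genuinely hyphenated compound.
    joined = re.sub(r"-\s*\n\s*", "", text)
    return " ".join(joined.split())


def word_differences(ocr_text, wiki_text):
    """Return (tag, ocr_words, wiki_words) for each span where the texts differ."""
    ocr_words = normalize_ocr(ocr_text).split()
    wiki_words = wiki_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, wiki_words, autojunk=False)
    return [
        (tag, ocr_words[i1:i2], wiki_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    ocr = "were strung two and two, and their beautiful white-\nness relieved"
    wiki = "were strung two and two, and their beautiful whiteness relieved"
    # Line-end hyphenation alone is not reported as a difference.
    print(word_differences(ocr, wiki))
```

The obvious weakness is that a compound word which really is hyphenated,
and happens to break at the end of a line, would be joined wrongly and
flagged as a difference. A proofreader would still have to judge those
cases by eye.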
The whole proposal seems to be a case of biting off more than
we can chew. I can give ThomasV's approach to having all material
backed up by page scans full marks for what it sets out to do, but that
doesn't change the fact that some editors still find it more
convenient to sub-optimally upload entire books from Project Gutenberg
with little more additional effort than breaking off chapters into
separate pages and adding headers. Unless we can get real people to do
tedious but relatively non-technical tasks such as proofreading, how can
we ever convince them to remain consistent with technical tasks whose
benefits are far from obvious?
Eighteenth century scientific texts may have done well with only a
single printing, but more popular works that had multiple editions
present a challenge unless we can declare a particular printing to be
canonical. The best printing for this may not be easily or cheaply
available. As an example, I have an almost complete set of the Ticknor
and Fields version of the works of Thomas De Quincey. In the course of
putting this together I ended up with apparently duplicate volumes. In
the case of the second volume of the "Theological Essays" I have both an
1854 and an 1864 printing. The 1854 edition goes to page 276 and the
1864 edition to page 315. The later edition adds an essay missing from
the earlier.
The first three lines of page 71 of the 1864 printing, from "Toilette of
the Hebrew Lady" and ending a paragraph, read:
"the precious stones; and at other times, the pearls
were strung two and two, and their beautiful white-
ness relieved by the interposition of red coral."
In the 1854 printing the same text appears as lines 27-9 of page 69,
except that "whiteness" now appears fully on the middle line without
hyphenation. Footnotes that were at the end of an essay in 1854 are
moved to the proper page in 1864.
At one time, if a second printing was needed, it was easier and cheaper
to reset the type, with all the attendant errors that one might imagine.
Labour was cheap, and manufactured type very expensive.
Ec