On 07/17/2013 12:57 PM, Alex Brollo wrote:
FineReader OCR stores incredibly detailed information in [...] abbyy.xml
On the other hand, Wikisource is a wiki, and what its users edit is wiki text. Sure, you could insert the XML there and let users edit it directly, but that would scare more users away and leave more room for mistakes.
For example, if proofreading Hamlet,
To be or not to bc, that is the question,
anybody can easily spot "bc" and correct that. In the XML version,
<word x=1 y=1>To</word> <word x=5 y=1>be</word> <word x=8 y=1>or</word>
someone might think that "or" should be a little more to the right, so one user inserts a space between the tag "<word x=8 y=1>" and "or", while another user changes the tag to "<word x=9 y=1>". Meanwhile, all the tags make it harder to spot the OCR error "bc".
Even if you replace manual XML editing with a graphical tool, you get the same ambiguity between adding whitespace and adjusting coordinates.
This is a nightmare that we avoid by throwing away all the coordinates and just proofreading the plain text. It is not a perfect system; it's a compromise that lets us get some useful work done.
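
To make the "throw away the coordinates" step concrete, here is a minimal Python sketch. It assumes the simplified <word x=.. y=..> markup from my example above, which is hypothetical and much simpler than what a real abbyy.xml contains; the function name to_plain_text is likewise just for illustration.

```python
import re

# Matches the simplified, hypothetical markup used in the example above,
# e.g. <word x=5 y=1>be</word>. Real abbyy.xml is far more detailed.
WORD_TAG = re.compile(r"<word\s+x=(\d+)\s+y=(\d+)>(.*?)</word>")


def to_plain_text(markup: str) -> str:
    """Drop the coordinates and return the words as plain text lines."""
    lines = {}
    for x, y, text in WORD_TAG.findall(markup):
        # Group words by their line (y); keep x only to restore reading order.
        lines.setdefault(int(y), []).append((int(x), text.strip()))
    return "\n".join(
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    )


original = "<word x=1 y=1>To</word> <word x=5 y=1>be</word> <word x=8 y=1>or</word>"
# The two competing "fixes": extra whitespace versus an adjusted coordinate.
with_space = original.replace("<word x=8 y=1>or", "<word x=8 y=1> or")
with_shift = original.replace("x=8", "x=9")

for variant in (original, with_space, with_shift):
    print(to_plain_text(variant))
```

All three variants come out as the same plain text, so once the coordinates are gone the whitespace-versus-coordinate question cannot even arise, and the proofreader is left with only the words to check.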