On Fri, Jul 19, 2013 at 8:13 AM, Thibaut Horel thibaut.horel@gmail.com wrote:
I don't see the possibility of directly editing the ABBYY xml file happening any time soon. In theory, it should be possible, since that is somewhat similar to what Visual Editor is doing: providing a WYSIWYG interface to edit structured data (html+rdf in VE's case). But that's a (very) long-term plan, and its relevance is not even clear to me. In this regard, I agree with what David and Alex said.
Still, there are two things we could do with these xml files:
- extract information beyond the raw text to do some pre-formatting prior to page creation: this could include paragraphs, centered text, etc. Some good OCR/layout detection software can even detect font information, such as bold or italic. However, and I could be wrong here, it seems to me that the impact of such pre-formatting would be limited: when proofreading, most of the time is spent correcting OCR mistakes, while formatting can be done on the go at an almost negligible time cost.
I still think that doing most of the work automatically (if possible) would be a good idea. I actually like formatting (e.g. bold, italics) much more than I like proofreading OCR, but I also think that the less burden we put on our proofreaders, the better. I mean, if I'm proofreading a text and I see that it is already well formatted, it saves time; if it's formatted badly, I can still correct it, right?
- import the proofread text back into the xml file. By doing so, we would recover the position of words across the page for the proofread text. This would allow us to provide PDFs with a curated text layer. Such PDFs would be truly and fully searchable, which I think would be highly valuable for bibliophiles. This task requires aligning two texts: mapping each word in the proofread text to one word in the original ABBYY file (this is not entirely accurate, since two words are sometimes recognized as a single word by the OCR, and vice versa). I have a few ideas on how to properly solve this problem: it is actually very similar (and even simpler!) to the so-called "phrase alignment" problem found in machine translation and natural language processing, and the probabilistic models used there could easily be adapted to our problem. I know that some attempts have been made in the past to tackle this problem, but I don't have a clear view of what exactly has been tried, or how successful the attempts were. I would highly appreciate any information you could have about this.
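As a rough illustration of the alignment step described above, here is a minimal Python sketch. It is not the probabilistic phrase-alignment model mentioned above: it swaps in a plain sequence diff from the standard library's difflib, which is only a crude baseline. The (text, box) input format and the names align_words and merge_boxes are assumptions made for illustration, not part of the ABBYY format.

```python
from difflib import SequenceMatcher


def merge_boxes(boxes):
    """Return the smallest box (left, top, right, bottom) covering all boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def align_words(ocr_words, proofread_words):
    """Attach an OCR bounding box to each proofread word where possible.

    ocr_words: list of (text, box) pairs, box = (left, top, right, bottom).
    proofread_words: list of strings.
    Returns a list of (word, box or None) in proofread order.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs: one-to-one mapping.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned.append((proofread_words[j], ocr_words[i][1]))
        elif tag in ("replace", "insert"):
            # Corrected, split, or merged words: every proofread word in the
            # span shares the union of the OCR boxes in the span (None if the
            # OCR has nothing there).
            boxes = [box for _, box in ocr_words[i1:i2]]
            merged = merge_boxes(boxes) if boxes else None
            for j in range(j1, j2):
                aligned.append((proofread_words[j], merged))
        # tag == "delete": OCR words with no proofread counterpart are dropped.
    return aligned


# Hypothetical example: the OCR read "bc" where the proofread text has "be".
ocr = [("To", (0, 0, 10, 5)), ("be", (12, 0, 20, 5)), ("or", (22, 0, 30, 5)),
       ("not", (32, 0, 44, 5)), ("to", (46, 0, 52, 5)), ("bc", (54, 0, 62, 5))]
proofread = ["To", "be", "or", "not", "to", "be"]
for word, box in align_words(ocr, proofread):
    print(word, box)
```

Sharing one merged box across a span keeps the sketch simple when the OCR fused or split words; a real solution would presumably want a finer-grained model, such as the probabilistic one suggested above.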
I think Seb35 studied the subject a bit a few years ago, with all the probabilistic things and Markov chains and funny stuff you all like :-) (It always amazes me how many mathematicians and the like are involved in Wikisource. My conclusion is that we like to put order in abstract spaces.)
Aubrey
Thibaut
On 07/17/2013 10:13 PM, David Cuenca wrote:
I agree with Alex, the xml is not about getting editors to work with it, but about improving the output of the text. If it can be combined with the Visual Editor to add some pre-formatting and maybe signal which words are unclear, that would already be a big improvement.
If in addition to that, it can be used to compare proofread text with ocr text for remapping purposes, even better.
Micru
On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo alex.brollo@gmail.com wrote:
Perhaps there's a misinterpretation - I mentioned abbyy.xml, but with no plan to import it as is; abbyy.xml is only a surprising data container from which to extract anything useful to speed up proofreading (and formatting) - nothing more than this.
Just an example: the vertical djvu coordinates of lines can be used to get the font size; the horizontal coordinates of lines can be used to recognize centered text; paragraph splitting is valuable; columns can be recognized, margins too; with some effort, poems can probably be detected as well.
Far from simply importing coordinates, it's a matter of using them as best we can; no data, no information content.
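To make the coordinate idea above concrete, here is a small Python sketch, offered only as an assumption-laden illustration. It takes line bounding boxes, uses line height as a stand-in for font size, and flags a line as centered when its left and right margins within the text block are similar and clearly non-zero. The dictionary keys (text, left, top, right, bottom), the function name classify_lines, and the threshold values are all hypothetical; they are not the actual abbyy.xml or djvu field names.

```python
from statistics import median


def classify_lines(lines, tolerance=0.05, large_ratio=1.5):
    """Guess simple formatting hints from line bounding boxes.

    lines: list of dicts with keys text, left, top, right, bottom (pixels).
    Returns a list of dicts with keys text, height, centered, large.
    """
    # The text block is taken to span from the leftmost to the rightmost
    # line edge, so that full-width lines get margins close to zero.
    block_left = min(line["left"] for line in lines)
    block_right = max(line["right"] for line in lines)
    slack = tolerance * (block_right - block_left)
    typical_height = median(line["bottom"] - line["top"] for line in lines)

    hints = []
    for line in lines:
        height = line["bottom"] - line["top"]
        left_margin = line["left"] - block_left
        right_margin = block_right - line["right"]
        # Centered: margins roughly equal, and the line is clearly indented.
        centered = abs(left_margin - right_margin) <= slack and left_margin > slack
        hints.append({
            "text": line["text"],
            "height": height,
            "centered": centered,
            # Noticeably taller than the typical line: probably a heading.
            "large": height > large_ratio * typical_height,
        })
    return hints


# Hypothetical page: a centered heading followed by two body lines.
page = [
    {"text": "HAMLET", "left": 400, "top": 100, "right": 600, "bottom": 140},
    {"text": "To be or not to be, that is the question,",
     "left": 100, "top": 200, "right": 900, "bottom": 220},
    {"text": "a short last line", "left": 100, "top": 230, "right": 350, "bottom": 250},
]
for hint in classify_lines(page):
    print(hint)
```

Measuring margins against the text block rather than the page edge is a deliberate choice here: otherwise every full-width line with symmetric page margins would look centered.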
Alex
2013/7/17 Lars Aronsson lars@aronsson.se
On 07/17/2013 12:57 PM, Alex Brollo wrote:
FineReader OCR stores an incredibly detailed information in [...] abbyy.xml
At the other end, Wikisource is a wiki that edits wiki text. Sure, you could insert the XML there and let users edit the XML, but that would scare more users away and allow for more mistakes.
For example, if proofreading Hamlet,
To be or not to bc, that is the question,
anybody can easily spot "bc" and correct that. In the XML version,
<word x=1 y=1>To</word> <word x=5 y=1>be</word> <word x=8 y=1>or</word>
someone might think that "or" should be a little more to the right, so one user inserts a space between the tag "<word x=8 y=1>" and "or", while another user adjusts the tag to "<word x=9 y=1>". All the tags make it harder to spot the OCR error "bc".
Even if you replace manual XML editing with some graphic tool, you get the same ambiguity between adding whitespace and adjusting coordinates.
This is a nightmare that we avoid by throwing away all the coordinates and just proofreading the plain text. It is not the perfect system, it's a compromise, in order to get some useful work done.
-- Lars Aronsson (lars@aronsson.se) Project Runeberg - free Nordic literature - http://runeberg.org/
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
-- Etiamsi omnes, ego non