Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu


David Cuenca

16 Jul 2013

4:35 p.m.

Hi Aubrey,

Thanks for the heads-up. I have CC'ed Sébastien from fr-ws; he worked on the djvu text extraction/merging and was interested in following up on that. Maybe he has some fresh ideas about it.

Micru

On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:

...

Hi David, Aarti, Thibaut and Tpt,

please look at this thread:
http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
especially the last message.

It seems George Orwell III knows his stuff about Djvu and the Proofread extension, and it's probably worth digging into this "layer text" djvu thing. I might dream of an ideal solution (a "layered structure" for Wikisource, in which text can be marked up several times in different layers), but that is probably very far away. It's still important to pave the way for further improvements, I guess: losing all the information from a formatted, mapped IA djvu is not a good thing to do, IMHO. And the Visual Editor could help us, in the future, to keep some of that information (italics, bold, etc.).

I know Aarti spoke with Alex about abbyy.xml: is it possible to do something with it?

Aubrey

-- Etiamsi omnes, ego non


Alex Brollo

17 Jul

12:57 p.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

Just a brief comment about the djvu text layer, using IA files to dig deeper into the topic.

FineReader OCR stores incredibly detailed information in a proprietary format; the various FineReader versions then export parts of this extremely rich set of information into different outputs, one of them being the djvu text layer. It is worth noting that, although any information stored in the djvu text layer can be extracted and used, the information wrapped in the djvu text layer (in either the lisp-like format or the xml format) is only a minor subset of the original OCR information.

Anyone interested in much more information can find it in the abbyy.xml output; Internet Archive offers it as abbyy.gz in the list of exportable files. It is a very heavy and complex xml structure, but it is possible to parse it and to extract from it everything wrapped in the djvu text layer and much more - most interestingly wordPenalty, which summarizes, word by word, the degree of uncertainty of the OCR recognition of the whole word.

We (Aarti and I) are digging into this mess, with fast preliminary results; in [[it:w:Utente:Alex brollo/Sandbox]] you can see some brief pieces of text extracted from abbyy.gz, where doubtful words (in the opinion of the OCR software) are red. They can be easily managed by VisualEditor, since the highlighting comes from a simple span tag.

Now I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage is running, it will be possible to extract text by bot from abbyy.gz (if the work comes from IA) and to upload that text as OCR.

Alex

2013/7/16 David Cuenca <dacuetu(a)gmail.com>

...

-- Etiamsi omnes, ego non
_______________________________________________
Wikisource-l mailing list
Wikisource-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
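The extraction Alex describes (pull words out of the ABBYY XML and mark the doubtful ones with a span) might look roughly like the sketch below. It is only an illustration under stated assumptions: each character is taken to be a charParams element, the first character of a word is assumed to carry wordStart="true", and wordPenalty is assumed to be an integer repeated on the characters of a word. The threshold of 10, the red styling and the sample document are hypothetical; this is not the script Alex and Aarti actually used.

```python
import html
import xml.etree.ElementTree as ET

# Tiny hand-made sample; real abbyy.gz files are far larger and use ABBYY's
# own namespace, so elements are matched by local name only.
SAMPLE = """<document xmlns="http://example.org/hypothetical-abbyy">
<page><line>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">o</charParams>
<charParams wordStart="false"> </charParams>
<charParams wordStart="true" wordPenalty="36">b</charParams>
<charParams wordStart="false" wordPenalty="36">c</charParams>
</line></page></document>"""


def local_name(tag):
    """Strip a namespace prefix such as '{http://...}charParams'."""
    return tag.rsplit("}", 1)[-1]


def words_with_penalty(abbyy_xml):
    """Return (word, penalty) pairs from an ABBYY-style XML string."""
    words = []
    chars, penalty = [], 0
    for elem in ET.fromstring(abbyy_xml).iter():
        if local_name(elem.tag) != "charParams":
            continue
        char = elem.text or " "
        starts_word = elem.get("wordStart") in ("true", "1")
        # Whitespace or an explicit word start closes the word collected so far.
        if (char.isspace() or starts_word) and chars:
            words.append(("".join(chars), penalty))
            chars = []
        if char.isspace():
            continue
        if not chars:
            # Assumption: the first character carries the penalty of its word.
            penalty = int(elem.get("wordPenalty", "0"))
        chars.append(char)
    if chars:
        words.append(("".join(chars), penalty))
    return words


def mark_doubtful(words, threshold=10):
    """Wrap words whose penalty exceeds the (hypothetical) threshold in a span."""
    out = []
    for word, penalty in words:
        text = html.escape(word)
        if penalty > threshold:
            text = '<span style="color:red">%s</span>' % text
        out.append(text)
    return " ".join(out)


print(mark_doubtful(words_with_penalty(SAMPLE)))
```

For a real Internet Archive file, the gzip-compressed XML would first be read with something like Python's gzip module, and the threshold would need tuning against actual wordPenalty values.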

David Cuenca

5:12 p.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

I'm forwarding this message by George Orwell III on en-ws [1]. I think it is extremely important, as it offers an insight into what is wrong with Djvu handling on Wikisource.

"We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates) because the original PHP contributing a-hole for the DjVu routine on our servers never bothered to finish the part where the internal DjVu text layer is converted to a (coordinate rich) XML file using the existing DjVuLibre software package because, at the time, the software had issues.

"That faulty DjVuLibre version was the equivalent of 4,317 versions ago and the issue has been long fixed now EXCEPT that the .DTD file needed to base the plain-text to XML conversion on still has the wrong 'folder path' on local DjVuLibre installs (if this is true on server installs as well, I cannot say for sure). Once I copied the folder to the [wrong] folder path, I was able to generate the XMLs all day long. These XMLs are just like the ones IA generates during their process (in addition to the XML that ABBYY generates for them).

"So it's not that we as a community decided not to follow through with (coordinate rich) XML generation but got stuck with the plain-text dump workaround due to a DjVuLibre problem that no longer exists. Plus, the guy who created the beginnings of this fabulous disaster was like a tick with an attention span deficit and moved on to conjuring up some other blasted thing or another instead of following up on his own workaround & finishing the XML coding portion once the DjVuLibre glitch was fixed." -- 15:16, 15 July 2013 (UTC)

[1] http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext

On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:

...


-- Etiamsi omnes, ego non
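To make the "coordinate rich XML" in the forwarded message concrete, here is a minimal sketch of reading word boxes from XML of the kind DjVuLibre's djvutoxml tool produces. It assumes that words appear as WORD elements with a comma-separated coords attribute; because the order of the four numbers is not something this thread states, the code takes minima and maxima instead of relying on a particular order. The sample document and the function name word_boxes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the general shape of DjVu hidden-text XML.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,250,160,200">To</WORD>
<WORD coords="180,250,230,200">be</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def word_boxes(djvu_xml):
    """Yield (text, x_min, y_min, x_max, y_max) for every WORD element."""
    for word in ET.fromstring(djvu_xml).iter("WORD"):
        values = word.get("coords", "").split(",")
        if len(values) < 4:
            continue  # no usable coordinates on this word
        a, b, c, d = (int(v) for v in values[:4])
        # Do not assume which corner comes first; normalise to min/max.
        yield (word.text or "", min(a, c), min(b, d), max(a, c), max(b, d))


for box in word_boxes(SAMPLE):
    print(box)
```

These four numbers per word are exactly the X-min, Y-min, X-max and Y-max that the plain-text dump discards.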

Lars Aronsson

6:31 p.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

On 07/17/2013 12:57 PM, Alex Brollo wrote:

...

FineReader OCR stores an incredibly detailed information in [...] abbyy.xml

On the other hand, Wikisource is a wiki that edits wiki text. Sure, you could insert the XML there and let users edit the XML, but that would scare more users away and allow for more mistakes. For example, if proofreading Hamlet,

    To be or not to bc, that is the question,

anybody can easily spot "bc" and correct that. In the XML version,

    <word x=1 y=1>To</word>
    <word x=5 y=1>be</word>
    <word x=8 y=1>or</word>

someone might think that "or" should be a little more to the right, so one user inserts a space between the tag "<word x=8 y=1>" and "or", while another user adjusts the tag to "<word x=9 y=1>". All the tags make it harder to spot the OCR error "bc".

Even if you replace manual XML editing with some graphic tool, you get the same ambiguity between adding whitespace and adjusting coordinates. This is a nightmare that we avoid by throwing away all the coordinates and just proofreading the plain text. It is not the perfect system, it's a compromise, in order to get some useful work done.

--
Lars Aronsson (lars(a)aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/

Alex Brollo

9:26 p.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

Perhaps there's a misinterpretation - I mentioned abbyy.xml, but with no plan to import it as it is; abbyy.xml is only a surprisingly rich data container from which to extract anything useful to speed up proofreading (and formatting) - nothing more than this.

Just an example: vertical djvu coordinates of lines can be used to get the font size; horizontal coordinates of lines can be used to recognize centered text; paragraph splitting is valuable; columns can be recognized; margins too; with some effort, poems can probably be detected as well.

Far from simply importing coordinates, it's a matter of using them as well as we can; without the data, there is no information to work with.

Alex

2013/7/17 Lars Aronsson <lars(a)aronsson.se>

...

On 07/17/2013 12:57 PM, Alex Brollo wrote:

FineReader OCR stores an incredibly detailed information in [...] abbyy.xml
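The kind of pre-formatting Alex has in mind can be sketched with two simple heuristics on line bounding boxes: a line is flagged as centered when its left and right gaps are about equal and it starts well inside the body margin, and as larger when it is noticeably taller than the median line. This is only an illustration; the function name, the 5% tolerance, the 1.3 height ratio and the sample coordinates are all hypothetical and would need tuning on real pages.

```python
from statistics import median

# (text, x_min, y_min, x_max, y_max) for three lines on a page 1000 units wide.
LINES = [
    ("ACT III", 400, 100, 600, 160),
    ("To be, or not to be, that is the question:", 100, 200, 900, 240),
    ("Whether 'tis nobler in the mind to suffer", 100, 250, 760, 290),
]


def classify_lines(lines, page_width, tolerance=0.05, large_ratio=1.3):
    """Return (text, is_centered, is_larger) for each line box."""
    if not lines:
        return []
    slack = tolerance * page_width
    heights = [y_max - y_min for _, _, y_min, _, y_max in lines]
    typical_height = median(heights)
    # The leftmost line start approximates the left margin of the body text.
    body_left = min(x_min for _, x_min, _, _, _ in lines)
    result = []
    for (text, x_min, _, x_max, _), height in zip(lines, heights):
        left_gap = x_min
        right_gap = page_width - x_max
        centered = abs(left_gap - right_gap) <= slack and x_min - body_left > slack
        larger = height > large_ratio * typical_height
        result.append((text, centered, larger))
    return result


for row in classify_lines(LINES, page_width=1000):
    print(row)
```

A bot could turn these flags into wiki markup before a page is created, leaving the proofreader free to correct anything the heuristic gets wrong.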

David Cuenca

10:13 p.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

...

On 07/17/2013 12:57 PM, Alex Brollo wrote:

FineReader OCR stores an incredibly detailed information in [...] abbyy.xml


-- Etiamsi omnes, ego non

Thibaut Horel

19 Jul

8:13 a.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

I don't see the possibility of directly editing the ABBYY xml file happening any time soon. In theory, it should be possible, since that is somewhat similar to what Visual Editor is doing: providing a WYSIWYG interface to edit structured data (html+rdf in VE's case). But that's a (very) long-term plan, and its relevance is not even clear to me. In this regard, I agree with what David and Alex said.

Still, there are two things we could do with these xml files:

* extract information beyond the raw text to do some pre-formatting prior to the page creation: this could include paragraphs, centered text, etc. Some good OCR/layout detection software is even able to detect font information, like bold or italic. However, and I could be wrong here, it seems to me that the impact of such pre-formatting would be limited: when proofreading, most of the time is spent correcting OCR mistakes; the formatting can be done on the go and has an almost negligible time cost.

* import the proofread text back into the xml file. By doing so, we would recover the position of words across the page for the proofread text. This would allow us to provide PDFs with a curated text layer. Such PDFs would be truly and fully searchable, which I think would be highly valuable for bibliophiles. This task requires aligning two texts: mapping each word in the proofread text to one word in the original ABBYY file (this is not entirely accurate, since two words are sometimes recognized as a single word by the OCR, and vice versa). I have a few ideas on how to properly solve this problem: it is actually very similar (and even simpler!) to the so-called "phrase alignment" problem found in machine translation and natural language processing, and the probabilistic models it uses could easily be adapted to our problem.

I know that some attempts have been made in the past to tackle this problem, but I don't have a clear view of what has been tried exactly, and how successful the attempts were. I would highly appreciate any information you could have about this.

Thibaut

On 07/17/2013 10:13 PM, David Cuenca wrote:

...

I agree with Alex, the xml is not about getting editors to work with it, but to improve the output of the text. If it can be combined with the Visual Editor to add some pre-formatting and maybe signaling which words are unclear, that would be already a big improvement. If in addition to that, it can be used to compare proofread text with ocr text for remapping purposes, even better.

Micru
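As a rough illustration of the remapping Thibaut describes, the sketch below aligns proofread words with OCR words and carries the OCR bounding boxes over. Note the swapped-in technique: it uses the generic sequence matcher from Python's difflib, not the probabilistic phrase-alignment models mentioned in the message. When the OCR merged or split words, every proofread word in the mismatched stretch simply receives the union of the OCR boxes there. The function names and sample data are hypothetical.

```python
from difflib import SequenceMatcher

# OCR output with a merged word ("notto") and a misread word ("bc").
OCR = [
    ("To", (0, 0, 10, 10)),
    ("be", (12, 0, 22, 10)),
    ("or", (24, 0, 34, 10)),
    ("notto", (36, 0, 60, 10)),
    ("bc", (62, 0, 72, 10)),
]
PROOFREAD = ["To", "be", "or", "not", "to", "be"]


def union(boxes):
    """Smallest (x_min, y_min, x_max, y_max) box covering all given boxes."""
    boxes = list(boxes)
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def align(ocr_words, proofread_words):
    """Return (proofread_word, box_or_None) pairs in reading order."""
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(None, ocr_texts, proofread_words, autojunk=False)
    aligned = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_span = ocr_words[i1:i2]
        new_span = proofread_words[j1:j2]
        if len(ocr_span) == len(new_span):
            # Same number of words on both sides: pair them one to one.
            aligned.extend((w, box) for w, (_, box) in zip(new_span, ocr_span))
        else:
            # Merged, split, inserted or deleted words: share one covering box.
            box = union(b for _, b in ocr_span) if ocr_span else None
            aligned.extend((w, box) for w in new_span)
    return aligned


for word, box in align(OCR, PROOFREAD):
    print(word, box)
```

The shared box is coarse, but it keeps every proofread word searchable at roughly the right place on the page; a finer model could split the box in proportion to word lengths.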

Andrea Zanni

9:19 a.m.

New subject: Proofread extension "extraction" of OCR text in Djvu

On Fri, Jul 19, 2013 at 8:13 AM, Thibaut Horel <thibaut.horel(a)gmail.com> wrote:

...

I still think that doing most of the work automatically (if possible) would be a good idea. I actually like formatting (eg bold, italics) much more than I like proofreading OCR, but I also think that the less burden we give our proofreaders the better it is. I mean, if I'm proofreading a text, and I see the text is already well formatted, it saves time: if it's formatted badly, I can still correct it, right?

...

* import the proofread text back into the xml file. By doing so, we would recover the position of words across the page for the proofread text. This would allow us to provide PDFs with a curated text layer. Such PDFs would be truly and fully searchable, which I think would be highly valuable for bibliophiles. This task somehow requires to align two texts: map each word in the proofread text to one word in the original ABBYY file (this is not entirely accurate since two words are sometimes recognized as a single word by the OCR, and vice versa). I have a few ideas on how to properly solve this problem: it is actually very similar (and even simpler!) to the so-called "phrase alignment" problem found in machine translation and natural language processing and the probabilistic models it uses could easily be adapted to our problem. I know that some attempts have been made in the past to tackle this problem, but I don't have a clear view of what has been tried exactly, and how successful the attempts were. I would highly appreciate any information you could have about this.

I think Seb35 studied the subject a bit a few years ago, with all the probabilistic things and Markov chains and funny stuff you all like :-) (It always amazes me how many mathematicians or the like are involved in Wikisource. My conclusion is that we like to put order in abstract spaces.)

Aubrey

...


La Vallen

25 Jul

10:56 a.m.

New subject: Edwardsbot

(Translated from Swedish:) What bright spark came up with the brilliant idea of sending out a message to every single user talk page on the whole of Wikisource? /Ronnie

Andrea Zanni

11:31 a.m.

New subject: Edwardsbot

Hi Ronnie,

the message for joining the Wikisource User Group, among other news, was prepared by me (Aubrey) and Micru, and translated by fellow users into different languages. We did not send the message to all users of all Wikisources; we used Wikisource user statistics to select active users. You can find the list here:
http://meta.wikimedia.org/wiki/Global_message_delivery/Targets/Wikisource_c…

The message sent is here:
http://meta.wikimedia.org/wiki/Wikisource_User_Group/Invitation

Unfortunately (from what I can understand), EdwardsBot signs itself as a bot, so this probably led to some confusion. Sorry for that.

Aubrey

2013/7/25 La Vallen <la.vallen(a)yahoo.se>

...


Nahum Wengrov

1:20 p.m.

New subject: Edwardsbot

It's also confusing that the message is translated, leading recipients to assume the list is interlingual. It would be better to send the main message in English, with a short note at the beginning in the recipient's language saying that the message is for English speakers, and an apology if the recipient doesn't read it. Just my 2 cents.

On Thu, Jul 25, 2013 at 12:31 PM, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:

...


Andrea Zanni

2:05 p.m.

New subject: Edwardsbot

Well, the list can be interlingual if we want it :-) Ronnie just wrote in Swedish, but Google Translate helps a lot :-)

Aubrey

On Thu, Jul 25, 2013 at 1:20 PM, Nahum Wengrov <novartza(a)gmail.com> wrote:

...



participants (7)

Alex Brollo
Andrea Zanni
David Cuenca
La Vallen
Lars Aronsson
Nahum Wengrov
Thibaut Horel