Greetings.
Off and on for many months I've been working on a project to import a large collection of public domain historic scientific documents into Wikimedia's collection.
My standing plan has been to pre-organize and catalog the collection, then upload the document images as DJVU files (which are utterly tiny compared to TIFFs or PDFs) to Commons, including an OCRed text layer (for search and copy-and-paste).
I would then begin importing documents into Wikisource, starting with the OCR but eventually having fully marked-up output. From there the documents could be extensively linked and referenced from the other Wikimedia projects.
Most of the delays in my work have been waiting for free software OCR technology to be able to handle documents from the 18th century. With the recent beta releases of Ocropus and Tesseract from Google I feel the results are finally good enough to move forward.
I do have some open questions though.
I'd really like it if the corrected text in Wikisource could be imported back into the djvu document images. What I'd like to do is leave invisible markup generated by the OCR software in the page text, like this:
<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>The first experiments were made on the absorption of carbonic</span>
<span class='ocr_line' title='bbox 474 4281 2668 4355 1'>acid gas by water: and here a singular disagreement was observed</span>
<span class='ocr_line' title='bbox 471 4360 2668 4433 1'>in the first trials made under exactly the same circumstances. It</span>
From this the OCRed text could be corrected and markup could be added, but I could still take the output and apply it back to the original document. If people feel this would frustrate editing too much, we could make some JavaScript hacks to the edit box to reduce the span tags to nothing more than an immutable <S marker.
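For the apply-back step I'm picturing something roughly like the sketch below (Python, untested; the regex, the page-dimension handling and the exact djvused incantation are my assumptions rather than finished tooling):

import re, subprocess

LINE = re.compile(r"<span class='ocr_line' title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'>(.*?)</span>", re.S)

def page_sexpr(wikitext, page_w, page_h):
    # Collect the ocr_line spans and flip y, since hOCR bboxes count from the
    # top-left corner while DjVu hidden text counts from the bottom-left.
    lines = []
    for x1, y1, x2, y2, text in LINE.findall(wikitext):
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        clean = re.sub(r'\s+', ' ', text).strip().replace('"', '\\"')
        lines.append('  (line %d %d %d %d "%s")' %
                     (x1, page_h - y2, x2, page_h - y1, clean))
    return '(page 0 0 %d %d\n%s)' % (page_w, page_h, '\n'.join(lines))

def push_back(djvu_file, page_no, sexpr):
    # Hand the expression to djvused: select the page, replace its text layer, save.
    with open('hidden-text.txt', 'w') as f:
        f.write(sexpr)
    subprocess.check_call(['djvused', djvu_file, '-e',
                           'select %d; set-txt hidden-text.txt' % page_no, '-s'])

A real bot would pull the page dimensions out of the DjVu itself and do a lot more sanity checking.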
Would this be acceptable?
Nice idea. Note that we now have an OCR server running Tesseract. It is linked to Proofreadpage (and it works erratically).
Questions: are the bbox coordinates generated by the OCR engine? In that case, what happens if the OCR outputs an incorrect number of lines?
Also, I think you do need a JavaScript hack for the edit box; what happens if the user creates a new line?
Thomas
Gregory Maxwell wrote:
What I'd like to do is leave invisible markup generated by the OCR software in the page text, like this:
<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>The first experiments were made on the absorption of carbonic</span>
<span class='ocr_line' title='bbox 474 4281 2668 4355 1'>acid gas by water: and here a singular disagreement was observed</span>
<span class='ocr_line' title='bbox 471 4360 2668 4433 1'>in the first trials made under exactly the same circumstances. It</span>
From this the OCRed text could be corrected and markup could be added, but I could still take the output and apply it back to the original document. If people feel this would frustrate editing too much, we could make some JavaScript hacks to the edit box to reduce the span tags to nothing more than an immutable <S marker.
Would this be acceptable?
On Jan 21, 2008 11:59 AM, ThomasV thomasV1@gmx.de wrote:
Nice idea. Note that we now have an OCR server running Tesseract. It is linked to Proofreadpage (and it works erratically).
I've found Tesseract alone to be fairly erratic for real documents. Ocropus makes it behave much better.
Questions: are the bbox coordinates generated by the OCR engine?
Yep.
In that case, what happens if the OCR outputs an incorrect number of lines?
You could manually correct the coords, or simply add your text to the nearest line, which would be incorrect but better than no markup at all.
Also, I think you do need a JavaScript hack for the edit box; what happens if the user creates a new line?
The user can do whatever he wants... if the results don't match reality, the DjVu files will act a bit weird. I could easily enough make a bot that will scan documents for body text outside of line-spans and tag the pages for OCR markup improvements.
With the current Ocropus code on these documents I'm unable to find any totally missed lines. While I'm sure they will happen, I wouldn't want to do the imports unless they were rare enough that the inconvenience of dealing with them is not a deal breaker.
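The check itself is nearly trivial; something on the order of this untested sketch (the category name is just a placeholder):

import re

SPAN = re.compile(r"<span class='ocr_line'[^>]*>.*?</span>", re.S)
COMMENT = re.compile(r'<!--.*?-->', re.S)

def untagged_body_text(wikitext):
    # Strip the line spans and any comments; whatever is left should only be
    # headers, templates and whitespace. Real body text here means the OCR
    # markup no longer covers the whole page.
    return COMMENT.sub('', SPAN.sub('', wikitext)).strip()

# a bot run would tag any page where untagged_body_text() returns something
# substantial, e.g. with [[Category:Pages needing OCR markup repair]] (name
# made up for the example)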
On Jan 21, 2008 2:39 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Greetings.
Off and on for many months I've been working on a project to import a large collection of public domain historic scientific documents into Wikimedia's collection.
My standing plan has been to pre-organize and catalog the collection, then upload the document images as DJVU files (which are utterly tiny compared to TIFFs or PDFs) to Commons, including an OCRed text layer (for search and copy-and-paste).
I would then begin importing documents into Wikisource, starting with the OCR but eventually having fully marked-up output. From there the documents could be extensively linked and referenced from the other Wikimedia projects.
(...)
wow, nice!
Just out of curiosity: what languages are these works in?
On Jan 21, 2008 6:06 PM, Luiz Augusto lugusto@gmail.com wrote:
wow, nice!
Just out of curiosity: what languages are these works in?
English. For now. There are opportunities for other languages down the road. (One point on Ocropus: its language modeling is not yet as multilingual as Tesseract's.)
Hello,
I'm just wondering, would it be feasible to convert wiki text (without OCR markup) back into OCR markup? A script might strip or convert markup, diff the original OCR text with the wiki text to determine what goes where, and generate the markup from scratch.
You could thus cleanly convert from OCR markup to wiki markup and back without unreadable OCR markup on the wiki, and this could also be used to provide some other very useful features (I would love to accurately diff an entire Wikisource text with OCR scans of different printed documents, for example).
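For illustration, the core of such a script might look like the untested sketch below; the data shapes are assumed and the line assignment is deliberately naive:

import difflib

def regenerate_markup(ocr_lines, proofread_text):
    # ocr_lines: list of (bbox, text) pairs from the stored OCR output.
    # proofread_text: the corrected text with wiki markup already stripped.
    # Returns the OCR lines with the proofread words reattached to them.
    if not ocr_lines:
        return []
    ocr_words, owner = [], []
    for i, (bbox, text) in enumerate(ocr_lines):
        for w in text.split():
            ocr_words.append(w)
            owner.append(i)
    new_words = proofread_text.split()
    rebuilt = [[] for _ in ocr_lines]
    last = 0
    for tag, a1, a2, b1, b2 in difflib.SequenceMatcher(
            None, ocr_words, new_words).get_opcodes():
        if tag in ('equal', 'replace'):
            for k in range(b2 - b1):
                last = owner[min(a1 + k, a2 - 1)]
                rebuilt[last].append(new_words[b1 + k])
        elif tag == 'insert':
            # a word the proofreader added: attach it to the previous line
            for k in range(b1, b2):
                rebuilt[last].append(new_words[k])
        # 'delete': OCR noise the proofreader removed simply disappears
    return [(bbox, ' '.join(ws)) for (bbox, _), ws in zip(ocr_lines, rebuilt)]

Words the proofreader adds simply land on the previous line, which is where human judgement would still be needed.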
On Jan 21, 2008 6:27 PM, Jesse Martin (Pathoschild) pathoschild@gmail.com wrote:
Hello, I'm just wondering, would it be feasible to convert wiki text (without OCR markup) back into OCR markup? A script might strip or convert markup, diff the original OCR text with the wiki text to determine what goes where, and generate the markup from scratch.
You could thus cleanly convert from OCR markup to wiki markup and back without unreadable OCR markup on the wiki, and this could also be used to provide some other very useful features (I would love to accurately diff an entire Wikisource text with OCR scans of different printed documents, for example).
Getting the edge cases right would be hard to impossible... For example, the OCR reads "the ball bounced and" / "saw it hit the floor" and you fix a missing word: "the ball bounced and I saw it hit the floor". What line is the "I" on in the OCR output?
The nice thing about spans is that they are invisible in the output: you should be able to preserve them while doing all the markup you want. The bad thing is that they are visible while editing (but could be hidden), and they are a pain to fix if the OCR was very wrong.
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
On Jan 21, 2008 7:16 PM, Jesse Martin (Pathoschild) pathoschild@gmail.com wrote:
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
Oh, that gets it most of the way there... but could I still smuggle in the coords? ;) Like:
{{ocr line|551-4202-2666-4278-1|The first experiments were made on the absorption of carbonic}}
I suppose I could also make the coords base 60 or so, so they would be shorter.
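e.g. (off the top of my head, untested):

DIGITS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWX'  # 60 symbols

def enc60(n):
    # encode a non-negative integer in base 60
    s = ''
    while True:
        s = DIGITS[n % 60] + s
        n //= 60
        if n == 0:
            return s

def dec60(s):
    n = 0
    for c in s:
        n = n * 60 + DIGITS.index(c)
    return n

# '551-4202-2666-4278' becomes '9b-1a2-Iq-1bs'; anything under 216000 fits in
# three characters, so the coordinate part shrinks by roughly half.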
Hi Gregory and everyone,
On Jan 22, 2008 11:22 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Jan 21, 2008 7:16 PM, Jesse Martin (Pathoschild) pathoschild@gmail.com wrote:
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
Oh, that gets it most of the way there... but could I still smuggle in the coords? ;) Like:
{{ocr line|551-4202-2666-4278-1|The first experiments were made on the absorption of carbonic}}
I suppose I could also make the coords base 60 or so, so they would be shorter.
I don't understand why the HTML output needs to have the DJVU markers; they could be in the raw text. Would it be acceptable to have one line per printed line, and hidden comments as required? i.e.
---
The first experiments were made on the absorption of carbonic <!-- DJVU position: 551-4202-2666-4278-1 -->
acid gas by water: and here a singular disagreement was observed <!-- DJVU position: ... -->
in the first trials made under exactly the same circumstances. It <!-- DJVU position: ... -->
---
How will words that are broken across two lines be handled?
I understand that these DJVU files will probably have a lot of corrections initially. Are you planning on updating the DJVU file on Commons incrementally, or after the entire DJVU has been proofread?
-- John
Jesse Martin (Pathoschild) wrote:
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
I'm still skeptical about what this will accomplish, but will address that later. The above does not address the treatment of hyphens. When MediaWiki wraps single line breaks it ignores the hyphens that break up a word at the end of the line, and treats the word as though it were two.
Ec
--- Ray Saintonge saintonge@telus.net wrote:
Jesse Martin (Pathoschild) wrote:
That's a good point. How about a much cleaner syntax that can be used to generate the OCR markup? With your example text:
{{ocr line| The first experiments were made on the absorption of carbonic }}
{{ocr line| acid gas by water: and here a singular disagreement was observed }}
{{ocr line| in the first trials made under exactly the same circumstances. It }}
This is much easier to read, you know where the line breaks go, and it's immediately clear even to someone stumbling across the text that we're specifically keeping track of lines (so they don't helpfully remove unneeded line breaks). Since single line breaks are ignored by MediaWiki, we can just use the same line width so the template syntax lines up for easier ignoring.
I'm still skeptical about what this will accomplish, but will address that later. The above does not address the treatment of hyphens. When MediaWiki wraps single line breaks it ignores the hyphens that break up a word at the end of the line, and treats the word as though it were two.
Ec
I agree with Ec here. I think you are trying to do far too much with the same piece of text. Perfectly readable/editable wiki markup and exactly matching OCR text are not possible with the same text. I suggest you find a way to hack having the text exist twice in the proofreading page. Something like below:
<!-- Here is text with OCR breaks and hyphens which matches the printed page-->
Here is the wiki markup text that is transcluded to the WS page
Of course this means both sets of text need to be proofread, but I think a script should be able to highlight all the differences between them, making it simple to proofread one from the other. If you really want to have only one version of the text, the exactness of the OCR will have to be sacrificed. People will always go through the markup "fixing" the hyphens.
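Roughly what I have in mind for the comparison, as an untested sketch only (the markup stripping is oversimplified):

import difflib, re

def highlight_differences(ocr_copy, wiki_copy):
    # Normalize both copies before comparing: rejoin end-of-line hyphens and
    # drop the most common wiki markup, so only real wording differences show.
    def words(text):
        text = re.sub(r'-\n', '', text)                                # join hyphenated breaks
        text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)  # keep link text
        text = re.sub(r"<[^>]+>|'''|''", ' ', text)                    # strip tags, bold, italics
        return text.split()
    return [d for d in difflib.ndiff(words(ocr_copy), words(wiki_copy))
            if d.startswith(('+', '-'))]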
Birgitte SB
On Jan 22, 2008 9:26 AM, Birgitte SB birgitte_sb@yahoo.com wrote:
I agree with Ec here. I think you are trying to do far too much with the same piece of text. Perfectly readable/editable wiki markup and exactly matching OCR text are not possible with the same text. I suggest you find a way to hack having the text exist twice in the proofreading page. Something like below:
Okay. That's a bit outside of the realm of the work I'm interested in doing. I'll just focus on the document images and leave the rest to whoever else is interested.
Cheers
On Jan 23, 2008 2:26 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Jan 22, 2008 9:26 AM, Birgitte SB birgitte_sb@yahoo.com wrote:
I agree with Ec here. I think you are trying to do far too much with the same piece of text. Perfectly readable/editable wiki markup and exactly matching OCR text are not possible with the same text. I suggest you find a way to hack having the text exist twice in the proofreading page. Something like below:
Okay. That's a bit outside of the realm of the work I'm interested in doing. I'll just focus on the document images and leave the rest to whoever else is interested.
The hyphen is a difficult problem, but it doesn't need to be a deal breaker.
Bidirectional djvu/pdf <-> wiki conversion should be our goal if we are to migrate to having all content backed by images (which is a strict policy on the German Wikisource project), but there are a few hurdles.
The two big ones are:
1. hyphens
This may be easily solved by replacing hyphens that appear at the end of a line with a soft hyphen in the initial OCR output; any places where a hard hyphen or non-breaking hyphen is required will be fixed in proofreading. Most browsers handle this correctly by simply discarding the soft hyphen, and after years of waiting Firefox 3 should render this correctly < https://bugzilla.mozilla.org/show_bug.cgi?id=9101 >.
Unless there are some surprises in MediaWiki's handling of soft hyphens, the wikitext would look like the following (in this example, &shy; could also be written as the raw Unicode soft-hyphen character):
<!-- any OCR sync information -->line one with an extra&shy;<!-- any OCR sync information -->ordinary-sized word that flows onto line two
<!-- any OCR sync information -->and here is line three.
i.e. original lines one and two would need to be on a single wikitext line in order to prevent a newline from being emitted in the HTML, which would invalidate the soft hyphen.
If we wanted to get fancy, the devs could enhance MediaWiki so that a line ending with a Unicode soft hyphen is not followed by a newline character in the HTML. I can't see any drawback in doing that, except for the overhead; if this is significant, it could be done in a Wikisource-only extension.
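For completeness, the substitution on the OCR output side is tiny; an untested sketch, assuming the raw OCR keeps one printed line per line of text:

import re

SOFT_HYPHEN = '\u00ad'   # the character behind &shy;

def soften_line_end_hyphens(ocr_page_text):
    # "extra-" at the end of one line followed by "ordinary" on the next
    # becomes "extra\u00adordinary" on a single line; other newlines are kept.
    return re.sub(r'-\n(?=\S)', SOFT_HYPHEN, ocr_page_text)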
2. wiki markup that isn't in the original
This could be simply ignored by mandating that we don't add additional markup until after the text has been proof-read, and the changes have been fed back into the DJVU file. Any improvement on that position depends on improvements in the wikitext -> DJVU process.
We keep a very close eye on Recentchanges and have revision patrolling for changes by non-admins, so any changes that may affect the ability to slurp improvements back into the DJVU can be managed.
-- John
I might be misunderstanding what is being asked, but could someone explain to me why the span tags with the OCR block information need to be permanent? Would it suffice to have the span tags, proofread the OCR'd text till it perfectly matches the scans, feed it back into the DJVU file, and then remove all the span tags to have a clean wikitext?
I would imagine once the proofed text becomes the text layer of the DJVU file, that would be the last time we would have to even touch the text anyway, so there would be no more modifications we would need to make to either the DJVU or the wikitext at all. At that point we could make the text 100% clean.
Zhaladshar
Ryan Dabler wrote:
I might be misunderstanding what is being asked, but could someone explain to me why the span tags with the OCR block information need to be permanent? Would it suffice to have the span tags, proofread the OCR'd text till it perfectly matches the scans, feed it back into the DJVU file, and then remove all the span tags to have a clean wikitext?
I would imagine once the proofed text becomes the text layer of the DJVU file, that would be the last time we would have to even touch the text anyway, so there would be no more modifications we would need to make to either the DJVU or the wikitext at all. At that point we could make the text 100% clean.
We would not want to change the text, but we might want to change some of it into wikilinks.
Ec
John Vandenberg wrote:
2. wiki markup that isn't in the original
This could be simply ignored by mandating that we don't add additional markup until after the text has been proof-read, and the changes have been fed back into the DJVU file. Any improvement on that position depends on improvements in the wikitext -> DJVU process.
That means we could be waiting a long time.
Ec
Birgitte SB wrote:
I think you are trying to do far too much with the same piece of text. Perfectly readable/editable wiki markup and exactly matching OCR text are not possible with the same text. I suggest you find a way to hack having the text exist twice in the proofreading page. Something like below:
<!-- Here is text with OCR breaks and hyphens which matches the printed page-->
Here is the wiki markup text that is transcluded to the WS page
Of course this means both sets of text need to be proofread, but I think a script should be able to highlight all the differences between them, making it simple to proofread one from the other. If you really want to have only one version of the text, the exactness of the OCR will have to be sacrificed. People will always go through the markup "fixing" the hyphens.
The whole proposal seems to come into the realm of biting off more than we can chew. I can give ThomasV's approach of having all material backed up by page scans full marks for what it sets out to do, but that still doesn't change the fact that some editors still find it more convenient to sub-optimally upload entire books from Project Gutenberg, with little more additional effort than breaking off chapters into separate pages and adding headers. Unless we can get real people to do tedious but relatively non-technical tasks such as proofreading, how can we ever convince them to remain consistent with technical tasks whose benefits are far from obvious?
Eighteenth-century scientific texts may have done well with only a single printing, but more popular works that had multiple editions present a challenge unless we can declare a particular printing to be canonical. The best printing for this may not be easily or cheaply available. As an example, I have an almost complete set of the Ticknor and Fields version of the works of Thomas De Quincey. In the course of putting this together I ended up with apparently duplicate volumes. In the case of the second volume of the "Theological Essays" I have both an 1854 and an 1864 printing. The 1854 edition goes to page 276 and the 1864 edition to page 315. The later edition adds an essay missing from the earlier.
The first three lines of page 71 of the 1864 printing, from "Toilette of the Hebrew Lady" and ending a paragraph, read:
"the precious stones; and at other times, the pearls were strung two and two, and their beautiful white- ness relieved by the interposition of red coral."
In the 1854 printing the same text appears as lines 27-9 of page 69, except that "whiteness" now appears fully on the middle line without hyphenation. Footnotes that were at the end of an essay in 1854 are moved to the proper page in 1864.
At one time, if a second printing was needed, it was easier and cheaper to reset the type, with all the attendant errors that one might imagine. Labour was cheap, and manufactured type very expensive.
Ec
Gregory Maxwell wrote:
I'd really like it if the corrected text in Wikisource could be imported back into the djvu document images.
Some thoughts:
1. The easy way to do OCR is not to do OCR. If you download books scanned by the Internet Archive / Open Content Alliance, they are already OCRed. Both images and raw OCR text are contained in the djvu files. I think IA uses OCR technology from H-P that isn't open sourced.
2. It is nice to have pixel coordinates for each word or line of text, but this requires that the image is kept unchanged. If the scanned image is uploaded to Wikimedia Commons, some helpful user might touch it up, deskew it, improve the contrast and upload a new version, after which all pixel coordinates might be ruined.
3. As you mentioned, there are now some open sourced OCR engines. I haven't tried them, but I assume they will improve and become useful. The traditional use for OCR is to read an image and output raw text, but proofreading has traditionally been a one-person process with very limited feedback. When collaborative proofreading (as in PGDP.net or Wikisource) is combined with open sourced OCR software, we have a new potential feedback loop. Instead of finding the words in an image, we would need a routine that takes a scanned image and an already proofread text, and tries to find the pixel coordinates for these words. If that sort of software existed, we wouldn't need to preserve coordinates during proofreading, because we could reconstruct them afterwards. This might be a suitable Summer of Code project for the right person, who is already familiar with the OCR software.
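Sketched very roughly, such a routine could lean on the OCR engine for the geometry and take only the words from the proofread text, much like the diffing already discussed in this thread. An untested sketch (the word/box input format is just an assumption about what the engine provides):

import difflib

def assign_coordinates(ocr_words_with_boxes, proofread_text):
    # ocr_words_with_boxes: [(word, (x1, y1, x2, y2)), ...] from a fresh OCR
    # pass over the image, in reading order. The words may be partly wrong;
    # only their boxes matter. Returns [(proofread_word, box or None), ...].
    ocr_words = [w for w, box in ocr_words_with_boxes]
    new_words = proofread_text.split()
    matched = []
    m = difflib.SequenceMatcher(None, ocr_words, new_words)
    for tag, a1, a2, b1, b2 in m.get_opcodes():
        if tag in ('equal', 'replace'):
            for k in range(b2 - b1):
                box = ocr_words_with_boxes[min(a1 + k, a2 - 1)][1]
                matched.append((new_words[b1 + k], box))
        elif tag == 'insert':
            for k in range(b1, b2):
                matched.append((new_words[k], None))  # no geometry recoverable
        # 'delete': OCR noise that has no counterpart in the proofread text
    return matched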
On Jan 27, 2008 9:55 AM, Lars Aronsson lars@aronsson.se wrote:
Gregory Maxwell wrote:
I'd really like it if the corrected text in Wikisource could be imported back into the djvu document images.
Some thoughts:
1. The easy way to do OCR is not to do OCR. If you download books scanned by the Internet Archive / Open Content Alliance, they are already OCRed. Both images and raw OCR text are contained in the djvu files. I think IA uses OCR technology from H-P that isn't open sourced.
The software that Gregory Maxwell is planning to use is the same that is used for Google Books. It is open source.
2. It is nice to have pixel coordinates for each word or line of text, but this requires that the image is kept unchanged. If the scanned image is uploaded to Wikimedia Commons, some helpful user might touch it up, deskew it, improve the contrast and upload a new version, after which all pixel coordinates might be ruined.
The page scans will be uploaded to Commons as DJVU files, which are huge, and we don't really want regular updates to them.
I think the way this would be handled is as separate images.
e.g. If I cleaned up page 1 of [[Image:35 Sonnets by Fernando Pessoa.djvu]], I would save it as [[Image:35 Sonnets by Fernando Pessoa.djvu-page1.jpg]], and update [[Index:35 Sonnets]] to use the standalone image instead of the DJVU. Once completed, any standalone images would then be used to rebuild the DJVU file.
3. As you mentioned, there are now some open sourced OCR engines. I haven't tried them, but I assume they will improve and become useful. The traditional use for OCR is to read an image and output raw text, but proofreading has traditionally been a one-person process with very limited feedback. When collaborative proofreading (as in PGDP.net or Wikisource) is combined with open sourced OCR software, we have a new potential feedback loop. Instead of finding the words in an image, we would need a routine that takes a scanned image and an already proofread text, and tries to find the pixel coordinates for these words. If that sort of software existed, we wouldn't need to preserve coordinates during proofreading, because we could reconstruct them afterwards. This might be a suitable Summer of Code project for the right person, who is already familiar with the OCR software.
Finding the location of a given text on an image is a novel idea. It is an interesting project that might even be suitable as a research project for a post-grad.
If I understand correctly, you are suggesting that Greg uploads the DJVU without a text layer, we all use whatever means we have available to create the text, and then we feed the proofread text into the DJVU once complete (using vaporware software? :-) ). This has the distinct advantage of allowing the images to be improved as well as the text. We may even be able to push an improved DJVU file onto the Commons front page as a featured image.
As crazy as it sounds, it is quite sane. OCR software will improve over the next year, and we want to be taking advantage of those improvements as we progress through the volumes. ThomasV has already set up the framework for user requested bot scans; we may need to extend this to handle different configurations to suit each DJVU file, so that the OCR software can "learn" as it progresses.
Also, distributing the low quality OCR text in the DJVU file initially would result in many potential contributors not joining the ranks because the OCR text is "good enough". By not including the OCR text, we will encourage people to work with Wikisource to finish each volume.
-- John