As you know, many of us use archive.org to OCR our books: https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive For a long time they were stuck with FineReader 8.0. I've just noticed that the latest OCR runs use 9.0, which supports 5 more languages and has 2 more dictionaries: http://www.abbyy.com/support/finereader_90_ts/RecognitionLanguages/ http://www.abbyy.com/support/finereader_80_ts/RecognitionLanguages/
I think it's worth redoing the OCR on any archive.org DjVu you're using (and you definitely should if the book is in one of those languages). I'm a (limited) admin there, so feel free to leave lists of items whose OCR needs updating on my talk page: https://wikisource.org/wiki/User_talk:Nemo_bis
Nemo
Hi Nemo, that's great news.
I wonder, though, how worthwhile it would be to redo the OCR on the archive.org DjVu, as the new version will be on archive.org but not on Commons... Do you mean that we would need to re-upload the DjVu to Commons?
BTW, I think it's past time that Archive.org and Wikimedia start a real partnership/collaboration. With Micru, some months ago, we tried to draft a possible model: https://docs.google.com/file/d/0B1PNcNlN2oqvajVfOEFuM29sbzg/edit?usp=sharing
But I think the discussion died (as did many others). One of the things we could do is a project similar to this: https://www.mediawiki.org/wiki/Possible_projects#Google_Books_.3E_Internet_A...
Aubrey
On Tue, Oct 1, 2013 at 6:25 PM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
As you know, many of us use archive.org to OCR their books: https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive
For a while, they've been stuck with FineReader 8.0. I've just noticed the last OCR processes use 9.0, which has 5 more languages and 2 more dictionaries: http://www.abbyy.com/support/finereader_90_ts/RecognitionLanguages/ http://www.abbyy.com/support/finereader_80_ts/RecognitionLanguages/
I think it's worth re-doing OCR on any archive.org DjVu you're using (and you definitely must do so if it's one of those languages). I'm a (limited) admin there, so feel free to give me on my talk lists of items where to update OCR: https://wikisource.org/wiki/User_talk:Nemo_bis
Nemo
_______________________________________________ Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Andrea Zanni, 03/10/2013 13:19:
Hi Nemo, that's great news.
I wonder, though, how worthwhile it would be to redo the OCR on the archive.org DjVu, as the new version will be on archive.org but not on Commons... Do you mean that we would need to re-upload the DjVu to Commons?
Of course, it will need to be reuploaded. Reuploading is cheap, while correcting OCR errors consumes precious volunteer time.
Nemo
On Thu, Oct 3, 2013 at 4:41 PM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Of course, it will need to be reuploaded. Reuploading is cheap, while correcting OCR errors consumes precious volunteer time.
OK, so now my point is: could we automate the re-upload of DjVu files with better OCR? We could do that automatically for all the DjVu files on Commons that aren't transcluded on a Wikisource, for example.
Moreover, I wonder whether changing the DjVu can modify anything on Wikisource: if transcription and proofreading of a book have already started, nothing will change for those pages, right? The only pages that will change are the ones that haven't been touched yet. Am I right?
Aubrey
Andrea Zanni, 03/10/2013 18:53:
On Thu, Oct 3, 2013 at 4:41 PM, Federico Leva (Nemo) wrote:
Of course, it will need to be reuploaded. Reuploading is cheap, while correcting OCR errors consumes precious volunteer time.
OK, so now my point is: could we automate the re-upload of DjVu files with better OCR?
If there's demand for many re-uploads...
We could do that automatically for all the DjVu files on Commons that aren't transcluded on a Wikisource, for example.
Moreover, I wonder whether changing the DjVu can modify anything on Wikisource: if transcription and proofreading of a book have already started, nothing will change for those pages, right? The only pages that will change are the ones that haven't been touched yet. Am I right?
Yes. That's why I asked you to tell me which books you're interested in: I can't guess. I could maybe run some complex queries, but personally I'm not going to.
Nemo
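To make the point above concrete, here is a minimal sketch in Python. It assumes that a page which already exists in the Page namespace keeps its wikitext, and that only pages not yet created would pick up the new OCR layer. The function name and its inputs are hypothetical, and the lookup of existing pages on the wiki is not shown.

```python
from typing import List, Set


def pages_affected_by_reupload(total_pages: int, existing_pages: Set[int]) -> List[int]:
    """Return the page numbers whose prefilled text would come from the new OCR.

    total_pages: number of pages in the DjVu file (assumed unchanged by the re-OCR).
    existing_pages: page numbers that already exist as wiki pages (hypothetical input).
    """
    return [n for n in range(1, total_pages + 1) if n not in existing_pages]


# Example: a 6-page book where pages 1, 2 and 5 have already been transcribed.
affected = pages_affected_by_reupload(6, {1, 2, 5})
print(affected)
```

If the re-OCR'd file had a different page count or order, this simple picture would no longer hold, so that is worth checking before re-uploading.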
Dispenser kindly made a list of DjVu files on Commons linking an IA item, with some information like global usage: https://toolserver.org/~dispenser/temp/djvu2archive.org.txt (just change the extension to csv to open it as a spreadsheet, tab-separated). It's about 5000 books with 6-200 global usages and 5000 outside that range (which probably means they are completely unused apart from some talk pages or the like, or that most of their text already lives on wiki pages). If I manage to convince a "slash-admin", I'll get those 5000 re-OCR'd; otherwise I'll need to do it manually, so suggestions on priorities are welcome. :)
Nemo
P.S.: The query used is at http://pastebin.com/L4EXDY5F and another one is at http://pastebin.com/avg3LYti
-- 2> /dev/null; date; echo '
SELECT /*SLOW_OK*/
  CONCAT("[[:File:", REPLACE(img_name, "_", " "), "]]") AS "File",
  el_from,
  img_size,
  REPLACE(img_user_text, "_", " ") AS "Uploader",
  img_timestamp AS "Timestamp",
  el_to,
  (SELECT COUNT(*) FROM globalimagelinks WHERE gil_to=page_title) AS "Usage",
  (SELECT COUNT(*) FROM oldimage WHERE oi_name=page_title) AS "Reuploads",
  SUBSTRING_INDEX(el_to,"/",-1) AS "Archive_Name"
FROM image
JOIN page ON page_namespace=6 AND page_title=CONVERT(img_name USING latin1)
JOIN externallinks ON el_from=page_id
WHERE img_media_type="BITMAP"
  AND img_major_mime="image"
  AND img_minor_mime="vnd.djvu"
  AND (el_index LIKE "http://org.archive.%/details/_%"
    OR el_index LIKE "https://org.archive.%/details/_%")
;-- ' | sql -r commonswiki_p > ~/public_html/temp/djvu2archive.org.txt; date;
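For anyone who wants to pick priorities from the list, a rough Python sketch follows. It assumes the file is tab-separated with a header row using the column aliases from the query ("File", "Usage", "el_to"); that layout is an assumption, as is the inline sample data, which is entirely made up. It keeps the rows whose usage falls in the 6-200 range and derives the IA identifier as the last path segment of the link, mirroring SUBSTRING_INDEX(el_to,"/",-1).

```python
import csv
import io

# Hypothetical sample in the assumed layout (header row, tab-separated fields).
SAMPLE = (
    "File\tUsage\tel_to\n"
    "[[:File:Example A.djvu]]\t3\thttps://archive.org/details/examplea00test\n"
    "[[:File:Example B.djvu]]\t150\thttps://archive.org/details/exampleb00test\n"
    "[[:File:Example C.djvu]]\t42\thttp://www.archive.org/details/examplec00test\n"
)


def archive_name(url):
    """Last path segment of the link, like SUBSTRING_INDEX(el_to, "/", -1)."""
    return url.rsplit("/", 1)[-1]


def select_by_usage(tsv_text, low=6, high=200):
    """Return (IA identifier, usage) pairs for rows with low <= Usage <= high."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        (archive_name(row["el_to"]), int(row["Usage"]))
        for row in reader
        if low <= int(row["Usage"]) <= high
    ]


print(select_by_usage(SAMPLE))
```

The thresholds are just the range mentioned above; how to order the selected items is left open, since the thread has not settled on priorities.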
Federico Leva (Nemo), 11/10/2013 08:48:
Dispenser kindly made a list of DjVu files on Commons linking an IA item, with some information like global usage: https://toolserver.org/~dispenser/temp/djvu2archive.org.txt (just change the extension to csv to open it as a spreadsheet, tab-separated). It's about 5000 books with 6-200 global usages and 5000 outside that range (which probably means they are completely unused apart from some talk pages or the like, or that most of their text already lives on wiki pages). If I manage to convince a "slash-admin", I'll get those 5000 re-OCR'd; otherwise I'll need to do it manually, so suggestions on priorities are welcome. :)
Jeff at the Internet Archive tells me they haven't tested the new OCR extensively yet, so they won't re-OCR en masse for now. I'll select a few test cases, re-upload them as different items and see what difference the new OCR makes; I could use some help comparing the results for non-Romance languages, though... I'll also try some books in the newly supported languages: Hebrew and Thai (now with dictionary), Chinese (traditional and simplified) and Japanese.
Nemo
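One possible way to compare the old and new OCR of a test page is sketched below in Python with the standard difflib module. It assumes the two text layers have already been extracted to plain strings (the extraction step is not shown), and the function names and sample strings are hypothetical. A similarity ratio only says how much changed, not which version is better, so a human reader is still needed to judge the result.

```python
import difflib


def ocr_similarity(old_text, new_text):
    """Similarity ratio between two OCR outputs, from 0.0 to 1.0 (1.0 = identical)."""
    return difflib.SequenceMatcher(None, old_text, new_text).ratio()


def changed_lines(old_text, new_text):
    """Lines removed from the old OCR ('-') or added by the new OCR ('+')."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            "old OCR",
            "new OCR",
            lineterm="",
        )
    )
    # The first two entries are the file headers; skip them, then keep only
    # the added and removed lines (hunk markers start with '@').
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Made-up sample page text, for illustration only.
old_page = "Tlie quick brown fox\njumps over the lazy dog"
new_page = "The quick brown fox\njumps over the lazy dog"

print(round(ocr_similarity(old_page, new_page), 3))
for line in changed_lines(old_page, new_page):
    print(line)
```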