Wikisource-l November 2011

wikisource-l@lists.wikimedia.org

6 participants
7 discussions

Re: [Wikisource-l] [cultural-partners] ABBYY Finereader 11 on Toolserver: do we like it?

by Andrea Zanni

Wow, thank you all for the quick responses. I'll try to reply in-line.

2011/11/28 Mathias Schindler <mathias.schindler(a)gmail.com>

> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/

This is true, and this would be the optimal way, but apparently it failed. I don't know why the OCR button is not running anymore; it seems to me that when ThomasV left, things were not updated or something like that. From my experience (I have used these programs for professional projects), the quality and the usability of the software are very different. Of course, having Tesseract is better than having nothing.

2011/11/28 Lars Aronsson <mathias.schindler(a)gmail.com>

> I think this is what the Internet Archive uses, as well as
> several European libraries. We could look into establishing
> a cooperation with the Internet Archive or perhaps with
> Europeana in this area. Maybe the Internet Archive can
> open up an API for OCR-ing a single page at a time?

This would be awesome :-) I don't have a clue about the technicalities here; if you want to ask them, be my guest :-)

@Tomasz: I think that Federico has a point in the approach he suggested. In fact, I'm wondering how IA got its license; we should ask them. Do we have any contact with the Internet Archive? I know we could use IA directly for uploading PDFs (we do it already for getting DjVus), but that is still not the most usable way to work with institutions or ordinary users...

Aubrey

12 years, 4 months

Re: [Wikisource-l] [cultural-partners] ABBYY Finereader 11 on Toolserver: do we like it?

by Lars Aronsson

On 11/28/2011 01:59 PM, Mathias Schindler wrote:

> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/

Do you recommend this based on experience, or based on free software ideology? Apparently the Internet Archive tried and gave up, because Finereader was far better. Are there any good examples where free software has been used for good OCR quality?

Wikisource does provide feedback on quality: After OCR, when a page has been proofread, the OCR software could learn from the diff. But is there any OCR software that can take this kind of input?

When running OCR as an engine/server/API, what do we do when it misinterprets columns in a page, and reads long lines across the page? Is there a way to manually indicate where columns are, and resubmit the page for new OCR?

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
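
The "learn from the diff" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `correction_pairs` and the sample strings are hypothetical, and nothing here implies that Finereader, Tesseract or OCRopus accepts such pairs as training input. It simply shows how the word-level differences between raw OCR text and the proofread text could be collected with Python's standard `difflib`.

```python
import difflib


def correction_pairs(ocr_text, proofread_text):
    """Return (ocr_fragment, corrected_fragment) pairs for each word-level change.

    Hypothetical helper: compares the raw OCR output with the proofread
    text and keeps only the spans that the proofreaders changed.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Either side may be empty (pure insertion or pure deletion).
            pairs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return pairs


if __name__ == "__main__":
    # Made-up sample strings, not taken from any real Wikisource page.
    for wrong, right in correction_pairs("Tbe quick brovvn fox", "The quick brown fox"):
        print(repr(wrong), "->", repr(right))
```

Whether an OCR engine could make use of such pairs is exactly the open question raised in the message; the sketch only covers the extraction step.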

12 years, 4 months

Re: [Wikisource-l] [cultural-partners] ABBYY Finereader 11 on Toolserver: do we like it?

by Andrea Zanni

2011/11/28 Gerard Meijssen <gerard.meijssen(a)gmail.com>

> Hoi,
> What scripts does it support ?
> Thanks,
> GerardM

Hi Gerard, could you be more explicit? I don't understand the question.

Aubrey

12 years, 4 months

ABBYY Finereader 11 on Toolserver: do we like it?

by Andrea Zanni

Dear all,

for a long time I have been thinking about a project that could help Wikisource (and some GLAMs too). The idea is simply to install ABBYY Finereader 11 on toolserver, as a tool for all Wikisource users. For those who don't know, ABBYY Finereader is OCR software: it is proprietary and fairly expensive, but it is accurate and works really, really well. Plus, version 11 can save files as DjVu.

Now, in my mind, having such software on toolserver would let us:

- restore our beloved OCR button, with a much more accurate OCR
- use Finereader to transform PDF/TIFF/JPG files from Commons directly into OCRed DjVus
- do other things I haven't thought of yet

There are many issues too:

- Cost: I don't know how much this could cost. Many WM chapters do give money to toolserver, and the status of the thing is a bit fuzzy at the moment, but, for example, Wikimedia Italy has set aside 5000 euros for the toolserver, and maybe we can use that money for the license (I'm on the WMI Board, and I've asked; they say it's OK);
- Technical: afaik, toolserver runs Solaris, and apparently Finereader is Windows only (I think we can solve this easily if we want, though)
- Ethical: this is proprietary software, and I don't know if we *want* to use it on Wikimedia projects...
- Resources: I think this is probably the main issue: we need skilled people to set this up technically, and at least one toolserver operator (Phe, maybe?)

Below is the mail I sent to ABBYY Europe, to see if the thing was feasible. They simply replied that they want a phone call. Of course, if the thing turns out to be too expensive the project collapses immediately, but I think it's worth discussing. If nobody wants it, I can drop it right now.

Please forward this mail to everyone possibly interested; I don't think it's a good idea to scatter discussions across every ws Village Pump.

Cheers
Aubrey

*From:* Andrea Zanni <andrea.zanni(a)wikimedia.it>
*Sent:* Friday, November 25, 2011 10:20 AM
*To:* support_eu(a)abbyy.com
*Subject:* Questions about server licenses

Dear ABBYY Europe,

my name is Andrea Zanni, and I'm a Board member of Wikimedia Italy, the Italian chapter of the Wikimedia movement. We are a non-profit association which promotes and sustains Wikimedia projects, such as the online encyclopedia Wikipedia.

I'm writing to you because I'm interested in knowing about "server licences" for your new Finereader 11. As far as I know, your product saves files as DjVu, and this is an interesting feature that could help some of our projects.

Maybe you know Wikisource, a multilingual digital library in which the community uploads, transcribes and proofreads books. This is the English version (http://en.wikisource.org/wiki/Main_Page). On each page of each book (which are uploaded as DjVu), we have a little "OCR" button which used to call a tesseract bot and OCR the page. Right now, the bot doesn't work for lack of maintenance.

My idea would be to replace tesseract with Finereader, and also have the possibility to use other features, such as taking a PDF/JPEG file and saving it as an OCRed DjVu, or choosing the language of the OCR from project to project.

Now, I do not have an estimate of how much this engine could be used (I understand this is a crucial factor for the price of a server license). I would count a few hundred pages OCRed per day (maybe more, if this thing works), and a few dozen file conversions (any format to DjVu) per day.

So, my questions are:

- do you have a rough idea how much this license would cost?
- do you know if it is possible to run FR11 on an OS other than Windows (we actually run Solaris)?
- do you know if it is possible to have all these features via an API or something?

Thank you for your time,
regards

Andrea Zanni
--
Wikimedia Italia Board
Sostieni la cultura, dona a Wikimedia Italia. http://sostienilacultura.it

12 years, 4 months

Scientia (babelian-wikisourcian interproject)

by Andrea Zanni

Dear all,

I'm pleased to announce that the Italian Wikisource has started a collaboration with AlmaDL, the digital library of the University of Bologna ( http://amshistorica.cib.unibo.it). The project, called Wikiproject Scientia ( http://it.wikisource.org/wiki/Progetto:Scientia), aims to bring to the Wikisource(s) around 40 issues of "Scientia", a scientific journal published at the beginning of the century.

The journal was published in 4 different languages, and includes (original) articles from scientists from all around the world, such as G. Peano, Enrico Fermi, Bertrand Russell, E. Rutherford, H. Lorentz, Sigmund Freud, Henri Poincaré, Ernst Mach, Albert Einstein, Werner Heisenberg, Rudolf Carnap, Otto Neurath, and many, many more.

The whole journal has been published here:

http://amshistorica.cib.unibo.it/6
http://amshistorica.cib.unibo.it/7

but we are still uploading the bundled djvus on Commons (and doing the OCR with ABBYY). We have completed and formatted one issue ( http://it.wikisource.org/wiki/Indice:Rivista_di_Scienza_-_Vol._I.djvu) to get an idea of the complexity of the work, and it is definitely complex. You can take a look here:

http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._VII.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._VIII.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._IX.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._X.djvu

The biggest challenge is to set up the transclusion of the articles in all the interested Wikisources, because every issue contains articles in at least 3 languages (sometimes 4: French is the most used language of the journal). So, this mail is to inform all the wikimedians potentially interested in participating and in helping us set up books and indexes in the respective wikisources.

Thank you,
Aubrey

NB: I work for AlmaDL, so I "persuaded" my boss to try releasing some djvus and see what the community of Wikisource could do. This does not mean I am a full time "Wikisourcian in residence", but I certainly use some of my work time for this and I can definitely help with original scans, metadata and even the OCR with ABBYY Finereader. I soon discovered that the project is much bigger than I expected, especially because of the multilingual issue.

12 years, 4 months

Wikiguides with subs on Youtube

by Andrea Zanni

[sorry for cross-posting]

Hi all,

I just wanted to announce that the wikiguides produced by Wikimedia Italy are now on Youtube with subs in different languages. If you are interested in translating them, just contact me. The videos have proven to be a good introduction to the wiki world (a fourth video, on Wikiquote, is in production), and (at least on Wikisource) we have seen a significant improvement in access and use of the website. Hope you will enjoy them too.

*Wikisource* http://www.youtube.com/watch?v=cR0g5ACaC-g
*Wikipedia* http://www.youtube.com/watch?v=qoLBZ7_vY-k
*Commons* http://www.youtube.com/watch?v=AOTlhuokVDs

Aubrey

12 years, 5 months

Persisting wrong thumbs of djvu pages in view mode

by Alex Brollo

We fixed a djvu file by adding some missing pages. Now something strange appears. Take a look at http://it.wikisource.org/wiki/Pagina:Delle_strade_ferrate_e_della_loro_futu…. Then enter edit mode. The thumb shown in view mode comes from the old, wrong page; the thumb shown in edit mode is the right one. Can we force purging of such an old thumb/image, presumably saved somewhere from the old djvu file?

Alex
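
One thing that may be worth trying for the stale-thumb question above is a purge through the MediaWiki API (`action=purge`). The sketch below only builds the request and does not send it. The helper name `build_purge_request` and the title `File:Example.djvu` are hypothetical placeholders, and it is an assumption, not something confirmed in this thread, that purging the file's description page on the wiki hosting the file is enough to regenerate the cached thumbnails.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_purge_request(api_url, titles):
    """Build (but do not send) a POST request for MediaWiki's action=purge.

    Hypothetical helper: api_url is the wiki's api.php endpoint and
    titles is a list of page titles, joined with "|" as the API expects.
    """
    params = {
        "action": "purge",
        "titles": "|".join(titles),
        "forcelinkupdate": "1",  # also refresh link tables for the purged pages
        "format": "json",
    }
    data = urlencode(params).encode("utf-8")
    return Request(api_url, data=data, method="POST")


if __name__ == "__main__":
    # Placeholder title; substitute the real DjVu file or Pagina: title.
    req = build_purge_request(
        "https://commons.wikimedia.org/w/api.php", ["File:Example.djvu"]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

If the purge alone does not help, the old thumbnail may be cached further downstream, which this sketch does not address.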

12 years, 5 months
