Lars Aronsson wrote:
ThomasV wrote:
The problem is that djvu pages on Commons do not have a parsable format.
Many of them do. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
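To illustrate how parsable that template is, here is a minimal Python sketch that splits a single {{Information}} block into its fields. The function name parse_template is hypothetical, and the parser assumes one well-formed template with named parameters only; it is not how Commons or the ProofreadPage extension reads these pages.

```python
def parse_template(wikitext):
    """Return (template_name, {field: value}) for one template call.

    Assumption: the input is a single well-formed {{...}} block whose
    parameters are all of the form name=value.
    """
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template")
    body = body[2:-2]

    # Split on '|' only at the top level, so pipes inside [[links]] or
    # nested {{templates}} do not break a field apart.
    parts, current, depth = [], [], 0
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # skip unnamed parameters in this sketch
            fields[key.strip()] = value.strip()
    return name, fields
```

Calling parse_template on the block quoted above gives the name "Information" and a dictionary with the keys Description, Source, Date, Author, Permission and other_versions, which are the values a bot could reuse when creating the Index page.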
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
Not to my knowledge. It is possible to request the text with the API; why don't you try it?
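A rough sketch of requesting the description page text through the MediaWiki API, using only the Python standard library. The helper names raw_text_url and extract_wikitext are hypothetical. The response layout handled here is the classic format=json shape, where the wikitext sits under the "*" key; that is an assumption about the wiki's API version, since newer MediaWiki releases prefer the rvslots parameter and return a different structure.

```python
from urllib.parse import urlencode


def raw_text_url(site, title):
    """Build an api.php URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return "https://%s/w/api.php?%s" % (site, urlencode(params))


def extract_wikitext(response):
    """Pull the wikitext out of a decoded JSON response.

    Assumption: classic layout, query -> pages -> <pageid> -> revisions,
    with the text stored under the "*" key of the first revision.
    """
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]
```

The URL can be fetched with urllib.request.urlopen and decoded with json.load; the resulting dictionary goes to extract_wikitext, and the returned wikitext can then be searched for the {{Information}} block.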
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding djvu or pdf. All you need to do is create djvu files with a proper text layer.
You are correct, it does indeed work, but only after I run action=purge on the PDF file on Commons. It never worked for me on the first try, without any purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
Purge is necessary only for files that were uploaded previously, when text extraction was not performed. Note that text layer extraction for pdf files is new.
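For files uploaded before text extraction existed, the purge could be scripted rather than done by hand. The following is a minimal sketch that only builds the request; purge_request is a hypothetical helper name, and it assumes the API accepts action=purge for this title without extra authentication, which may not hold on every wiki.

```python
from urllib.parse import urlencode
from urllib.request import Request


def purge_request(site, title):
    """Build a POST request asking api.php to purge one page."""
    data = urlencode({
        "action": "purge",
        "titles": title,
        "format": "json",
    }).encode("ascii")
    # Passing data makes urllib send the request as a POST.
    return Request("https://%s/w/api.php" % site, data=data)
```

Passing the returned object to urllib.request.urlopen would send the purge; after that, creating the page on Wikisource should pick up the text layer as described above.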
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF contains 8-bit data, which is loaded into the UTF-8 form without conversion. Cut-and-paste from Acrobat Reader works fine.
Yes, there is a conversion problem with pdf; it works better with djvu.
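The symptom described above (8-bit data treated as UTF-8) can be reproduced in a few lines of Python. This is only an illustration of the failure mode: the actual encoding of the PDF text layer is not known from this thread, so the Latin-1 fallback and the function name decode_text_layer are assumptions.

```python
def decode_text_layer(raw, fallback="latin-1"):
    """Decode extracted bytes as UTF-8, falling back to an 8-bit encoding.

    Assumption: text that is not valid UTF-8 is Latin-1. A real fix
    would need to know which encoding the PDF extractor emits.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# "Mjölqvarn" stored as 8-bit Latin-1: the ö is the single byte 0xF6.
raw = b"Mj\xf6lqvarn"

# Forcing the bytes through UTF-8 without conversion loses the ö.
broken = raw.decode("utf-8", errors="replace")

# Converting from the 8-bit encoding first keeps it intact.
fixed = decode_text_layer(raw)
```

Here broken contains a replacement character where the ö should be, while fixed is the expected "Mjölqvarn". Text that is already valid UTF-8 passes through decode_text_layer unchanged.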