ThomasV wrote:
the problem is that djvu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information |Description=Swedish patent 14: Mjölqvarn |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] |Date=January 23, 1885 |Author=R. Setz, J. Schweiter, Clus, Switzerland |Permission= |other_versions= }}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
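Just to illustrate, a few lines of Python (an untested sketch, assuming the mwparserfromhell library) are enough to pull those fields out of the description page:

    import mwparserfromhell

    def information_fields(wikitext):
        """Return the parameters of the {{Information}} template as a dict."""
        code = mwparserfromhell.parse(wikitext)
        for template in code.filter_templates():
            if str(template.name).strip().lower() == 'information':
                return {str(p.name).strip(): str(p.value).strip()
                        for p in template.params}
        return {}

    # With the description page quoted above:
    fields = information_fields(
        "{{Information |Description=Swedish patent 14: Mjölqvarn "
        "|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] "
        "|Date=January 23, 1885 |Author=R. Setz, J. Schweiter, Clus, Switzerland "
        "|Permission= |other_versions= }}")
    print(fields['Description'])   # -> Swedish patent 14: Mjölqvarn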
Maybe I should write a tool or bot that asks for these fields once and then, based on that information, uploads the file and creates the Index page. And my question was: has anybody already done that?
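If nobody has, a rough sketch with pywikibot could look like the following. The function name, the Index template parameters and the sv.wikisource target are placeholders of my own, not the interface of any existing tool:

    import pywikibot

    def upload_and_index(local_path, filename, description, source, date, author):
        """Upload a scan to Commons, then create a matching Index page.
        All names and templates below are illustrative only."""
        # Upload the file to Commons with an {{Information}} description.
        commons = pywikibot.Site('commons', 'commons')
        info = ("{{Information |Description=%s |Source=%s |Date=%s |Author=%s "
                "|Permission= |other_versions= }}"
                % (description, source, date, author))
        file_page = pywikibot.FilePage(commons, 'File:' + filename)
        commons.upload(file_page, source_filename=local_path, text=info,
                       comment='Uploading scan')

        # Create the Index page on the target Wikisource from the same answers.
        wikisource = pywikibot.Site('sv', 'wikisource')
        index = pywikibot.Page(wikisource, 'Index:' + filename)
        index.text = ("{{Index |Title=%s |Author=%s |Date=%s |Source=%s}}"
                      % (description, author, date, source))
        index.save(summary='Creating Index page for ' + filename)

The point is that the user answers the questions once and both pages are generated from the same data.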
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding djvu or pdf. All you need to do is create djvu files with a proper text layer.
You are correct, it does indeed work, but only after I do an action=purge on the PDF file on Commons. It never worked for me on the first try, without a purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
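For reference, the workaround I use is nothing more than a purge request against the Commons API (action=purge is a standard MediaWiki API module); here as a small Python sketch:

    import urllib.parse
    import urllib.request

    API = 'https://commons.wikimedia.org/w/api.php'

    def purge(title):
        """POST action=purge for one page title on Commons."""
        data = urllib.parse.urlencode({'action': 'purge',
                                       'titles': title,
                                       'format': 'json'}).encode('utf-8')
        request = urllib.request.Request(
            API, data=data, headers={'User-Agent': 'purge-example/0.1'})
        with urllib.request.urlopen(request) as response:
            return response.read().decode('utf-8')

    print(purge('File:Swedish patent 14 Mjölqvarn.pdf'))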
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF's text layer contains 8-bit data that is loaded into the UTF-8 edit form without conversion. Cut-and-paste from Acrobat Reader works fine.
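My guess is that the text layer is Latin-1; a small Python experiment (the Latin-1 assumption is mine) reproduces the symptom I see in the edit box:

    # Bytes for "Mjölqvarn" as stored in a Latin-1 (8-bit) text layer:
    raw = 'Mjölqvarn'.encode('latin-1')           # b'Mj\xf6lqvarn'

    # Copying those bytes into a UTF-8 page without conversion is invalid UTF-8:
    print(raw.decode('utf-8', errors='replace'))  # Mj?lqvarn (replacement character)

    # Converting first gives the expected result:
    print(raw.decode('latin-1'))                  # Mjölqvarn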