The problem is that djvu description pages on Commons
do not have a parsable format.

Many of them do have one. For example:
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
This looks very formalized and parsable to me.
I filled it in when I uploaded the file to
Commons, and exactly the same fields need to be
filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for
these fields once, and then uploads the file and
creates the Index page, based on that information.
And my question was: Has anybody already done that?
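A minimal sketch of what such a tool could look like: ask for the metadata fields once, then generate both the Commons {{Information}} block and a matching Index page from the same data. The Index template name and field names below are placeholders (the real template varies per Wikisource language edition), and the actual upload and page-save steps (e.g. via pywikibot) are left out.

```python
# Hypothetical sketch: one dict of fields drives both generated pages.
# Template/field names for the Index page are assumptions, not the real ones.

def information_template(fields):
    """Commons file-description wikitext built from one dict of fields."""
    return (
        "{{Information\n"
        f"|Description={fields['description']}\n"
        f"|Source={fields['source']}\n"
        f"|Date={fields['date']}\n"
        f"|Author={fields['author']}\n"
        "}}"
    )

def index_page(fields):
    """Wikisource Index wikitext from the same dict ({{Index}} is a placeholder name)."""
    return (
        "{{Index\n"
        f"|Title={fields['description']}\n"
        f"|Source={fields['source']}\n"
        f"|Date={fields['date']}\n"
        f"|Author={fields['author']}\n"
        "}}"
    )

fields = {
    "description": "Swedish patent 14: Mjölqvarn",
    "source": "Digitized by [http://nordiskapatent.se/",
    "date": "January 23, 1885",
    "author": "R. Setz, J. Schweiter, Clus, Switzerland",
}
```

The point is only that the fields are entered once; the two pieces of wikitext are then derived rather than retyped.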
What you describe is _already_ implemented:
when a page is created, its text is extracted
from the text layer of the corresponding djvu or pdf.
All you need to do is create djvu files with a proper text layer.
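For reference, a text layer can be added to an existing djvu with djvused's set-txt command, which reads the layer as an s-expression. The sketch below is simplified on two assumed points: the escaping rules (only backslash and double quote are handled) and the page geometry (one chunk covering a fixed page size); check the djvused manual before relying on it.

```python
# Sketch, assuming djvused's (page xmin ymin xmax ymax "text") s-expression
# format for set-txt. Escaping and geometry are simplified.

def djvu_escape(text):
    """Escape a string for use inside a djvused s-expression literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def page_sexpr(text, width=2550, height=3300):
    """Emit a single text chunk covering the whole page."""
    return f'(page 0 0 {width} {height} "{djvu_escape(text)}")'

# Usage: write page_sexpr(...) to page1.txt, then roughly:
#   djvused book.djvu -e 'select 1; set-txt page1.txt' -s
```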
You are correct, it does indeed work, but only
after I action=purge the PDF file on Commons.
It never worked for me on the first try,
without any purge. And I was misled by an
earlier bug where action=purge didn't help,
so it took me a while before I tested this.
So why is the purge necessary? If OCR text
extraction ever fails, why is this not detected
and automatically retried?
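Until that is fixed, the purge can at least be scripted instead of done by hand in the browser. The sketch below only builds the request for the standard MediaWiki API action=purge call (the file title is just an example); sending it, e.g. with urllib.request, is left to the caller.

```python
# Sketch: construct a MediaWiki action=purge POST request.
from urllib.parse import urlencode

def purge_request(title, api="https://commons.wikimedia.org/w/api.php"):
    """Return (url, POST body) for purging a page via the MediaWiki API."""
    body = urlencode({
        "action": "purge",          # standard API purge module
        "titles": title,
        "forcelinkupdate": "1",     # also refresh links/usage data
        "format": "json",
    })
    return api, body

# e.g.: url, body = purge_request("File:Example.pdf")
#       then POST body to url
```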
When I try to create
there is a character encoding error in the OCR text.
It looks as if the PDF contains 8-bit data, which
is copied into the UTF-8 edit form without conversion.
Cut-and-paste from Acrobat Reader works fine.
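The suspected bug is easy to reproduce: Latin-1 (8-bit) bytes dropped into a UTF-8 context without conversion are not valid UTF-8 and show up as replacement characters, while re-decoding the same raw bytes as Latin-1 recovers the text. A small demo, using the "Mjölqvarn" title from above (the assumption that the PDF's 8-bit encoding is Latin-1 is mine):

```python
# Demo of the suspected encoding bug and its repair.
raw = "Mjölqvarn".encode("latin-1")          # 8-bit bytes, e.g. from the PDF
garbled = raw.decode("utf-8", errors="replace")
assert "\ufffd" in garbled                   # 0xF6 alone is invalid UTF-8
repaired = raw.decode("latin-1")             # decode with the right charset
assert repaired == "Mjölqvarn"
```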
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se