ThomasV wrote:
> the problem is that djvu pages on Commons
> do not have a parsable format.
Many of them are. For example,
http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
contains:
{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}
This looks very formalized and parsable to me.
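Roughly, the fields can be pulled out with a few lines of Python. This is only a sketch: it assumes a single {{Information}} template, one field per line, and no nested templates inside the values, which is not always true on Commons.

    import re

    def parse_information(wikitext):
        """Extract |field=value pairs from an {{Information}} block.

        Sketch only: assumes one template per page, one field per
        line, and no nested templates inside the values.
        """
        match = re.search(r"\{\{Information(.*?)\}\}", wikitext, re.DOTALL)
        if not match:
            return {}
        fields = {}
        for line in match.group(1).splitlines():
            line = line.strip()
            if line.startswith("|") and "=" in line:
                key, _, value = line[1:].partition("=")
                fields[key.strip()] = value.strip()
        return fields

On the example above this yields Description, Source, Date and Author, plus the two empty fields.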
I filled it in when I uploaded the file to
Commons, and exactly the same fields need to be
filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for
these fields once, and then uploads the file and
creates the Index page, based on that information.
And my question was: Has anybody already done that?
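What I have in mind would not be much more than the following sketch: one set of answers feeds both the Commons description page and the Index page. The Index page template and field names below are placeholders, since the real ones depend on the local Proofread Page setup; the actual upload and page creation would then go through the API, for example with pywikibot.

    def build_pages(fields):
        """Build both wikitexts from one set of answers.

        Sketch only: the Index page markup is a placeholder; the
        real template and field names depend on the local Proofread
        Page configuration of each Wikisource.
        """
        commons_text = (
            "{{Information\n"
            "|Description=%(Description)s\n"
            "|Source=%(Source)s\n"
            "|Date=%(Date)s\n"
            "|Author=%(Author)s\n"
            "|Permission=\n"
            "|other_versions=\n"
            "}}" % fields
        )
        index_text = (
            "{{Index page\n"  # placeholder template name
            "|Title=%(Description)s\n"
            "|Source=%(Source)s\n"
            "|Year=%(Date)s\n"
            "|Author=%(Author)s\n"
            "}}" % fields
        )
        return commons_text, index_text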
ThomasV wrote:
> What you describe is _already_ implemented:
> when a page is created, its text is extracted
> from the text layer of the corresponding djvu or pdf.
> All you need to do is create djvu files with a proper text layer.
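For anyone who wants to add the text layer after the fact: djvused can attach hidden text to an existing file. A rough sketch (file names, page number and the coordinates in the s-expression are placeholders):

    import subprocess

    def set_text_layer(djvu_file, page, sexpr_file):
        """Attach OCR text to one page of a djvu file with djvused.

        sexpr_file holds the hidden-text layer in djvused's
        s-expression format, for example:
          (page 0 0 2480 3508
            (line 100 3300 900 3380
              (word 100 3300 500 3380 "Mjölqvarn")))
        """
        subprocess.run(
            ["djvused", djvu_file,
             "-e", "select %d; set-txt %s" % (page, sexpr_file),
             "-s"],  # -s writes the change back to the file
            check=True,
        )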
You are correct, it does indeed work, but only
after I run action=purge on the PDF file on Commons.
It never worked for me on the first try,
without any purge. And I was misled by an
earlier bug where action=purge didn't help,
so it took me a while before I tested this.
So why is the purge necessary? If OCR text
extraction ever fails, why is this not detected
and automatically retried?
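The purge itself is nothing special; through the API it amounts to something like this (Python with the requests library, just as an illustration):

    import requests

    def purge(title):
        """Ask Commons to re-render a page; the API counterpart of
        appending ?action=purge to the page URL."""
        response = requests.post(
            "https://commons.wikimedia.org/w/api.php",
            data={"action": "purge", "titles": title, "format": "json"},
        )
        response.raise_for_status()
        return response.json()

    purge("File:Swedish_patent_14_Mjölqvarn.pdf")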
When I try to create
http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1
there is a character encoding error in the OCR text.
It looks as if the PDF's text layer contains 8-bit
data that is inserted into the UTF-8 edit form without conversion.
Cut-and-paste from Acrobat Reader works fine.
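In miniature, what seems to be happening (the Latin-1 guess is mine; it could be some other 8-bit encoding):

    # 8-bit OCR output, here assumed to be Latin-1
    raw = "Mjölqvarn".encode("latin-1")    # b'Mj\xf6lqvarn'

    print(raw.decode("latin-1"))           # Mjölqvarn   - with conversion
    print(raw.decode("utf-8", "replace"))  # Mj�lqvarn   - dumped in as-is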
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se