Lars Aronsson wrote:
ThomasV wrote:
The problem is that DjVu pages on Commons
do not have a parsable format.
Many of them do. For example,
http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
contains:
{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}
This looks very formalized and parsable to me.
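To illustrate how parsable this is, here is a minimal Python sketch that reads the fields of such a template into a dictionary. The function name parse_information is hypothetical, and the sketch assumes the simple layout shown above: one "|key=value" per line, no nested templates, and a closing "}}" on its own line.

```python
SAMPLE = """{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}"""


def parse_information(wikitext):
    """Collect the |key=value lines of the first {{Information}} block."""
    fields = {}
    inside = False
    key = None
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("{{Information"):
            inside = True
            continue
        if not inside:
            continue
        if stripped == "}}":
            break  # end of the template
        if stripped.startswith("|") and "=" in stripped:
            # Split at the first "=" only, so values may contain "=".
            key, _, value = stripped[1:].partition("=")
            key = key.strip()
            fields[key] = value.strip()
        elif key is not None:
            # A line without a leading "|" continues the previous value.
            fields[key] = (fields[key] + " " + stripped).strip()
    return fields


fields = parse_information(SAMPLE)
```

The resulting dictionary could then be used to prefill the matching fields of an Index page. Real description pages with nested templates would need a proper wikitext parser.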
I filled it in when I uploaded the file to
Commons, and exactly the same fields need to be
filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for
these fields once, and then uploads the file and
creates the Index page, based on that information.
And my question was: Has anybody already done that?
Not to my knowledge.
It is possible to request the text with the API; why don't you try it?
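As a rough sketch of such a request, the following Python builds a MediaWiki API query URL for the wikitext of a file description page. The helper name build_wikitext_query is hypothetical, and the https endpoint under /w/api.php is an assumption about how Commons is reached. The URL is only constructed here, not fetched.

```python
from urllib.parse import urlencode

# Assumed API endpoint for Wikimedia Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_wikitext_query(title):
    """Return a URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",  # include the page text in the reply
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


query_url = build_wikitext_query("File:Swedish_patent_14_Mjölqvarn.pdf")
```

Fetching query_url, for example with urllib.request.urlopen, should return JSON containing the page text. Newer MediaWiki versions may also expect an rvslots parameter.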
What you describe is _already_ implemented:
when a page is created, its text is extracted
from the text layer of the corresponding djvu or pdf.
All you need to do is create djvu files with a proper text layer.
You are correct, it does indeed work, but only
after I action=purge the PDF file on Commons.
It never worked for me on the first try,
without any purge. And I was misled by an
earlier bug where action=purge didn't help,
so it took me a while before I tested this.
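For reference, a purge can also be requested through the API instead of the action=purge page URL. The sketch below only prepares the request and does not send it. The helper name build_purge_request and the https endpoint are assumptions, and the API is assumed to expect a POST for this action.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed API endpoint for Wikimedia Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_purge_request(title):
    """Prepare (but do not send) a POST request that purges one page."""
    body = urlencode({"action": "purge", "titles": title, "format": "json"})
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(API_URL, data=body.encode("ascii"))


purge_request = build_purge_request("File:Swedish_patent_14_Mjölqvarn.pdf")
```

Passing purge_request to urllib.request.urlopen would send it.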
So why is the purge necessary? If OCR text
extraction ever fails, why is this not detected
and automatically retried?
Purge is necessary only for files that were uploaded earlier, at a time when
text extraction was not yet performed. Note that text layer extraction for
PDF files is new.
When I try to create
http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1
there is a character encoding error in the OCR text.
It looks as if the PDF contains 8-bit data, which
is loaded into the UTF-8 form without conversion.
Cut-and-paste from Acrobat Reader works fine.
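The suspected failure can be reproduced in a few lines of Python. This is only an illustration under the assumption that the 8-bit data is Latin-1. The actual encoding inside the PDF text layer has not been confirmed and may be something else.

```python
text = "Mjölqvarn"

# Assumption: the PDF text layer stores the word as 8-bit Latin-1.
raw = text.encode("latin-1")

# Treating those bytes as UTF-8 without conversion damages the "ö".
broken = raw.decode("utf-8", errors="replace")

# Decoding with the right 8-bit encoding first gives clean Unicode,
# which can then be encoded as UTF-8 for the edit form.
fixed = raw.decode("latin-1")
utf8_bytes = fixed.encode("utf-8")
```

Here broken contains a replacement character where the "ö" should be, while fixed is identical to the original word.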
Yes, there is a conversion problem with PDF; it works better with DjVu.