2010/4/13 ThomasV <thomasV1@gmx.de>

Lars Aronsson wrote:

> ThomasV wrote:
>
>> the problem is that DjVu pages on Commons
>> do not have a parsable format.
>>
>
> Many of them do. For example,
> http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
> contains:
>
> {{Information
> |Description=Swedish patent 14: Mjölqvarn
> |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
> |Date=January 23, 1885
> |Author=R. Setz, J. Schweiter, Clus, Switzerland
> |Permission=
> |other_versions=
> }}
>
> This looks very formalized and parsable to me.
> I filled it in when I uploaded the file to
> Commons, and exactly the same fields need to be
> filled in manually in the newly created Index page.
>
> Maybe I should design a tool or bot that asks for
> these fields once, and then uploads the file and
> creates the Index page, based on that information.
> And my question was: Has anybody already done that?
>
>

Not to my knowledge.
It is possible to request the text with the API; why don't you try it?
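
A minimal sketch of what requesting the text through the API and reading the Information template might look like, in Python. The function names (build_query_url, fetch_wikitext, parse_information) and the User-Agent string are made up for illustration, and the parser is deliberately naive: it assumes one field per "|name=value" line, as in the example quoted above, and does not cope with nested multi-line templates. It returns None when the page has no Information template at all.

```python
import json
import urllib.parse
import urllib.request

# Standard MediaWiki Action API endpoint on Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query_url(title):
    """Build an API URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(title):
    """Fetch the wikitext of a page (performs a network request)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Placeholder User-Agent; a real tool should identify itself properly.
        headers={"User-Agent": "index-page-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def parse_information(wikitext):
    """Return the fields of the first {{Information}} template as a dict.

    Returns None if the page does not use the template. Naive by design:
    a line consisting only of "}}" is taken as the end of the template.
    """
    fields = None
    current = None
    for line in wikitext.splitlines():
        stripped = line.strip()
        if fields is None:
            # Still looking for the start of the template.
            if stripped.lower().startswith("{{information"):
                fields = {}
            continue
        if stripped == "}}":
            return fields
        if stripped.startswith("|") and "=" in stripped:
            key, value = stripped[1:].split("=", 1)
            current = key.strip()
            fields[current] = value.strip()
        elif current is not None:
            # Continuation line: append it to the previous field.
            fields[current] = (fields[current] + "\n" + stripped).strip()
    # Either no template was found (None) or it was never closed (partial dict).
    return fields
```

Calling parse_information(fetch_wikitext("File:Swedish_patent_14_Mjölqvarn.pdf")) would be the intended use. Empty parameters such as Permission come back as empty strings, so a caller can tell "field present but blank" apart from "no template on the page".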

Two problems.

1. I don't think all projects use this template, because it does not really fit books. German Wikisource, at least, has special templates for books, single pages, DjVu files, PDF files and so on. It mostly uses those for its Commons uploads rather than the generic Information template, which lacks the parameters needed to describe book data (author, publisher, place of publication, year of first publication, edition, year of publication of this edition, ...). And AFAIK de.WS is not the only project that uses specialized templates for its Commons uploads.

2. I'm not sure about this, but I think the index page has the same fields in all projects that use the extension. That would mean that the Information template does not contain the right data for filling in the index page. The index page on de.WS, at least, has separate fields for author, publisher, year of publication and place of publication, and we usually also link locally to the author and the title page. So from your example above, only one parameter (the source) is really usable (and I, at least, have usually linked to the Commons file in the source parameter of the WS index page). I'm not sure how much time the parse request would take, but it does not really look worth the effort (both programming it and later using it) considering how few usable values it returns.

IMO you could create a lot of index pages in the time you would spend figuring out whether it is possible to extract, parse and interpret the data from Commons. Too many uploads use no template at all, use other templates, or use this template but fill it out in an unusable way (everybody has a slightly different style, even when using templates). Even in the good cases it lacks half the needed information, and the rest still needs formatting. The benefit is quite small compared to the amount of work it requires.

But hey, if you have spare time, it would still be interesting to know whether you can get the data in a way that does not slow down users with slower internet connections.

Cecil