Re: [Wikisource-l] PDF/Djvu to Index

13 Apr 2010

2010/4/13 ThomasV <thomasV1(a)gmx.de>

...
  Lars Aronsson wrote:
  ThomasV wrote:

  the problem is that djvu pages on Commons
 do not have a parsable format.

 Many of them do. For example,

 http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
  contains:

 {{Information
 |Description=Swedish patent 14: Mjölqvarn
 |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
 |Date=January 23, 1885
 |Author=R. Setz, J. Schweiter, Clus, Switzerland
 |Permission=
 |other_versions=
 }}

 This looks very formalized and parsable to me.
 I filled it in when I uploaded the file to
 Commons, and exactly the same fields need to be
 filled in manually in the newly created Index page.

 Maybe I should design a tool or bot that asks for
 these fields once, and then uploads the file and
 creates the Index page, based on that information.
 And my question was: Has anybody already done that?

  Not to my knowledge.
 It is possible to request the text with the API; why don't you try it?
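
A minimal sketch of the parsing step discussed above, assuming the page
wikitext has already been fetched (for instance through the MediaWiki API) and
that the template is written one parameter per line, as in the quoted example.
The function name parse_information is made up for illustration; nested
templates and multi-line values are not handled.

```python
def parse_information(wikitext):
    """Return the parameters of the first {{Information}} block as a dict.

    Assumes one "|name=value" pair per line and a closing "}}" on its
    own line, as in the example from the thread.
    """
    fields = {}
    inside = False
    for raw in wikitext.splitlines():
        line = raw.strip()
        if not inside:
            # Skip everything until the template opens.
            if line.lower().startswith("{{information"):
                inside = True
            continue
        if line == "}}":
            break
        if line.startswith("|") and "=" in line:
            # Split on the first "=" only, so values may contain "=".
            name, value = line[1:].split("=", 1)
            fields[name.strip()] = value.strip()
    return fields


SAMPLE = """{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}"""

if __name__ == "__main__":
    for name, value in parse_information(SAMPLE).items():
        print(name, "=>", value)
```

Pages that use a different template, or none at all, simply yield an empty
dict here, so a caller would still have to fall back to manual entry.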

Two problems.

1. I don't think all projects are using this template, as it does not really
fit books. At least German Wikisource has special templates for books,
single pages, DjVu files, PDF files and so on. It mostly uses those for its
Commons uploads rather than the non-specific Information template, because
that template lacks the parameters needed to describe book data (author,
publisher, place of publication, year of first publication, published
version, year of publication of this version, ...). And AFAIK de.WS is not
the only project that uses specialized templates for its Commons uploads.

2. I'm not sure about this, but I think the Index page has the same fields in
all projects that use the extension. That would mean that the Information
template does not contain the right data for filling in the Index page. At
least the Index page on de.WS has separate fields for author, publisher, year
of publication and place of publication, and we usually also link locally to
the author and the title page. So from your example above, only one parameter
(the source) is really usable (and I, at least, usually linked to the Commons
file in the source parameter on the WS Index page). I'm not sure how much
time the parse request would take, but it does not look worth the time (both
programming it and later using it) considering how little usable data it
returns.
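
To illustrate the second point, here is a rough sketch of what pre-filling an
Index page from the parsed template would give for the example above. The
field names in INDEX_FIELDS are hypothetical stand-ins for the separate fields
mentioned here, not the real de.WS parameter names, and the function
prefill_index is invented for this example.

```python
# Hypothetical Index page fields; the real parameter names may differ.
INDEX_FIELDS = ["title", "author", "publisher", "place", "year", "source"]


def prefill_index(info):
    """Copy what is usable from an Information dict into Index fields.

    Only the source is carried over, since the other Index fields need
    data (or local links) that the Information template does not hold.
    Returns the pre-filled fields and the names still left empty.
    """
    index = dict.fromkeys(INDEX_FIELDS, "")
    index["source"] = info.get("Source", "")
    missing = [name for name in INDEX_FIELDS if not index[name]]
    return index, missing


if __name__ == "__main__":
    info = {
        "Description": "Swedish patent 14: Mjölqvarn",
        "Source": "Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]",
        "Date": "January 23, 1885",
        "Author": "R. Setz, J. Schweiter, Clus, Switzerland",
    }
    index, missing = prefill_index(info)
    print("filled:", index["source"])
    print("still to enter by hand:", ", ".join(missing))
```

Under these assumptions five of the six fields still have to be typed in by
hand, which is the imbalance described above.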

IMO you could create a lot of Index pages in the time you would spend
figuring out whether it is possible to extract, parse and interpret the data
from Commons. Too many uploads use no template, use other templates, or fill
out this template in an unusable way (everybody has a slightly different
style even when using templates), and even then it lacks half the needed
information while the rest still needs formatting. The benefit is quite small
compared to the amount of work it requires.

But hey, if you have spare time, it would still be interesting to know
whether you can get the data in a way that would not slow down users with
not-so-fast internet connections.

Cecil
