A quick comment on Lars's last post:
I spoke at the Museums and the Web conference last Tuesday in Denver, and someone from the Library of Congress came. They aren't a museum, but they are working on a small-scale Wikipedia project to help illustrate articles in WP with deep citations into some of their more interesting public domain holdings (which are a bit museum-like).
She mentioned that she was most interested in pursuing a further Wikisource project, as they have many unique works (journals, manuscripts and the like) which are only available in their original language -- say, a journal of a French botanist -- and which deserve to be translated into other languages.
They would like to help publish digital scans and existing native-language text (cleaned-up OCR) to Wikisource, in the hope that translations can be made into English and other languages.
They are interested in identifying the 'most interesting works' in their untranslated collections, and happy to have some discussion about what makes works interesting. One of the things I like about this idea is that, as curators of one of the world's largest international libraries, they have a broad sense of 'notability' and interest in a Wikisource sense which we currently lack on the project... so this could help drive style guide improvements as well.
Thoughts? If anyone is specifically interested in this collaboration, let me know and I'll put you in touch with the organizer. Of course, once this is more than a pipe dream, there will be a public project page... but early interest now could help frame the initial proposal.
SJ
On Wed, Apr 21, 2010 at 8:56 AM, wikisource-l-request@lists.wikimedia.org wrote:
Today's Topics:
1. Re: PDF/Djvu to Index (ThomasV)
2. Re: PDF/Djvu to Index (Cecil)
3. Strategic Planning Office Hours (Philippe Beaudette)
4. Experience, funding, outreach (Lars Aronsson)
5. Re: Experience, funding, outreach (Sydney Poore)
Message: 1
Date: Tue, 13 Apr 2010 14:56:38 +0200
From: ThomasV <thomasV1@gmx.de>
Subject: Re: [Wikisource-l] PDF/Djvu to Index
Lars Aronsson wrote:
ThomasV wrote:
The problem is that DjVu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
Not to my knowledge. It is possible to request the text with the API; why don't you try it?
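To give an idea, a rough sketch of such an API request in Python -- not a finished bot: the template parsing is deliberately naive, error handling and the actual creation of the Index page are left out, and the file title is the one from Lars's example:

    import re
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def fetch_wikitext(title):
        """Fetch the current wikitext of a Commons page via the MediaWiki API."""
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    def parse_information(wikitext):
        """Naively collect |field=value pairs from an {{Information}} template."""
        return {name.strip(): value.strip()
                for name, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|}]*)", wikitext)}

    info = parse_information(fetch_wikitext("File:Swedish patent 14 Mjölqvarn.pdf"))
    print(info.get("Description"), info.get("Author"), info.get("Date"))
    # A bot would then reuse these values when creating the Index page on Wikisource.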
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding DjVu or PDF. All you need to do is create DjVu files with a proper text layer.
You are correct, it does indeed work, but only after I action=purge the PDF file on Commons. It never worked for me on the first try, without any purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
Purge is necessary only for files that were uploaded before text extraction was performed at upload. Note that text-layer extraction for PDF files is new.
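For what it's worth, the purge can also be sent through the API instead of the web interface; a minimal sketch in Python (that this alone re-triggers the text-layer extraction for older uploads is my assumption):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    # action=purge must be sent as a POST request
    resp = requests.post(API, data={
        "action": "purge",
        "titles": "File:Swedish patent 14 Mjölqvarn.pdf",
        "format": "json",
    })
    print(resp.json())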
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF contains 8-bit data, which is loaded into the UTF-8 form without conversion. Cut-and-paste from Acrobat Reader works fine.
Yes, there is a conversion problem with PDF; it works better with DjVu.
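A minimal illustration of what the mismatch presumably looks like, assuming the PDF text layer is Latin-1 (which the '?' in 'Mj?lqvarn' suggests):

    # 8-bit bytes as they might sit in the PDF text layer
    raw = b"Mj\xf6lqvarn"

    # Treating them as UTF-8 without conversion loses the "ö"
    broken = raw.decode("utf-8", errors="replace")   # "Mj\ufffdlqvarn", displayed as Mj?lqvarn

    # Decoding as Latin-1 first gives the intended text
    fixed = raw.decode("latin-1")                     # "Mjölqvarn"
    print(broken, fixed)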
Message: 2
Date: Tue, 13 Apr 2010 16:43:47 +0300
From: Cecil <cecilatwp@gmail.com>
Subject: Re: [Wikisource-l] PDF/Djvu to Index
Two problems with using the {{Information}} template to pre-fill the Index page:
- I don't think all projects use this template, because it does not really fit books: German Wikisource, at least, has special templates for books, single pages, DjVu files, PDF files and so on. It mostly uses those for its Commons uploads rather than the generic Information template, which lacks the parameters needed to describe book data (author, publisher, place of publication, year of first publication, edition, year of publication of this edition, ...). And AFAIK de.WS is not the only project that uses specialized templates for its Commons uploads.
- I'm not sure about this, but I think the Index page has the same fields in all projects that use the extension. That would mean the Information template does not contain the right data for filling the Index page. The Index page on de.WS, at least, has separate fields for author, publisher, year of publication and place of publication, and we usually also link locally to the author and the title page. So from your example above only one parameter (the source) is really usable, and I have usually linked to the Commons file in the source parameter on the Wikisource Index page anyway (see the sketch below). I'm not sure how much time the parse request would take, but considering its usable return values it does not look worth the time, either to program it or to use it later.
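To make that concrete, a purely hypothetical sketch; the Index-page field names below are illustrative stand-ins, not the real de.WS template parameters:

    # Fields taken from Lars's {{Information}} example
    information = {
        "Description": "Swedish patent 14: Mjölqvarn",
        "Source": "Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]",
        "Date": "January 23, 1885",
        "Author": "R. Setz, J. Schweiter, Clus, Switzerland",
    }

    # Hypothetical Index-page fields (illustrative names only)
    index_page = {
        "Author": None,                   # should be a local link to the author page, not free text
        "Publisher": None,                # not present in {{Information}} at all
        "Place": None,                    # not present in {{Information}} at all
        "Year": None,                     # "Date" is the patent date, not a year of publication
        "Source": information["Source"],  # about the only field that carries over directly
    }
    print(index_page)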
IMO you could create a lot of Index pages in the time you would spend figuring out whether it is possible to extract, parse and interpret the data from Commons. Too many uploads use no template at all, use other templates, or fill this template out in an unusable way (everybody has a slightly different style even when using templates), and even then it lacks half of the needed information while the rest still needs formatting. The benefit is quite small compared to the amount of work it requires.
But hey, if you have spare time, it would still be interesting to know whether you can get the data in a way that does not slow things down for users with not-so-fast internet connections.
Cecil