A quick comment on Lars's last post:
I spoke at the Museums and the Web conference last Tuesday in Denver,
and someone from the Library of Congress came. They aren't a museum,
but they are working on a small-scale Wikipedia project to help
illustrate articles on WP with deep citations into some of their more
interesting public-domain holdings (which are a bit museum-like).
She mentioned that she was most interested in pursuing a further
Wikisource project, as they have many unique works (journals,
manuscripts, and the like) which are available only in their original
language -- say, the journal of a French botanist -- and which deserve
to be translated into other languages.
They would like to help publish digital scans and existing
native-language text (cleaned-up OCR) to Wikisource, in the hope that
translations can be made into English and other languages.
They are interested in identifying the 'most interesting works' in
their untranslated collections, and happy to have some discussion
about what makes a work interesting. One of the things I like about
this idea is that, as curators of one of the world's largest
international libraries, they have a broad sense of 'notability' and
interest in a Wikisource sense which we currently lack on the
project... so this could help drive style-guide improvements as well.
Thoughts? If anyone is specifically interested in this collaboration,
let me know and I'll put you in touch with the organizer. Of course
once this is more than a pipe dream there will be a public project
page... but early interest now could help frame the initial proposal.
SJ
On Wed, Apr 21, 2010 at 8:56 AM,
<wikisource-l-request(a)lists.wikimedia.org> wrote:
> Today's Topics:
>
> 1. Re: PDF/Djvu to Index (ThomasV)
> 2. Re: PDF/Djvu to Index (Cecil)
> 3. Strategic Planning Office Hours (Philippe Beaudette)
> 4. Experience, funding, outreach (Lars Aronsson)
> 5. Re: Experience, funding, outreach (Sydney Poore)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 13 Apr 2010 14:56:38 +0200
> From: ThomasV <thomasV1(a)gmx.de>
> Subject: Re: [Wikisource-l] PDF/Djvu to Index
> To: "discussion list for Wikisource, the free library"
> <wikisource-l(a)lists.wikimedia.org>
>
> Lars Aronsson wrote:
>> ThomasV wrote:
>>
>>> the problem is that DjVu pages on Commons
>>> do not have a parsable format.
>>>
>>
>> Many of them are. For example,
>> http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
>> contains:
>>
>> {{Information
>> |Description=Swedish patent 14: Mjölqvarn
>> |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
>> |Date=January 23, 1885
>> |Author=R. Setz, J. Schweiter, Clus, Switzerland
>> |Permission=
>> |other_versions=
>> }}
>>
>> This looks very formalized and parsable to me.
>> I filled it in when I uploaded the file to
>> Commons, and exactly the same fields need to be
>> filled in manually in the newly created Index page.
>>
>> Maybe I should design a tool or bot that asks for
>> these fields once, and then uploads the file and
>> creates the Index page, based on that information.
>> And my question was: Has anybody already done that?
>>
>>
> Not to my knowledge.
> It is possible to request the text with the API; why don't you try it?
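ThomasV's suggestion can be tried offline first: the wikitext of a Commons file page (retrievable with an API query such as action=query&prop=revisions&rvprop=content&format=json) contains the {{Information}} template, and a small parser can turn that into a field dict. A minimal sketch, assuming one |field=value pair per line and no nested templates inside the values:

```python
import re

def parse_information(wikitext):
    """Extract |field=value pairs from an {{Information}} template.
    Naive line-based parse: assumes one field per line and no nested
    templates inside the values."""
    fields = {}
    m = re.search(r"\{\{Information(.*?)\}\}", wikitext, re.DOTALL | re.IGNORECASE)
    if not m:
        return fields
    for line in m.group(1).splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
    return fields

# Sample taken from the message above:
sample = """{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}"""

info = parse_information(sample)
print(info["Author"])  # R. Setz, J. Schweiter, Clus, Switzerland
```

As Cecil notes further down, this only helps for uploads that actually use the generic template; the parser above is a starting point, not a general solution.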
>
>>
>>> What you describe is _already_ implemented:
>>> when a page is created, its text is extracted
>>> from the text layer of the corresponding djvu or pdf.
>>> All you need to do is create djvu files with a proper text layer.
>>>
>>
>> You are correct, it does indeed work, but only
>> after I action=purge the PDF file on Commons.
>> It never worked for me on the first try,
>> without any purge. And I was misled by an
>> earlier bug where action=purge didn't help,
>> so it took me a while before I tested this.
>>
>> So why is the purge necessary? If OCR text
>> extraction ever fails, why is this not detected
>> and automatically retried?
>>
> A purge is necessary only for files that were uploaded earlier, when
> text extraction was not yet performed. Note that text-layer extraction
> for PDF files is new.
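For anyone hitting this with an older upload, the purge ThomasV mentions is an ordinary API call (action=purge, which must be sent as a POST). A sketch that only builds the request, so the actual network send stays explicit:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def purge_request(title):
    """Build the endpoint URL and POST body for an action=purge call,
    which forces MediaWiki to re-render the page (and, per the discussion
    above, re-extract the text layer of an older DjVu/PDF upload)."""
    body = urlencode({"action": "purge", "titles": title, "format": "json"})
    return API, body

url, body = purge_request("File:Swedish_patent_14_Mjölqvarn.pdf")
# Send with e.g. urllib.request.urlopen(url, body.encode("ascii")) --
# purge is rejected as a GET, so the body must go in a POST.
```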
>
>> When I try to create
>> http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1
>> there is a character encoding error in the OCR text.
>> It looks as if the PDF contains 8-bit data, which
>> is loaded into the UTF-8 form without conversion.
>> Cut-and-paste from Acrobat Reader works fine.
>>
> Yes, there is a conversion problem with PDF; it works better with DjVu.
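The symptom Lars describes (8-bit bytes carried into a UTF-8 page without conversion) typically surfaces as mojibake such as 'MjÃ¶lqvarn'. When the damage is UTF-8 bytes mis-decoded as Latin-1, it is mechanically reversible; a hedged sketch of that one repair (it cannot help once characters have already been replaced by '?', as in parts of this archive):

```python
def fix_mojibake(text):
    """Reverse the common mis-decoding where UTF-8 bytes were read as
    Latin-1 (e.g. 'Mj\u00c3\u00b6lqvarn' -> 'Mjölqvarn'). Returns the
    input unchanged when it does not fit that pattern, since other
    corruption (bytes already replaced by '?') is not reversible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake("Mj\u00c3\u00b6lqvarn"))  # Mjölqvarn
```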
>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 13 Apr 2010 16:43:47 +0300
> From: Cecil <cecilatwp(a)gmail.com>
> Subject: Re: [Wikisource-l] PDF/Djvu to Index
> To: "discussion list for Wikisource, the free library"
> <wikisource-l(a)lists.wikimedia.org>
>
> 2010/4/13 ThomasV <thomasV1(a)gmx.de>
>
>> Not to my knowledge.
>> It is possible to request the text with the API; why don't you try it?
>>
>>
>
> Two problems.
>
> 1. I don't think all projects use this template, since it doesn't really
> fit books: German Wikisource, at least, has special templates for books,
> single pages, DjVu files, PDF files and so on. It mostly uses those for
> its Commons uploads rather than the generic Information template, because
> that template lacks the parameters needed to describe book data (author,
> publisher, place of publication, year of first publication, edition,
> year of publication of that edition, ...). And AFAIK de.WS is not the
> only project that uses specialized templates for its Commons uploads.
>
> 2. I'm not sure about this, but I think the Index page has the same
> fields in all projects that use the extension. That would mean the
> Information template does not contain the right data for filling in the
> Index page. The Index page on de.WS, at least, has separate fields for
> author, publisher, year of publication and place of publication, and we
> usually also link locally to the author and the title page. So from your
> example above only one parameter (the source) is really usable (and I
> usually linked to the Commons file in the source parameter on the WS
> Index page anyway). I'm not sure how much time the parse request would
> take, but it doesn't look worth the time (both programming it and later
> using it) considering its usable return values.
>
> IMO you could create a lot of Index pages in the time you'd spend
> figuring out whether it is possible to extract, parse and interpret the
> data from Commons. There are too many uploads that use no template at
> all, or other templates, or fill this template out in an unusable way
> (everybody has a slightly different style, even with templates), and even
> then it lacks half the needed information while the rest still needs
> formatting. The benefit is quite small compared to the work it requires.
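That said, for anyone who does want to try Lars's one-shot tool, the generation side is simple once clean metadata exists; the hard part Cecil identifies is getting that metadata. A sketch where both the template name and the field names are placeholders (real Index pages use whatever fields each wiki's ProofreadPage setup defines):

```python
INDEX_FIELDS = ["Title", "Author", "Year", "Source"]  # illustrative names only

def build_index_wikitext(meta):
    """Render Index-page wikitext from a metadata dict (e.g. the output
    of an {{Information}} parser). Missing fields are emitted empty so a
    human can fill them in, matching how Index pages are edited by hand."""
    lines = ["{{Index"]
    for field in INDEX_FIELDS:
        lines.append(f"|{field}={meta.get(field, '')}")
    lines.append("}}")
    return "\n".join(lines)

page = build_index_wikitext({"Title": "Swedish patent 14", "Year": "1885"})
```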
>
> But hey, if you have spare time, it would still be interesting to know
> whether you can get the data in a way that doesn't slow down users on
> not-so-fast internet connections.
>
> Cecil
>