A quick comment on Lars's last post:
I spoke at the Museums and the Web conference last Tuesday in Denver, and someone from the Library of Congress came. They aren't a museum, but they are working on a small-scale Wikipedia project to help illustrate articles in WP with deep citations into some of their more interesting public domain holdings (which are a bit museum-like).
She mentioned that she was most interested in pursuing a further Wikisource project, as they have many unique works (journals, manuscripts and the like) which are only available in their original language -- say, a journal of a French botanist -- and which deserve to be translated into other languages.
They would like to help publish digital scans and existing native-language text (cleaned-up OCR) to Wikisource, in the hope that translations can be made into English and other languages.
They are interested in identifying the 'most interesting works' in their untranslated collections, and happy to have some discussion about what makes works interesting. One of the things I like about this idea is that, as curators of one of the world's largest international libraries, they have a broad sense of 'notability' and interest in a Wikisource sense which we currently lack on the project... so this could help drive style guide improvements as well.
Thoughts? If anyone is specifically interested in this collaboration, let me know and I'll put you in touch with the organizer. Of course, once this is more than a pipe dream, there will be a public project page... but early interest now could help frame the initial proposal.
SJ
On Wed, Apr 21, 2010 at 8:56 AM, wikisource-l-request@lists.wikimedia.org wrote:
Today's Topics:
1. Re: PDF/Djvu to Index (ThomasV)
2. Re: PDF/Djvu to Index (Cecil)
3. Strategic Planning Office Hours (Philippe Beaudette)
4. Experience, funding, outreach (Lars Aronsson)
5. Re: Experience, funding, outreach (Sydney Poore)
Message: 1
Date: Tue, 13 Apr 2010 14:56:38 +0200
From: ThomasV <thomasV1@gmx.de>
Subject: Re: [Wikisource-l] PDF/Djvu to Index
Lars Aronsson wrote:
ThomasV wrote:
The problem is that DjVu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information
|Description=Swedish patent 14: Mjölqvarn
|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
|Date=January 23, 1885
|Author=R. Setz, J. Schweiter, Clus, Switzerland
|Permission=
|other_versions=
}}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
Not to my knowledge. It is possible to request the text with the API; why don't you try it?
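To give an idea, a rough sketch of such an API request in Python -- not a finished bot: the template parsing is deliberately naive, error handling and the actual creation of the Index page are left out, and the file title is the one from Lars's example:

    import re
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def fetch_wikitext(title):
        """Fetch the current wikitext of a Commons page via the MediaWiki API."""
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    def parse_information(wikitext):
        """Naively collect |field=value pairs from an {{Information}} template."""
        return {name.strip(): value.strip()
                for name, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|}]*)", wikitext)}

    info = parse_information(fetch_wikitext("File:Swedish patent 14 Mjölqvarn.pdf"))
    print(info.get("Description"), info.get("Author"), info.get("Date"))
    # A bot would then reuse these values when creating the Index page on Wikisource.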
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding DjVu or PDF. All you need to do is create DjVu files with a proper text layer.
You are correct, it does indeed work, but only after I action=purge the PDF file on Commons. It never worked for me on the first try, without any purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
Purge is necessary only for files that were uploaded before text extraction was performed at upload. Note that text-layer extraction for PDF files is new.
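For what it's worth, the purge can also be sent through the API instead of the web interface; a minimal sketch in Python (that this alone re-triggers the text-layer extraction for older uploads is my assumption):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    # action=purge must be sent as a POST request
    resp = requests.post(API, data={
        "action": "purge",
        "titles": "File:Swedish patent 14 Mjölqvarn.pdf",
        "format": "json",
    })
    print(resp.json())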
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF contains 8-bit data, which is loaded into the UTF-8 form without conversion. Cut-and-paste from Acrobat Reader works fine.
Yes, there is a conversion problem with PDF; it works better with DjVu.
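A minimal illustration of what the mismatch presumably looks like, assuming the PDF text layer is Latin-1 (which the '?' in 'Mj?lqvarn' suggests):

    # 8-bit bytes as they might sit in the PDF text layer
    raw = b"Mj\xf6lqvarn"

    # Treating them as UTF-8 without conversion loses the "ö"
    broken = raw.decode("utf-8", errors="replace")   # "Mj\ufffdlqvarn", displayed as Mj?lqvarn

    # Decoding as Latin-1 first gives the intended text
    fixed = raw.decode("latin-1")                     # "Mjölqvarn"
    print(broken, fixed)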
Message: 2
Date: Tue, 13 Apr 2010 16:43:47 +0300
From: Cecil <cecilatwp@gmail.com>
Subject: Re: [Wikisource-l] PDF/Djvu to Index
Two problems with using the {{Information}} template to pre-fill the Index page:
- I don't think all projects use this template, because it does not really fit books: German Wikisource, at least, has special templates for books, single pages, DjVu files, PDF files and so on. It mostly uses those for its Commons uploads rather than the generic Information template, which lacks the parameters needed to describe book data (author, publisher, place of publication, year of first publication, edition, year of publication of this edition, ...). And AFAIK de.WS is not the only project that uses specialized templates for its Commons uploads.
- I'm not sure about this, but I think the Index page has the same fields in all projects that use the extension. That would mean the Information template does not contain the right data for filling the Index page. The Index page on de.WS, at least, has separate fields for author, publisher, year of publication and place of publication, and we usually also link locally to the author and the title page. So from your example above only one parameter (the source) is really usable, and I have usually linked to the Commons file in the source parameter on the Wikisource Index page anyway (see the sketch below). I'm not sure how much time the parse request would take, but considering its usable return values it does not look worth the time, either to program it or to use it later.
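To make that concrete, a purely hypothetical sketch; the Index-page field names below are illustrative stand-ins, not the real de.WS template parameters:

    # Fields taken from Lars's {{Information}} example
    information = {
        "Description": "Swedish patent 14: Mjölqvarn",
        "Source": "Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]",
        "Date": "January 23, 1885",
        "Author": "R. Setz, J. Schweiter, Clus, Switzerland",
    }

    # Hypothetical Index-page fields (illustrative names only)
    index_page = {
        "Author": None,                   # should be a local link to the author page, not free text
        "Publisher": None,                # not present in {{Information}} at all
        "Place": None,                    # not present in {{Information}} at all
        "Year": None,                     # "Date" is the patent date, not a year of publication
        "Source": information["Source"],  # about the only field that carries over directly
    }
    print(index_page)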
IMO you could create a lot of Index pages in the time you would spend figuring out whether it is possible to extract, parse and interpret the data from Commons. Too many uploads use no template at all, use other templates, or fill this template out in an unusable way (everybody has a slightly different style even when using templates), and even then it lacks half of the needed information while the rest still needs formatting. The benefit is quite small compared to the amount of work it requires.
But hey, if you have spare time, it would still be interesting to know whether you can get the data in a way that does not slow things down for users with not-so-fast internet connections.
Cecil