It is increasingly common to add books to Wikisource by finding a PDF or DjVu file, uploading it to Commons, and then creating an Index: page on Wikisource for proofreading.
But this would be much easier if:
1) The fields (author, title, etc.) of the Index page were filled in from the data already given on Commons. (Yes, those could be wrong or need additional care, but they could always be edited afterwards if initial values are fetched from Commons.)
2) The <pagelist/> tag was already in the "pages" box.
3) All pages were created automatically with the OCR text from Commons, instead of leaving a long list of red links. (This would require the text for each page to be extracted, something that pdftotext can do in seconds but that Commons takes weeks to do; see the sketch below.)
Could this be automated? Is there already some tool or bot that does this?
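As a minimal sketch of the extraction step mentioned in point 3, assuming pdftotext (from Poppler/Xpdf) is installed locally; the file name "book.pdf" and the ten-page range are only placeholders:

    import subprocess

    def extract_page_text(pdf_path: str, page: int) -> str:
        """Return the text layer of a single PDF page via pdftotext."""
        result = subprocess.run(
            ["pdftotext", "-f", str(page), "-l", str(page),
             "-enc", "UTF-8", pdf_path, "-"],   # "-" sends the text to stdout
            capture_output=True, check=True,
        )
        return result.stdout.decode("utf-8")

    if __name__ == "__main__":
        # One text file per page; "book.pdf" and the range 1-10 are placeholders.
        for page in range(1, 11):
            with open(f"page_{page:04d}.txt", "w", encoding="utf-8") as out:
                out.write(extract_page_text("book.pdf", page))

Each output file then holds the text that would go into the corresponding Page: page.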
Lars Aronsson wrote:
It is increasingly common to add books to Wikisource by finding a PDF or DjVu file, uploading it to Commons, and then creating an Index: page on Wikisource for proofreading.
But this would be much easier if:
- The fields (author, title, etc.) of the Index page were filled in from the data already given on Commons. (Yes, those could be wrong or need additional care, but they could always be edited afterwards if initial values are fetched from Commons.)
The problem is that DjVu pages on Commons do not have a parsable format.
- The <pagelist/> tag was already in the "pages" box.
That's easy. I did it for sites using http://wikisource.org/wiki/MediaWiki:IndexForm.js
- All pages were created automatically with the OCR text from Commons, instead of leaving a long list of red links. (This would require the text for each page to be extracted, something that pdftotext can do in seconds but that Commons takes weeks to do.)
I do not understand what you mean. What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding DjVu or PDF. All you need to do is create DjVu files with a proper text layer.
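Since a proper text layer is the one prerequisite named here, a quick local check before uploading can save trouble later. A small sketch, assuming djvutxt (DjVuLibre) and pdftotext (Poppler) are installed; "scan.djvu" is a placeholder and has_text_layer is just a made-up helper name:

    import subprocess

    def has_text_layer(path: str) -> bool:
        """Return True if the file's hidden text layer is non-empty."""
        if path.lower().endswith(".djvu"):
            cmd = ["djvutxt", path]            # prints the hidden text to stdout
        else:
            cmd = ["pdftotext", path, "-"]     # "-" sends the text to stdout
        text = subprocess.run(cmd, capture_output=True, check=True).stdout
        return bool(text.strip())

    print(has_text_layer("scan.djvu"))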
ThomasV wrote:
The problem is that DjVu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information |Description=Swedish patent 14: Mjölqvarn |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] |Date=January 23, 1885 |Author=R. Setz, J. Schweiter, Clus, Switzerland |Permission= |other_versions= }}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
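For anyone who wants to try such a tool: a rough sketch of the parsing half, working directly on the wikitext of the Commons description page (fetching that wikitext through the API is shown further down in this thread). The regular expression only covers the simple one-template case from the example above, and parse_information_template is a made-up name, not an existing function anywhere:

    import re

    def parse_information_template(wikitext: str) -> dict:
        """Extract the |field=value pairs of a {{Information}} template."""
        match = re.search(r"\{\{Information(.*?)\}\}", wikitext,
                          re.DOTALL | re.IGNORECASE)
        if not match:
            return {}
        fields = {}
        for part in match.group(1).split("|")[1:]:
            if "=" in part:
                key, _, value = part.partition("=")
                fields[key.strip().lower()] = value.strip()
        return fields

    wikitext = ("{{Information |Description=Swedish patent 14: Mjölqvarn "
                "|Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] "
                "|Date=January 23, 1885 "
                "|Author=R. Setz, J. Schweiter, Clus, Switzerland "
                "|Permission= |other_versions= }}")
    print(parse_information_template(wikitext))
    # {'description': 'Swedish patent 14: Mjölqvarn', 'source': '...', 'date': 'January 23, 1885', ...}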
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding DjVu or PDF. All you need to do is create DjVu files with a proper text layer.
You are correct, it does indeed work, but only after I action=purge the PDF file on Commons. It never worked for me on the first try, without any purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
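For what it's worth, the purge does not have to be done by hand; it can be sent through the MediaWiki API. A minimal sketch, assuming the Python requests library and a current API (which expects a POST for action=purge); the User-Agent string is invented for the example:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def purge(title: str) -> None:
        """Ask Commons to purge a page, e.g. to re-run text-layer extraction."""
        response = requests.post(API, data={
            "action": "purge",
            "titles": title,
            "format": "json",
        }, headers={"User-Agent": "index-prefill-sketch/0.1 (example)"})
        response.raise_for_status()
        print(response.json())

    purge("File:Swedish_patent_14_Mjölqvarn.pdf")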
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF contains 8-bit data, which is loaded into the UTF-8 form without conversion. Cut-and-paste from Acrobat Reader works fine.
Lars Aronsson wrote:
ThomasV wrote:
The problem is that DjVu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information |Description=Swedish patent 14: Mjölqvarn |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] |Date=January 23, 1885 |Author=R. Setz, J. Schweiter, Clus, Switzerland |Permission= |other_versions= }}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
Not to my knowledge. It is possible to request the text with the API; why don't you try it?
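A hedged sketch of that suggestion: ask the Commons API for the wikitext of the File: description page, which could then be fed to a parser like the one sketched earlier in the thread. It assumes the Python requests library and today's API parameters (rvslots/formatversion are newer than this thread); fetch_description_wikitext is a made-up name:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def fetch_description_wikitext(title: str) -> str:
        """Return the wikitext of a Commons page, e.g. a File: description page."""
        response = requests.get(API, params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        }, headers={"User-Agent": "index-prefill-sketch/0.1 (example)"})
        response.raise_for_status()
        page = response.json()["query"]["pages"][0]
        return page["revisions"][0]["slots"]["main"]["content"]

    print(fetch_description_wikitext("File:Swedish_patent_14_Mjölqvarn.pdf"))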
What you describe is _already_ implemented: when a page is created, its text is extracted from the text layer of the corresponding DjVu or PDF. All you need to do is create DjVu files with a proper text layer.
You are correct, it does indeed work, but only after I action=purge the PDF file on Commons. It never worked for me on the first try, without any purge. And I was misled by an earlier bug where action=purge didn't help, so it took me a while before I tested this.
So why is the purge necessary? If OCR text extraction ever fails, why is this not detected and automatically retried?
A purge is necessary only for files that were uploaded previously, when text extraction was not yet performed. Note that text-layer extraction for PDF files is new.
When I try to create http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1 there is a character encoding error in the OCR text. It looks as if the PDF contains 8-bit data, which is loaded into the UTF-8 form without conversion. Cut-and-paste from Acrobat Reader works fine.
Yes, there is a conversion problem with PDF; it works better with DjVu.
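That mismatch is easy to illustrate. A tiny sketch, assuming (only an assumption, not confirmed in this thread) that the PDF text layer is Latin-1: decoding with the right codec and re-encoding as UTF-8 is all the missing conversion step would need.

    raw = b"Mj\xf6lqvarn"                 # 0xF6 is "ö" in Latin-1

    try:
        raw.decode("utf-8")               # fails: 0xF6 does not start a valid UTF-8 sequence here
    except UnicodeDecodeError as err:
        print("not UTF-8:", err)

    text = raw.decode("latin-1")          # -> "Mjölqvarn"
    print(text, text.encode("utf-8"))     # b"Mj\xc3\xb6lqvarn" is what the edit form expects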
2010/4/13 ThomasV <thomasV1@gmx.de>:
Lars Aronsson wrote:
ThomasV wrote:
The problem is that DjVu pages on Commons do not have a parsable format.
Many of them are. For example, http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf contains:
{{Information |Description=Swedish patent 14: Mjölqvarn |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se] |Date=January 23, 1885 |Author=R. Setz, J. Schweiter, Clus, Switzerland |Permission= |other_versions= }}
This looks very formalized and parsable to me. I filled it in when I uploaded the file to Commons, and exactly the same fields need to be filled in manually in the newly created Index page.
Maybe I should design a tool or bot that asks for these fields once, and then uploads the file and creates the Index page, based on that information. And my question was: Has anybody already done that?
Not to my knowledge. It is possible to request the text with the API; why don't you try it?
Two problems.
1. I don't think all projects use this template, as it does not really fit books: German Wikisource, at least, has special templates for books, single pages, DjVu files, PDF files and so on. It mostly uses those for its Commons uploads rather than the generic Information template, because that template does not have the parameters needed to describe book data (author, publisher, place of publication, year of first publication, edition, year of publication of this edition, ...). And AFAIK de.WS is not the only project that uses specialized templates for its Commons uploads.
2. I'm not sure about this, but I think the Index page has the same fields in all projects that use the extension. That would mean the Information template does not contain the right data for filling in the Index page. The Index page on de.WS, at least, has separate fields for author, publisher, year of publication and place of publication, and we usually also link locally to the author and the title page. So from your example above only one parameter (the source) is really usable (and I usually link to the Commons file in the source parameter of the Wikisource Index page anyway). I'm not sure how much time the parse request would take, but it does not look worth the time (both programming it and later using it) considering its usable return values.
IMO you could create a lot of Index pages in the time you would spend figuring out whether it is possible to extract, parse and interpret the data from Commons. Too many uploads use no template at all, use other templates, or fill in this template in an unusable way (everybody has a slightly different style even when using templates), and even when it is used properly it lacks half of the needed information while the rest still needs formatting. The benefit is quite small compared to the amount of work it requires.
But hey, if you have spare time, it would still be interesting to know whether you can get the data in a way that does not slow down users with not-so-fast internet connections.
Cecil