Re: [Wikisource-l] [pywikibot] pdf library - Wikisource-l - lists.wikimedia.org


Re: [Wikisource-l] [pywikibot] pdf library


John Mark Vandenberg

15 Apr 2016

12:19 a.m.

On Fri, Apr 15, 2016 at 1:29 AM, Mpaa <mpaa.wiki(a)gmail.com> wrote:

@Alex since IA is not using djvu any longer

Oh? Where can I read more about this policy change by IA?

-- John Vandenberg

Federico Leva (Nemo)

15 Apr

4:24 a.m.

New subject: [pywikibot] pdf library

John Mark Vandenberg, 15/04/2016 02:19:

Oh? Where can I read more about this policy change by IA?

https://lists.wikimedia.org/pipermail/wikisource-l/2016-March/002735.html

Nemo

Andrea Zanni

7:03 a.m.

I actually missed this. This will probably cause us some issues: for the last 2-3 years, every time I spoke with librarians I told them about the "pipeline of open digitization" including both IA and Wikisource. They would upload their folder full of JPEGs to IA, get a djvu, put the URL in the Tpt tool and upload it to Wikisource, then work on it with the community. Then they could get the brand new EPUB with another Tpt tool.

I remember Alex Brollo was working with the djvu_xml layer, but I don't know if the tool we used for uploading books from IA to WS will still work using that. We kinda need to fix it...

Aubrey

Federico Leva (Nemo)

7:41 a.m.

Andrea Zanni, 15/04/2016 09:03:

I remember Alex Brollo was working with the djvu_xml layer

The XML output from ABBYY is still being published, AFAIK.

Nemo

Andrea Zanni

8:01 a.m.

Yes, this is why I cited it: if we can manage to use it for Wikisource importing, we could be safe :-)

Aubrey

Alex Brollo

6:29 p.m.

Again, just to explain: the pdf2djvu output of an IA pdf is a perfect djvu, with its regular mapped OCR layer, so nothing changes except the need to run a very simple command:

pdf2djvu namefile.pdf -o namefile.djvu

Alex
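For anyone who wants to script the command Alex quotes, here is a minimal Python sketch. It only wraps the exact invocation from the message (`pdf2djvu namefile.pdf -o namefile.djvu`); the function names `pdf2djvu_command` and `convert` are made up for illustration, and it assumes pdf2djvu is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2djvu_command(pdf_path: str) -> list[str]:
    """Build the call from the message: pdf2djvu NAME.pdf -o NAME.djvu."""
    pdf = Path(pdf_path)
    return ["pdf2djvu", str(pdf), "-o", str(pdf.with_suffix(".djvu"))]


def convert(pdf_path: str) -> Path:
    """Run pdf2djvu on one file and return the path of the DjVu it wrote."""
    # Assumption: the pdf2djvu binary is available on this machine.
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu is not installed or not on PATH")
    cmd = pdf2djvu_command(pdf_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

Calling `convert("namefile.pdf")` would write `namefile.djvu` next to the input. Whether the OCR layer survives the conversion depends on the PDF itself, as discussed in the thread.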

Andrea Zanni

18 Apr

8:51 a.m.

I think that the crucial issue here is: will the ia-upload tool run? https://tools.wmflabs.org/ia-upload/commons/init

Aubrey

Alex Brollo

1:12 p.m.

Can someone "ping" Phe & Tpt into this talk?

Alex

Andrea Zanni

12 May

5:38 p.m.

Hi everyone, please let me revive this thread. There is an ongoing discussion on it.source about the new Internet Archive policy, because this is becoming a *quality problem* for the community. You can see for yourself here:

this is a detail[1] from a pdf[2] taken from Archive
this is the detail[3] from a djvu (handmade by the user Alex)

Please look at the pictures to understand the problem :-) The compression of the IA pdf is unfortunately too high, and the OCR is not that good either. We probably can't ask IA to change its mind and redo djvus; there are other, more technical ways. But I'd like this to be a problem we solve together, maybe directly in the magnificent "IA Upload" tool. Wikisource prides itself on quality, so it's right to demand good scans. What I fear is that bigger communities will have expert users who make their own djvus, while smaller ones will have to keep the IA-uploaded PDFs...

Do you have any solutions? Is your community worried about this?

Thanks
Aubrey

[1] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png
[2] https://commons.wikimedia.org/w/index.php?title=File%3ATarchetti_-_Paolina.…
[3] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png

Federico Leva (Nemo)

6:10 p.m.

Andrea Zanni, 12/05/2016 19:38:

[1] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png
[2] https://commons.wikimedia.org/w/index.php?title=File%3ATarchetti_-_Paolina.…
[3] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png

That was meant to be https://it.wikisource.org/wiki/File:Tarchetti_alex_djvu.png

I don't think this has anything to do with DjVu or PDF; the problem is very clear just by looking at https://archive.org/download/digitami_LO10534041 : the JP2 conversion compressed the images 30 times, the PDF compression 5 more times. The first step in such cases, as documented in https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive , is to add/increase the fixed-ppi field. I don't understand what was used in https://catalogd.archive.org/log/487271468

Nemo
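To make the figures in Nemo's message concrete, here is a small worked example in Python. It assumes the two stages multiply (the PDF step shrinking the already compressed JP2 by a further factor of 5), which is one reading of "5 more times". The 15 MB page size is a purely hypothetical number, not taken from the item.

```python
def combined_ratio(*stage_ratios: float) -> float:
    """Overall compression ratio when each stage compresses the previous output."""
    total = 1.0
    for ratio in stage_ratios:
        total *= ratio
    return total


def compressed_size(original_bytes: float, *stage_ratios: float) -> float:
    """Size left after applying every compression stage in turn."""
    return original_bytes / combined_ratio(*stage_ratios)


# Ratios quoted in the message: 30x for the JP2 step, 5x more for the PDF step.
overall = combined_ratio(30, 5)

# Hypothetical uncompressed page of 15 MB, for illustration only.
page_in_pdf = compressed_size(15_000_000, 30, 5)
```

Under that assumption, a page in the PDF keeps roughly 1/150 of the original data.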

Alex Brollo

13 May

7:02 a.m.

Nemo, try doing an "autopsy" of the cited IA pdf with pdfimages (xpdf), which extracts the raw images embedded in the pdf pages. You'll find that pages are exotically segmented into a full-color background, a strange image, and an inverted version of a thresholded image (used as a mask, I presume). Just by negating the last one, you can get a decent, light BW image of the page. I could build a decent BW djvu from it: https://it.wikisource.org/wiki/File:Paolina.djvu , but it.source users didn't like the idea: https://it.wikisource.org/wiki/Wikisource:Bar#Pensiero_in_libert.C3.A0_sull…

I presume that this complex structure is somewhat similar to the background/foreground segmentation in djvu files, and that the artifacts are similar. So, pdf images are not only "compressed", but deeply processed and segmented images.

Anyway: the IA image viewer doesn't use the pdf (nor the djvu) at all, but uses jpg derived from the jp2 files; so, if you need a djvu that matches, in detail, what you see in the IA viewer, you have to download and process the jp2 images to build a decent djvu file.

Is any of this complex IA image processing path documented anywhere? I reached my conclusions simply by "try and learn" from IA file "necropsy".

Alex
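The "negate the mask" step Alex describes can be sketched in pure Python. This assumes the thresholded layer was extracted by pdfimages as a raw PBM (P4) bitmap, which may not hold for every IA pdf. The function name `invert_pbm` is hypothetical and not part of any tool mentioned in the thread.

```python
def invert_pbm(data: bytes) -> bytes:
    """Return a raw PBM (P4) bitmap with black and white pixels swapped."""
    if not data.startswith(b"P4"):
        raise ValueError("not a raw PBM (P4) file")
    pos = 2
    fields = []
    # Parse width and height, skipping whitespace and '#' comment lines.
    while len(fields) < 2:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("malformed PBM header")
        fields.append(int(data[start:pos]))
    pos += 1  # exactly one whitespace byte separates header and pixel data
    width, height = fields
    row_bytes = (width + 7) // 8
    body = data[pos:pos + row_bytes * height]
    if len(body) != row_bytes * height:
        raise ValueError("truncated PBM pixel data")
    # Keep the unused padding bits at the end of each row at zero.
    pad_mask = (0xFF << (row_bytes * 8 - width)) & 0xFF
    rows = []
    for r in range(height):
        row = bytearray(b ^ 0xFF for b in body[r * row_bytes:(r + 1) * row_bytes])
        if row:
            row[-1] &= pad_mask
        rows.append(bytes(row))
    return data[:pos] + b"".join(rows)
```

Typical use would be to read one extracted mask file in binary mode, pass its bytes through `invert_pbm`, and write the result to a new `.pbm` file before building the BW djvu from it.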

Federico Leva (Nemo)

8:06 a.m.

Alex Brollo, 13/05/2016 09:02:

I presume that this complex structure is somewhat similar to the background/foreground segmentation in djvu files, and that the artifacts are similar.

Sure.

So, pdf images are not only "compressed", but deeply processed and segmented images.

...which is what I call "compression". I still recommend trying to increase the fixed-ppi parameter in such a case of excessive compression. I also still need an answer to https://it.wikisource.org/?diff=1733473

Is any of this complex IA image processing path documented anywhere?

What do you mean? Are you asking about details of their derivation plan for books? What we know has been summarised over time at https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive , as always. As the help page states, IIRC, the best way to understand what's going on is to check the item history and read the derive.php log, like https://catalogd.archive.org/log/487271468 which I linked.

The main difference compared to the past is, I think, that they're no longer creating the luratech b/w PDF, probably because the "normal" PDF now manages to compress enough. They may not have realised that the single PDF they now produce is too compressed for illustrations and for cases where the original JP2 is too small.

Nemo

Alex Brollo

9:06 a.m.

Simply, from a practical point of view, my suggestion is: don't try to get a good djvu from the IA pdf; use the _jp2.zip images instead (after conversion to jpg the images are very good), and the result will be much better, almost as good as the images in the IA viewer, which uses the same images.

Alex
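As a rough illustration of Alex's suggestion, the sketch below looks up the `_jp2.zip` file of an item through the Internet Archive metadata endpoint (`https://archive.org/metadata/<identifier>`) and builds its download URL. The function names are hypothetical. The assumption that the page images sit in a file whose name ends in `_jp2.zip` follows the message, but not every item necessarily has one.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_metadata(identifier: str) -> dict:
    """Download the item's metadata record as a dict (needs network access)."""
    with urlopen(f"https://archive.org/metadata/{quote(identifier)}") as resp:
        return json.load(resp)


def jp2_zip_urls(identifier: str, metadata: dict) -> list[str]:
    """Download URLs of every file in the item whose name ends in _jp2.zip."""
    names = [
        entry["name"]
        for entry in metadata.get("files", [])
        if entry.get("name", "").endswith("_jp2.zip")
    ]
    base = f"https://archive.org/download/{quote(identifier)}"
    return [f"{base}/{quote(name)}" for name in names]
```

For the item discussed in this thread one would call `jp2_zip_urls("digitami_LO10534041", fetch_metadata("digitami_LO10534041"))`. An empty list means the item has no such archive.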

Federico Leva (Nemo)

9:20 a.m.

Alex Brollo, 13/05/2016 11:06:

Simply, from a practical point of view, my suggestion is: don't try to get a good djvu from the IA pdf; use the _jp2.zip images instead (after conversion to jpg the images are very good), and the result will be much better, almost as good as the images in the IA viewer, which uses the same images.

In my experience, when there are problems, usually the JP2 images are either too little compressed or too compressed. This has precise reasons and no trivial solution: http://www.digitizationguidelines.gov/still-image/documents/JP2LossyCompres…

Nemo

Alex Brollo

9:59 a.m.

You may be right: my tests so far have been done on one book only. As soon as a python tool to get a djvu from _jp2 runs with no human effort, I'll try it on lots of books to get some "general rule". But can you confirm that the IA viewer shows jpg images coming from the jp2-jpg folder?

Another problem when using the original IA pdf (again, I tested it on one book only: see https://it.wikisource.org/wiki/Indice:Tarchetti_-_Paolina.pdf ) is that the OCR text retrieved by the mediawiki software is horrible in structure; please try to create any page of that Index. With pdftotext (xpdf) too, results are far from good.

Alex
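The thread does not say how the python tool Alex mentions would work. As one possible shape, the sketch below prepares DjVuLibre commands for pages already converted from jp2 to jpg: `c44` encodes each page, and `djvm -c` bundles the pages into one file. The function names, the `out` directory and the default of 300 dpi are assumptions for illustration. The result would be an image-only djvu with no OCR text layer.

```python
from pathlib import Path


def page_commands(jpg_pages: list[str], out_dir: str, dpi: int = 300) -> list[list[str]]:
    """One c44 call per page image, writing <page name>.djvu into out_dir."""
    out = Path(out_dir)
    return [
        ["c44", "-dpi", str(dpi), page, str(out / (Path(page).stem + ".djvu"))]
        for page in jpg_pages
    ]


def bundle_command(page_djvus: list[str], bundled: str) -> list[str]:
    """djvm -c call that creates one multi-page document from single pages."""
    return ["djvm", "-c", bundled, *page_djvus]


# Hypothetical page names; each command list can be passed to subprocess.run.
pages = ["scans/0001.jpg", "scans/0002.jpg"]
encode = page_commands(pages, "out")
bundle = bundle_command([cmd[-1] for cmd in encode], "book.djvu")
```

Building the command lists separately from running them makes it easy to inspect them first, or to run the per-page encodes in parallel.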


14 comments

4 participants


Alex Brollo
Andrea Zanni
Federico Leva (Nemo)
John Mark Vandenberg