Re: [Wikisource-l] [pywikibot] pdf library

13 May 2016

Nemo, try an "autopsy" of the cited IA PDF with pdfimages (xpdf), which
extracts the raw images embedded in its pages. You'll find that each page is
exotically segmented into a full-color background, a strange image, and an
inverted copy of a thresholded image (used as a mask, I presume). Just by
negating the last one you get a decent, light BW image of the page. From it
I could build a decent BW djvu file:
https://it.wikisource.org/wiki/File:Paolina.djvu , but it.source users
didn't like the idea:
https://it.wikisource.org/wiki/Wikisource:Bar#Pensiero_in_libert.C3.A0_sull…
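
The "negate the mask" step can be sketched in a few lines of Python. This is only a sketch under assumptions, not the exact procedure used for Paolina.djvu: it assumes pdfimages was run without options, so that the 1-bit mask comes out as a raw PBM (P4) file, and the helper names pdfimages_command and invert_pbm are invented for illustration.

```python
def pdfimages_command(pdf_path, image_root):
    """Argument list for: pdfimages PDF-file image-root (no options)."""
    return ["pdfimages", pdf_path, image_root]


def invert_pbm(data):
    """Return a raw PBM (P4) image with every pixel negated."""
    # Header: magic, width, height, separated by whitespace; '#' starts a comment.
    tokens, pos = [], 0
    while len(tokens) < 3:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    if tokens[0] != b"P4":
        raise ValueError("not a raw PBM (P4) file")
    width, height = int(tokens[1]), int(tokens[2])
    if width <= 0 or height <= 0:
        raise ValueError("invalid image size")
    pos += 1  # exactly one whitespace byte separates the header from the pixels

    row_bytes = (width + 7) // 8  # each row is padded to a whole byte
    body = data[pos:pos + row_bytes * height]
    if len(body) != row_bytes * height:
        raise ValueError("truncated pixel data")

    out = bytearray(b ^ 0xFF for b in body)  # flip every bit
    # Keep the unused padding bits at the end of each row at zero.
    pad_mask = (0xFF << (row_bytes * 8 - width)) & 0xFF
    for row in range(height):
        out[row * row_bytes + row_bytes - 1] &= pad_mask
    return b"P4\n%d %d\n" % (width, height) + bytes(out)
```

To use it, run the command from pdfimages_command (for example with subprocess.run), read the .pbm file that holds the mask, and write the result of invert_pbm to a new file. Which of the extracted files is the mask has to be checked by eye.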

I presume that this complex structure is somewhat similar to the
background/foreground segmentation of djvu files, and that the artifacts are
similar too.

So, the PDF images are not merely "compressed": they are heavily processed
and segmented.

Anyway, the IA image viewer doesn't use the PDF (or the djvu) at all; it
uses jpg images derived from the jp2 files. So if you need a djvu that
matches, in detail, what you see in the IA viewer, you have to download and
process the jp2 images to build a decent djvu file.
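
One possible shape of that jp2-to-djvu path, as a hedged sketch only. The tools are my assumption and are not named anywhere in this thread: opj_decompress (OpenJPEG) to decode each jp2 to PPM, then c44 and djvm -c (DjVuLibre) to encode and bundle the pages. It also assumes color scans already unpacked from the IA jp2 archive, and the helper name jp2_to_djvu_commands is invented. The function only builds the command lines and runs nothing.

```python
def jp2_to_djvu_commands(jp2_paths, bundle_path):
    """Build (but do not run) the commands for a jp2 -> ppm -> djvu pipeline."""
    commands = []
    page_files = []
    for jp2 in jp2_paths:
        # Strip a trailing ".jp2" to derive the intermediate file names.
        stem = jp2[:-4] if jp2.lower().endswith(".jp2") else jp2
        ppm, page = stem + ".ppm", stem + ".djvu"
        # Assumed tool: OpenJPEG decoder, JPEG 2000 -> PPM.
        commands.append(["opj_decompress", "-i", jp2, "-o", ppm])
        # Assumed tool: DjVuLibre wavelet encoder, PPM -> single-page djvu.
        commands.append(["c44", ppm, page])
        page_files.append(page)
    # Assumed tool: DjVuLibre djvm, bundling all pages into one document.
    commands.append(["djvm", "-c", bundle_path] + page_files)
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order. c44 gives a photo-style (IW44) page; a bitonal djvu like the one above would need a different encoder.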

Is any of this complex IA image-processing path documented anywhere?
I reached my conclusions simply by "try and learn", from a "necropsy" of
IA files.

Alex

2016-05-12 20:10 GMT+02:00 Federico Leva (Nemo) <nemowiki(a)gmail.com>:

...
  Andrea Zanni, 12/05/2016 19:38:

  [1] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png
  [2] https://commons.wikimedia.org/w/index.php?title=File%3ATarchetti_-_Paolina.…
  [3] https://it.wikisource.org/wiki/File:Tarchetti_pdf.png

 That was meant to be
 https://it.wikisource.org/wiki/File:Tarchetti_alex_djvu.png

 I don't think this has anything to do with DjVu or PDF, the problem is
 very clear just by looking at
 https://archive.org/download/digitami_LO10534041 : the JP2 conversion
 compressed the images 30 times, the PDF compression 5 more times.

 The first step in such cases, as documented in
 https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive , is
 to add/increase the fixed-ppi field. I don't understand what was used in
 https://catalogd.archive.org/log/487271468

 Nemo

 _______________________________________________
 Wikisource-l mailing list
 Wikisource-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikisource-l
