My way: pdf2djvu converts IA pdfs perfectly into excellent djvu files, so I download the pdf, convert it to djvu, and upload the result to Commons. The IA Upload bot will probably soon upload djvu files from IA as it did previously, simply by grabbing the pdf and converting it to djvu before uploading to Commons.
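
For anyone who wants to script the conversion step, here is a minimal sketch, assuming pdf2djvu is installed and on the PATH. The function names and file names are only illustrative; the one pdf2djvu option used is -o (output file). Downloading from IA and uploading to Commons are left out.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2djvu_command(pdf_path, djvu_path):
    """Return the argument list for a plain pdf2djvu run (-o sets the output file)."""
    return ["pdf2djvu", "-o", str(djvu_path), str(pdf_path)]


def convert_pdf_to_djvu(pdf_path, djvu_path=None):
    """Convert one pdf to djvu; by default the output sits next to the input."""
    pdf_path = Path(pdf_path)
    if djvu_path is None:
        djvu_path = pdf_path.with_suffix(".djvu")
    # Fail early with a readable message if the external tool is missing.
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu was not found on PATH")
    # check=True raises CalledProcessError if pdf2djvu exits with an error.
    subprocess.run(build_pdf2djvu_command(pdf_path, djvu_path), check=True)
    return Path(djvu_path)
```

Usage would be something like convert_pdf_to_djvu("book.pdf"), which should leave book.djvu in the same directory.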

pdftotext (one of the xpdf tools) has some problems that stem from differences in the text layer of pdf files; e.g., I found that in many pdf files soft hyphens are removed. This is a big problem when proofreading.
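
A rough way to check a given file for this, as a sketch only: run pdftotext, then count the soft hyphens (U+00AD) and the lines ending in an ordinary hyphen in its output. It assumes pdftotext is on the PATH; the function names are made up for illustration, and the counts only describe what pdftotext emits, not what the pdf text layer actually contains.

```python
import subprocess

SOFT_HYPHEN = "\u00ad"


def extract_text(pdf_path):
    """Run pdftotext and return its output; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def hyphen_report(text):
    """Count soft hyphens and lines that end with a plain '-' in extracted text."""
    lines = text.splitlines()
    return {
        "soft_hyphens": text.count(SOFT_HYPHEN),
        "line_end_hyphens": sum(1 for line in lines if line.rstrip().endswith("-")),
    }
```

If hyphen_report(extract_text("book.pdf")) shows zero soft hyphens for a scan that visibly breaks words at line ends, the file is probably affected.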

IMHO it would be a pity to emulate IA's policy and shift from djvu to pdf: djvu files are simply an "open & free pdf", and wiki loves freedom.

Alex

2016-04-14 20:29 GMT+02:00 Mpaa <mpaa.wiki@gmail.com>:

@Alex
since IA is not using djvu any longer, on en.wikisource there is demand for a script similar to djvutxt.py for pdf
(or it could be a single one for both formats ...)

On Thu, Apr 14, 2016 at 7:35 PM, John Mark Vandenberg <jayvdb@gmail.com> wrote:

On 14 Apr 2016 02:18, "Mpaa" <mpaa.wiki@gmail.com> wrote:
>
> Hi.
>
> Is there any preference for a python pdf library, in case one would like to add pdf file processing to pywikibot?

I have no preference, or experience.
PyPDF2 seems to be the most commonly used.
There are a few worrying Python 3 encoding bugs.

> Or is it good enough, if possible, to rely on pdfinfo (which I guess is Linux-only)?

For a script, using pdfinfo is definitely good enough IMO. reflinks uses pdfinfo.

There are precompiled binaries available for Windows at
http://www.foolabs.com/xpdf/download.html
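
To illustrate what relying on pdfinfo could look like in a script, here is a minimal sketch that reads the page count from its output. It assumes pdfinfo is on the PATH and prints a "Pages:" line as it normally does; the function names are hypothetical and this is not how reflinks does it.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the integer from the 'Pages:' line of pdfinfo output."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def pdf_page_count(pdf_path):
    """Run pdfinfo on a file and return its page count."""
    result = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        check=True,
        capture_output=True,
    )
    # Decode leniently: document metadata is not guaranteed to be valid UTF-8.
    return parse_page_count(result.stdout.decode("utf-8", errors="replace"))
```

Keeping the parsing separate from the subprocess call means the parser can be tested without the binary installed.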

--
John

_______________________________________________
pywikibot mailing list
pywikibot@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikibot
