Make PDFs from articles - Wikitech-l

8 Feb 2005


      Hello folks,
I want to make PDF files from a number of wikipedia articles, but
encountered several technical problems and wondered if there's a better
way how to do that than I came about. (If you want to know what this is
for, look at the end of this posting.)
The number of articles should be about 1000. That has some consequences:
1. I have to avoid wasting bandwidth and wikipedia server load as
   much as possible.
2. I have to automate the task because I cannot click 1000 pages
   interactively.
To save server load and bandwidth, I considered using the database dump,
but that lacks the images and the layout, right? I even downloaded the
wikipedia CDROM image, but discovered that's a Windows software with
data stuffed into some database where it's probably difficult to
retrieve and make PDFs from.
My current idea is to use the normal web access since I have no other
working solution. I would spread the accesses over a week and only use
times where normal server load is low.
But: html2ps is regarded as a harvester by wikipedia.org and doesn't get
the real pages, even with User-Agent tweaked by a squid. When I use wget
-p and html2ps afterwards on the local files, it fetches the images
twice for some weird reason. And worse, html2ps is obviously not capable
of dealing with utf-8 encoded pages.
So, I have two questions to the community here:
1. What's the best way to get those 1000 articles from the servers
   without putting too much load on them?
2. What's a good way to automate the conversion to PDF of those pages?
Any suggestions appreciated.
(What is all this for? I work for a software company that will show a
demonstration of a document management system on the CeBIT fair in March
2005 in Hannover, Germany. I will hold some lectures there where I will
demonstrate a power plant documentation system. Of couse I cannot do
that with the actual real documents from our customers because those are
classified, so I intend to substitute them with wikipedia articles about
power plants, electricity and related topics. Thus, I only show free
content, and I can also use the opportunity to demonstrate wikipedia
quality to some business people who might not know it yet. I hope you
agree this is a perfectly valid intention of using wikipedia content.)
Regards
-- 
Maik Musall maik@musall.de
GPG public key 0x856861EB (keyserver: wwwkeys.de.pgp.net)