Hello folks,
I want to create PDF files from a number of Wikipedia articles, but I ran into several technical problems and wonder whether there is a better way to do it than the one I came up with. (If you want to know what this is for, see the end of this post.)
The number of articles should be about 1000. That has two consequences: 1. I have to avoid wasting bandwidth and Wikipedia server load as much as possible. 2. I have to automate the task because I cannot click through 1000 pages interactively.
To save server load and bandwidth, I considered using the database dump, but that lacks the images and the layout, right? I even downloaded the Wikipedia CD-ROM image, but discovered that it is Windows software with the data stuffed into some database, from which it is probably difficult to extract the articles and make PDFs.
My current idea is to use normal web access, since I have no other working solution. I would spread the requests over a week and only fetch at times when the normal server load is low.
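To make the spreading idea concrete, here is a minimal sketch of the schedule I have in mind. Everything specific in it is an assumption for illustration only: the function name build_schedule, the nightly window of 02:00 to 06:00 (I do not know when the servers are actually least loaded), and the start date. It only computes the fetch times; it does not download anything.

```python
from datetime import datetime, timedelta


def build_schedule(n_articles, first_night, nights=7,
                   window_start_hour=2, window_hours=4):
    """Return one fetch time per article, spread evenly over nightly windows.

    The window (02:00 to 06:00 by default) is an assumed low-load period,
    not a value published by Wikipedia.
    """
    per_night = -(-n_articles // nights)  # ceiling division
    gap = timedelta(hours=window_hours) / per_night
    schedule = []
    for i in range(n_articles):
        night, slot = divmod(i, per_night)
        window_open = first_night + timedelta(days=night,
                                              hours=window_start_hour)
        schedule.append(window_open + slot * gap)
    return schedule


# Hypothetical start date; about 1000 articles over one week.
schedule = build_schedule(1000, datetime(2005, 1, 10))
gaps = [later - earlier for earlier, later in zip(schedule, schedule[1:])]
print(len(schedule), "fetches, smallest gap:", min(gaps))
```

With these numbers that is at most 143 pages per night, or roughly one request every 100 seconds inside the window, which I hope is gentle enough.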
But html2ps is treated as a harvester by wikipedia.org and doesn't get the real pages, even with the User-Agent tweaked by a Squid proxy. When I use wget -p and then run html2ps on the local files, it fetches the images a second time for some weird reason. Worse, html2ps is apparently not capable of handling UTF-8 encoded pages.
So I have two questions for the community here: 1. What's the best way to get those 1000 articles from the servers without putting too much load on them? 2. What's a good way to automate the conversion of those pages to PDF?
Any suggestions are appreciated.
(What is all this for? I work for a software company that will demonstrate a document management system at the CeBIT fair in March 2005 in Hannover, Germany. I will give some talks there in which I demonstrate a power plant documentation system. Of course I cannot do that with the real documents from our customers because those are classified, so I intend to substitute them with Wikipedia articles about power plants, electricity and related topics. That way I only show free content, and I can also use the opportunity to demonstrate Wikipedia's quality to some business people who might not know it yet. I hope you agree this is a perfectly valid use of Wikipedia content.)
Regards