Hello folks,
I want to make PDF files from a number of Wikipedia articles, but I encountered several technical problems and wondered if there's a better way to do it than the one I came up with. (If you want to know what this is for, look at the end of this posting.)
The number of articles will be about 1000. That has some consequences: 1. I have to avoid wasting bandwidth and adding to Wikipedia's server load as much as possible. 2. I have to automate the task because I cannot click through 1000 pages interactively.
To save server load and bandwidth, I considered using the database dump, but that lacks the images and the layout, right? I even downloaded the Wikipedia CD-ROM image, but discovered that it's Windows software with the data stuffed into some database from which it's probably difficult to extract articles and make PDFs.
My current idea is to use the normal web access since I have no other working solution. I would spread the accesses over a week and only use times when normal server load is low.
But: html2ps is regarded as a harvester by wikipedia.org and doesn't get the real pages, even with the User-Agent tweaked via a Squid proxy. When I use wget -p and then run html2ps on the local files, it fetches the images twice for some weird reason. And worse, html2ps is obviously not capable of dealing with UTF-8 encoded pages.
So, I have two questions for the community here: 1. What's the best way to get those 1000 articles from the servers without putting too much load on them? 2. What's a good way to automate converting those pages to PDF?
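To make the throttling idea from above concrete, here is a rough, untested Java sketch of what I have in mind for question 1: read a list of titles, fetch each rendered page once with a long pause in between, and skip anything already saved locally so nothing is downloaded twice. The title list file, the delay, the output directory and the User-Agent string are just placeholders.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ThrottledFetcher {

    private static final long DELAY_MS = 30_000; // one request every 30 seconds, adjust as needed

    public static void main(String[] args) throws Exception {
        List<String> titles = Files.readAllLines(Paths.get("titles.txt")); // one article title per line
        Path outDir = Paths.get("mirror");
        Files.createDirectories(outDir);

        for (String title : titles) {
            Path target = outDir.resolve(title.replace('/', '_') + ".html");
            if (Files.exists(target)) {
                continue; // already mirrored earlier, never download anything twice
            }
            URL url = new URL("http://en.wikipedia.org/wiki/"
                    + URLEncoder.encode(title.replace(' ', '_'), "UTF-8"));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Identify the client honestly instead of faking a browser User-Agent.
            conn.setRequestProperty("User-Agent", "pdf-mirror-test/0.1 (contact: you@example.org)");
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, target);
            }
            Thread.sleep(DELAY_MS); // spread the ~1000 requests over a long period
        }
    }
}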
Any suggestions appreciated.
(What is all this for? I work for a software company that will show a demonstration of a document management system at the CeBIT trade fair in March 2005 in Hannover, Germany. I will give some talks there in which I will demonstrate a power plant documentation system. Of course I cannot do that with the actual documents from our customers, because those are classified, so I intend to substitute them with Wikipedia articles about power plants, electricity and related topics. That way I only show free content, and I can also use the opportunity to demonstrate Wikipedia's quality to some business people who might not know it yet. I hope you agree this is a perfectly valid use of Wikipedia content.)
Regards
On Tue, 8 Feb 2005, Maik Musall wrote:
Hello folks,
I want to make PDF files from a number of Wikipedia articles, but I encountered several technical problems and wondered if there's a better way to do it than the one I came up with. (If you want to know what this is for, look at the end of this posting.)
http://cvs.sourceforge.net/viewcvs.py/wikipdf/
Maybe a good start...
On Tue, 8 Feb 2005 23:13:12 +0100, Maik Musall <lists@musall.de> wrote:
- I have to avoid wasting bandwidth and adding to Wikipedia's server load as much as possible.
To save server load and bandwidth, I considered using the database dump, but that lacks the images and the layout, right? I even downloaded the Wikipedia CD-ROM image, but discovered that it's Windows software with the data stuffed into some database from which it's probably difficult to extract articles and make PDFs.
My current idea is to use the normal web access since I have no other working solution. I would spread the accesses over a week and only use times when normal server load is low.
- What's the best way to get those 1000 articles from the servers without putting too much load on them?
Do you know about Special:Export? http://en.wikipedia.org/wiki/Special:Export/Electricity
All you need to do then is to download the images.
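For example, a minimal, untested sketch of that approach could fetch the export XML for one title and scan the wikitext for image references to see which images still need downloading. The User-Agent string is a placeholder, and the [[Image:...]] pattern is only a rough illustration of how the links look in the wikitext.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExportTest {

    public static void main(String[] args) throws Exception {
        String title = "Electricity"; // the example article from the link above

        // Special:Export returns an XML document containing the raw wikitext.
        URL url = new URL("http://en.wikipedia.org/wiki/Special:Export/"
                + URLEncoder.encode(title.replace(' ', '_'), "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "export-test/0.1"); // placeholder

        StringBuilder xml = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                xml.append(line).append('\n');
            }
        }

        // Collect image names referenced in the wikitext; these still have to be
        // downloaded separately, as noted above.
        Matcher m = Pattern.compile("\\[\\[Image:([^|\\]]+)").matcher(xml);
        while (m.find()) {
            System.out.println("needs image: " + m.group(1).trim());
        }
    }
}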
Find out how the WikiReaders were made: http://en.wikipedia.org/wiki/Wikipedia:WikiReader
Maik Musall wrote:
quality to some business people who might not know it yet. I hope you agree this is a perfectly valid use of Wikipedia content.)
Yes, absolutely.
Can you release whatever code/solution you come up with for this problem?
I think there is huge demand for a nice, quick, non-abusive-to-the-servers way for people to make PDFs out of 1000 articles.
--Jimbo
On Wed, Feb 09, 2005 at 07:36:02AM -0800, Jimmy (Jimbo) Wales wrote:
Maik Musall wrote:
quality to some business people who might not know it yet. I hope you agree this is a perfectly valid use of Wikipedia content.)
Yes, absolutely.
Can you release whatever code/solution you come up with for this problem?
Yes, naturally. I was pointed to some existing entry points that give me a good start. I'll keep you updated if I happen to add something new.
Regards
Maik Musall wrote:
Yes, naturally. I was pointed to some existing entry points that give me a good start. I'll keep you updated if I happen to add something new.
Regards
I made some simple tests with the iText engine: http://www.lowagie.com/iText/ (see also http://www.plog4u.org/index.php/Generate_PDF_with_iText). I took the HTML output from the Plog4U Eclipse Wikipedia Editor preview and transformed it to PDF. It seems to me that it's possible to integrate a PDF exporter into the plugin.
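For reference, a bare-bones conversion with the lowagie iText API might look something like the sketch below. It assumes the HTMLWorker/StyleSheet simple-HTML parser is available in the iText version you use, reads the page explicitly as UTF-8 (the encoding issue that bit html2ps), and leaves out all error handling and styling.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.List;

import com.lowagie.text.Document;
import com.lowagie.text.Element;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.html.simpleparser.StyleSheet;
import com.lowagie.text.pdf.PdfWriter;

public class HtmlToPdfTest {

    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("article.pdf"));
        document.open();

        // Read the HTML as UTF-8 explicitly.
        try (Reader html = new InputStreamReader(new FileInputStream("article.html"), "UTF-8")) {
            // HTMLWorker only understands fairly simple HTML, so the wiki output
            // may need some cleanup first.
            List elements = HTMLWorker.parseToList(html, new StyleSheet());
            for (Object e : elements) {
                document.add((Element) e);
            }
        }
        document.close();
    }
}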