On Thu, 2006-06-15 at 11:47 +0200, Andy Rabagliati wrote:
On Wed, 14 Jun 2006, SJ wrote:
I have a related but orthogonal request, regarding the process of assembling a CD or other snapshot. My interest is less to do with quality, and more to do with the process. My end result is either a CD, or a Plucker document for PalmOS.
http://en.wikipedia.org/wiki/Wikipedia_talk:Version_0.5
I see the job as too big to be done via hand selection.
I agree, kinda. Too big for maybe one person, but we have a lot of hands.
I am also more interested in coverage than quality - I figure the quality will just get better.
Yes, I am too. For 0.5 and 1.0, both are important though.
So, I want automated methods, both for selecting good coverage, and (less important at the moment) version selection. I also would like to target a size - 128Meg, 512Meg, 600Meg, 1Gig, 4Gig.
I really like this. Thumb drive? Check. CD? Check. DVD? Check. My laptop? Check.
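A minimal sketch of what "target a size" could look like, assuming each article has a known rendered size and some coverage score. The thread does not say how coverage would be measured, so the `score` field (for example, an inbound link count) and the function name `select_for_budget` are hypothetical. This is a simple greedy pass, not an optimal packing.

```python
def select_for_budget(articles, budget_bytes):
    """Greedily pick articles, highest score first, until the budget is full.

    articles: iterable of (title, size_bytes, score) tuples.
    The score is a hypothetical coverage measure, not something the thread defines.
    Returns (list of chosen titles, total bytes used).
    """
    chosen = []
    used = 0
    # Visit the highest-scoring articles first.
    for title, size, score in sorted(articles, key=lambda a: a[2], reverse=True):
        # Skip anything that would overflow the target size; smaller
        # articles further down the list may still fit.
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen, used


if __name__ == "__main__":
    sample = [("South Africa", 50, 3), ("History of South Africa", 70, 2), ("Cape Town", 40, 1)]
    print(select_for_budget(sample, 100))
```

Running the same selection with a different `budget_bytes` (128 MB, 600 MB, 4 GB) would give one article list per target medium.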
I am also interested in post-processing - stripping redlinks, including "main article" references on core articles, like "History of South Africa" etc. I want to be able to tweak parameters, then press a button and get a new CD (from my downloaded XML dump of en and a picture collection, and possibly via a live MediaWiki snapshot of that content).
Live? How did you want to do this? Perhaps using the toolserver?
This is what I have tried, mostly with available tools, and a bit of perl.
- Download recent XML dump.
- Download list of articles from category (currently using the WPCD template)
- Trim the full dump to the above article list (mwdumper can do this natively with its exactlist filter)
- Import this into MySQL
- Import the (full) category dump into MySQL (SQL dump downloaded from Wikipedia)
- Use mediawiki/maintenance/dumpHTML.php to convert this to HTML
- A Perl script removes categories with fewer than four included items from the HTML dump
- redlink removal by un-anchoring HTML with class=new (red links) - but not Categories (that always seem to appear red)
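The redlink step could be sketched roughly as below. This is not the Perl script mentioned above, only a Python illustration of the same idea: anchors carrying class "new" are replaced by their inner text, except those whose title starts with "Category:". The exact attributes that dumpHTML.php emits are an assumption here, and the function name `strip_redlinks` is hypothetical. A regex is used for brevity; a real HTML parser would be more robust.

```python
import re

# Matches one anchor element: group 1 is the attribute text, group 2 the contents.
ANCHOR_RE = re.compile(r'<a\b([^>]*)>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# True when the class attribute contains "new" as a whole class name.
CLASS_NEW_RE = re.compile(r'\bclass\s*=\s*"(?:[^"]*\s)?new(?:\s[^"]*)?"', re.IGNORECASE)
# Assumption: category links can be recognised by a title starting with "Category:".
CATEGORY_RE = re.compile(r'\btitle\s*=\s*"Category:', re.IGNORECASE)


def strip_redlinks(html):
    """Un-anchor red links (class "new") but leave category links alone."""
    def replace(match):
        attrs, inner = match.group(1), match.group(2)
        if CLASS_NEW_RE.search(attrs) and not CATEGORY_RE.search(attrs):
            return inner          # keep the link text, drop the <a> wrapper
        return match.group(0)     # leave every other anchor unchanged
    return ANCHOR_RE.sub(replace, html)


if __name__ == "__main__":
    page = ('<p><a href="/x" class="new" title="Foo">Foo</a> and '
            '<a href="/y" title="Bar">Bar</a></p>')
    print(strip_redlinks(page))
```

Applied to each file in the HTML dump, this leaves the visible text intact and only removes dead links.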
Problems I have come across:
- Templates (particularly {{main|History of Country}} and the like) do not make it through dumpHTML.php. Maybe I have to hack the PHP.
- Remove all the dross at the end, like inter-wiki links.
You don't want those? Why not? (assuming they reflect a link within the dump)
Could this be done by tweaking the CSS from dumpHTML?
I don't know, but I would like to reproduce your steps. Can you show me your perl and any other special things you are using?
Kyle