On Wed, 14 Jun 2006, SJ wrote:
Note that the "Wikipedia 0.5" WikiProject on en:wp is tackling this issue with some energy, and could use more input and nominations:
http://en.wikipedia.org/wiki/Wikipedia:Version_0.5_Nominations
I have a related but orthogonal request about the process of assembling a CD or other snapshot. My interest has less to do with quality and more to do with the process. The end result I am after is either a CD or a Plucker document for PalmOS.
http://en.wikipedia.org/wiki/Wikipedia_talk:Version_0.5
I see the job as too big to be done via hand selection. I am also more interested in coverage than quality, since I figure the quality will just get better over time. So I want automated methods, both for selecting articles with good coverage and (less important at the moment) for version selection. I would also like to target a size: 128 MB, 512 MB, 600 MB, 1 GB or 4 GB. I am also interested in post-processing: stripping redlinks, including ''main article'' references on core articles, like ''History of South Africa'', etc. I want to be able to tweak parameters, then press a button and get a new CD (built from my downloaded XML dump of en and a picture collection, and possibly via a live MediaWiki snapshot of that content).
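To make the "target a size" idea concrete, here is a minimal sketch in Python of one possible selection rule: rank articles by some score and keep adding them until the size budget is used up. The score field is purely hypothetical (it could be an inbound link count or anything else that stands in for coverage), and the function name and tuple layout are illustrative assumptions, not part of any existing tool.

```python
from typing import Iterable, List, Tuple

# Each article is (title, size_in_bytes, score). The score is a hypothetical
# ranking value; how to compute it is left open here.
Article = Tuple[str, int, float]


def select_to_budget(articles: Iterable[Article], budget_bytes: int) -> List[str]:
    """Greedy selection: take the highest-scored articles first and skip
    any article that would push the running total over the budget."""
    chosen: List[str] = []
    used = 0
    for title, size, _score in sorted(articles, key=lambda a: a[2], reverse=True):
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen


if __name__ == "__main__":
    sample = [("A", 60, 0.9), ("B", 50, 0.8), ("C", 30, 0.7)]
    # With a budget of 100 bytes, B does not fit after A, but C does.
    print(select_to_budget(sample, 100))
```

A greedy pass like this is not guaranteed to fill the disc optimally, but it is simple to re-run after tweaking the budget or the scoring.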
This is what I have tried, mostly with available tools and a bit of Perl:

* Download recent XML dump.
* Download list of articles from category (currently using the WPCD template).
* Trim the full dump to the above article list (natively performed by mwdumper --exactlist).
* Import this to MySQL.
* Import (full) category dump to MySQL (SQL dump downloaded from Wikipedia).
* Use mediawiki/maintenance/dumpHTML.php to convert this to HTML.
* Perl script removes categories with fewer than four included items from the HTML dump.
* Redlink removal by un-anchoring HTML with class=new (red links), but not Categories (which always seem to appear red).
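As an illustration of the redlink step only, here is a rough Python sketch (not the Perl script mentioned above). It assumes that red links appear in the dumped HTML as anchors carrying class "new", and that category links can be recognised by the string "Category:" somewhere in the anchor's attributes; both are assumptions about the dumpHTML output that would need checking against real pages. The function name is made up for this example.

```python
import re

# Matches a whole anchor element: group 1 = attributes, group 2 = inner HTML.
ANCHOR_RE = re.compile(r"<a\b([^>]*)>(.*?)</a>", re.IGNORECASE | re.DOTALL)
# Matches a class attribute whose value contains the word "new".
CLASS_NEW_RE = re.compile(r"\bclass\s*=\s*[\"']?[^\"'>]*\bnew\b", re.IGNORECASE)


def unanchor_redlinks(html: str) -> str:
    """Replace red-link anchors by their inner text, leaving category
    links and all ordinary links untouched."""
    def repl(match: "re.Match[str]") -> str:
        attrs, inner = match.group(1), match.group(2)
        if CLASS_NEW_RE.search(attrs) and "Category:" not in attrs:
            return inner  # drop the <a ...> wrapper, keep the link text
        return match.group(0)

    return ANCHOR_RE.sub(repl, html)


if __name__ == "__main__":
    page = ('<a href="/w/Foo" class="new" title="Foo">Foo</a> '
            '<a href="/w/Bar" title="Bar">Bar</a>')
    print(unanchor_redlinks(page))
```

A regular expression is enough for a sketch like this; for messy real-world pages a proper HTML parser would be the safer choice.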
Problems I have come across:

* Templates (particularly <nowiki>{{main|History of Country}}</nowiki> and the like) do not make it through dumpHTML.php. Maybe I have to hack the PHP.
* I want to remove all the dross at the end of each page, like inter-wiki links. Could this be done by tweaking the CSS from dumpHTML?
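On the CSS question, one possibility is to append a rule that hides the unwanted blocks. The Python sketch below does only that, on a stylesheet held as a string. The selector "#p-lang" is the id the MonoBook skin uses for the interlanguage box; whether the dumpHTML output keeps that id is an assumption to verify, and the function name is hypothetical. Note that hiding with CSS leaves the markup in every page, so it would not reduce the size of the dump.

```python
from typing import Sequence


def append_hide_rule(css: str, selectors: Sequence[str]) -> str:
    """Return the stylesheet text with one extra rule that hides
    every element matched by the given selectors."""
    rule = ", ".join(selectors) + " { display: none; }"
    # Make sure the new rule starts on its own line.
    base = css if (not css or css.endswith("\n")) else css + "\n"
    return base + rule + "\n"


if __name__ == "__main__":
    # "#p-lang" is assumed to be the interlanguage box in the dumped pages.
    print(append_hide_rule("body { margin: 0; }\n", ["#p-lang"]))
```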
Cheers, Andy!