Dear wikitech list,
I'm preparing a Spanish Wikipedia snapshot for the One Laptop per Child laptop, and to get the size down I'm trying to do some article selection using MWDumper¹. As background, here are the steps I've gone through so far using the eswiki dump:
thunk:cjb~% mysql -u eswiki -p eswiki < eswiki-20080416-pagelinks.sql
mysql> SELECT pl_namespace, pl_title, COUNT(*)
    ->   INTO OUTFILE "/tmp/incominglinks"
    ->   FROM pagelinks GROUP BY pl_namespace, pl_title;
The "incominglinks" file, after processing it to be one article per line and excluding articles with few inbound links, looks like this:
thunk:cjb~ % wc -l incominglinks.names
162974 incominglinks.names
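For reference, the processing step is roughly equivalent to the sketch below. It is not the exact script used. It assumes the default tab-separated layout that SELECT ... INTO OUTFILE writes (namespace, title, count), keeps only the main namespace (0), and uses a hypothetical cutoff called min_links; the function name select_titles is likewise only illustrative.

```python
def select_titles(lines, min_links=5, namespace=0):
    """Yield titles in `namespace` that have at least `min_links` inbound links.

    Each input line is expected to look like "namespace<TAB>title<TAB>count",
    which is the default INTO OUTFILE layout. The threshold and the namespace
    restriction are assumptions, not values taken from the original run.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not have exactly three columns
        ns, title, count = fields
        if int(ns) == namespace and int(count) >= min_links:
            yield title


def write_names(in_path, out_path, min_links=5):
    """Read the OUTFILE dump and write one selected title per line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for title in select_titles(src, min_links=min_links):
            dst.write(title + "\n")
```

Calling write_names("/tmp/incominglinks", "incominglinks.names") would then produce a file in the format handed to MWDumper.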
This is where MWDumper comes in -- I'd like to create an XML dump containing each article listed in incominglinks.names. I tried:
thunk:cjb~% java -jar mwdumper-2008-04-13.jar \
    --output=bzip2:eswiki_limited.xml.bz2 \
    --format=xml \
    --filter=list:incominglinks.names \
    eswiki-20080416-pages-articles.xml.bz2
MWDumper didn't return any errors, and ran through the whole of the .bz2 to completion. I then ran:
thunk:cjb~ % perl -nle 'print $1 if m{<title>(.*?)</title>}' \
    < eswiki_limited.xml > incominglinks.output
The resulting file is:
thunk:cjb~ % wc -l incominglinks.output
45395 incominglinks.output
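The same list of titles can also be read straight from the compressed MWDumper output, which rules out a truncated or stale decompressed copy as the source of the low count. This is a minimal sketch using only the Python standard library. The helper names iter_titles and titles_from_bz2 are hypothetical, and the regex assumes each <title> element sits on a single line, just as the Perl one-liner does. Unlike the one-liner, it also decodes XML entities such as &amp;amp; in titles.

```python
import bz2
import html
import re

# Same single-line pattern as the Perl one-liner.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(lines):
    """Yield the text of every <title> element found in `lines`."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            # Dump titles escape &, < and > as entities; undo that here.
            yield html.unescape(match.group(1))


def titles_from_bz2(path):
    """Stream titles out of a bzip2-compressed XML dump without unpacking it."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        yield from iter_titles(fh)
```

For example, sum(1 for _ in titles_from_bz2("eswiki_limited.xml.bz2")) gives a count that can be compared against the wc -l figure above.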
Can anyone think of possible reasons for the discrepancy between the number of articles requested (162974) and the number received (45395)? I've looked briefly at character sets and namespaces, and neither seems responsible. Here is an example of articles present in the input but not in the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %
Here are complete versions of the two files above:
http://dev.laptop.org/~cjb/incominglinks.names.sorted
http://dev.laptop.org/~cjb/incominglinks.output.sorted
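To narrow the gap down, a comparison along the following lines may help. It is only a diagnostic sketch and does not identify the cause. It relies on the fact that pagelinks stores titles with underscores while dump titles use spaces, and it applies Unicode NFC normalization in case accented titles such as Abandono_de_la_energía_nuclear are encoded differently in the two files. The helper names normalize, missing_titles and read_lines are hypothetical.

```python
import unicodedata


def normalize(title):
    """Map a title to a canonical form: spaces instead of underscores, NFC."""
    title = title.strip().replace("_", " ")
    return unicodedata.normalize("NFC", title)


def missing_titles(requested, received):
    """Return the sorted normalized titles that were requested but not received."""
    got = {normalize(t) for t in received}
    wanted = {normalize(t) for t in requested if t.strip()}
    return sorted(wanted - got)


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()
```

Running missing_titles(read_lines("incominglinks.names.sorted"), read_lines("incominglinks.output.sorted")) would list every requested title that is still absent after normalization. If that list is much shorter than the raw difference, the mismatch is mostly a matter of title form; if not, the titles really are being dropped.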
So, a few questions:
* Any ideas on what could cause so many articles to be dropped here?
* Is there any further output I could provide that would be useful?
* Is there a tool other than MWDumper that could do this for me?
Thanks very much for any suggestions!
- Chris.
¹: http://www.mediawiki.org/wiki/MWDumper