Dear wikitech list,
I'm preparing a Spanish Wikipedia snapshot for the One Laptop per Child
laptop, and to get the size down I'm trying to do some article selection
using MWDumper¹. As background, here are the steps I've gone through
so far using the eswiki dump:
thunk:cjb~ % mysql -u eswiki -p eswiki < eswiki-20080416-pagelinks.sql
mysql> SELECT pl_namespace, pl_title, COUNT(*) INTO OUTFILE
"/tmp/incominglinks" FROM pagelinks GROUP BY pl_namespace, pl_title;
The "incominglinks" file, after processing it down to one article title
per line and excluding articles with few inbound links, has this many
entries:
thunk:cjb~ % wc -l incominglinks.names
162974 incominglinks.names
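For anyone wanting to reproduce the filtering step, a minimal Python sketch might look like the following. It is not the exact processing used above: the threshold `min_links` and the restriction to namespace 0 are illustrative assumptions, as is the function name `select_titles`. It relies on the MySQL default for INTO OUTFILE, which is one row per line with tab-separated fields.

```python
def select_titles(lines, min_links=5, namespace="0"):
    """Return titles in `namespace` that have at least `min_links` inbound links.

    `lines` is any iterable of rows shaped like "namespace<TAB>title<TAB>count".
    Both `min_links` and `namespace` are hypothetical defaults for illustration.
    """
    titles = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not have exactly three fields
        ns, title, count = fields
        if ns == namespace and int(count) >= min_links:
            titles.append(title)
    return titles


# Small in-memory demonstration with made-up rows.
sample = [
    "0\tArma_nuclear\t12\n",
    "0\tArticulo_raro\t1\n",
    "1\tArma_nuclear\t50\n",
]
print(select_titles(sample))
```

To use it on real data, pass an open handle on /tmp/incominglinks as `lines` and write each returned title on its own line to incominglinks.names.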
This is where MWDumper comes in -- I'd like to create an XML dump of
each article in the incominglinks.names. I tried:
thunk:cjb~ % java -jar mwdumper-2008-04-13.jar \
--output=bzip2:eswiki_limited.xml.bz2 \
--format=xml \
--filter=list:incominglinks.names \
eswiki-20080416-pages-articles.xml.bz2
MWDumper didn't return any errors, and ran through the whole of the
.bz2 to completion. I then ran:
thunk:cjb~ % perl -nle 'print $1 if /<title>(.*?)<\/title>/' \
< eswiki_limited.xml > incominglinks.output
The resulting file is:
thunk:cjb~ % wc -l incominglinks.output
45395 incominglinks.output
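The same title extraction can be sketched in Python, reading the compressed output directly. This is only an illustration of the Perl one-liner's logic, with one difference: it also decodes XML entities such as &amp;amp; in titles, which the one-liner leaves as-is. The names `iter_titles` and `titles_from_dump` are made up for this example, and it assumes each title element sits on a single line, as the one-liner does.

```python
import bz2
import html
import re

# Non-greedy match, mirroring the Perl pattern /<title>(.*?)<\/title>/.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(lines):
    """Yield the text of the first <title> element found on each line."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            # Decode entities such as &amp; so titles compare as plain text.
            yield html.unescape(match.group(1))


def titles_from_dump(path):
    """Stream titles out of a bzip2-compressed XML dump at `path`."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        yield from iter_titles(dump)
```

Counting the items yielded by titles_from_dump("eswiki_limited.xml.bz2") should correspond to the wc -l figure above, provided the assumptions hold.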
Can anyone think of possible reasons for the discrepancy between the
number of articles requested (162974) and received (45395)? I've looked
briefly at character set and namespaces, and neither seems responsible.
Here is an example of articles present in the input but not in the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %
Here are complete versions of the two files above:
http://dev.laptop.org/~cjb/incominglinks.names.sorted
http://dev.laptop.org/~cjb/incominglinks.output.sorted
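To list every requested title that is absent from the output, rather than spot-checking with grep, a set difference over the two files is enough. The sketch below is a hedged illustration with made-up names (`normalize`, `missing_titles`). It rewrites underscores as spaces before comparing, on the assumption that titles from the pagelinks table use underscores where the XML dump's title elements use spaces; whether that affects MWDumper's list filter is not something this sketch establishes.

```python
def normalize(title):
    """Strip surrounding whitespace and write underscores as spaces."""
    return title.strip().replace("_", " ")


def missing_titles(requested, received):
    """Return the sorted requested titles that never appear in `received`.

    Both arguments are iterables of title strings, e.g. open file handles
    with one title per line. Blank lines are ignored.
    """
    got = {normalize(t) for t in received if t.strip()}
    wanted = {normalize(t) for t in requested if t.strip()}
    return sorted(wanted - got)


# Made-up example: one of three requested titles is absent from the output.
print(missing_titles(
    ["Arma_nuclear\n", "Bomba_nuclear\n", "Madrid\n"],
    ["Madrid\n", "Arma nuclear\n"],
))
```

Passing open handles on incominglinks.names.sorted and incominglinks.output.sorted would give the full list of dropped titles.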
So, a few questions:
* Any ideas on what could cause so many articles to be dropped here?
* Is there any further output I could provide that would be useful?
* Is there a tool other than MWDumper that could do this for me?
Thanks very much for any suggestions!
- Chris.
¹: http://www.mediawiki.org/wiki/MWDumper
--
Chris Ball <cjb(a)laptop.org>