Dear wikitech list,
I'm preparing a Spanish Wikipedia snapshot for the One Laptop per Child laptop, and to get the size down I'm trying to do some article selection using MWDumper¹. As background, here are the steps I've gone through so far using the eswiki dump:
thunk:cjb~% mysql -u eswiki -p eswiki < eswiki-20080416-pagelinks.sql
mysql> SELECT pl_namespace, pl_title, COUNT(*) INTO OUTFILE "/tmp/incominglinks" FROM pagelinks GROUP BY pl_namespace, pl_title;
The "incominglinks" file, after processing it to be one article per line and excluding articles with few inbound links, looks like this:
thunk:cjb~ % wc -l incominglinks.names
162974 incominglinks.names
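In case it matters, the processing step boils down to something like the following -- the main-namespace test and the cutoff of five incoming links are shown here just as an example, and /tmp/incominglinks is the tab-separated namespace/title/count file written by MySQL above:

awk -F'\t' '$1 == 0 && $3 >= 5 { print $2 }' /tmp/incominglinks > incominglinks.names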
This is where MWDumper comes in -- I'd like to create an XML dump containing each article listed in incominglinks.names. I tried:
thunk:cjb~% java -jar mwdumper-2008-04-13.jar \
    --output=bzip2:eswiki_limited.xml.bz2 \
    --format=xml \
    --filter=list:incominglinks.names \
    eswiki-20080416-pages-articles.xml.bz2
MWDumper didn't return any errors, and ran through the whole of the .bz2 to completion. I then ran:
thunk:cjb~ % perl -nle 'print $1 if /<title>(.*?)<\/title>/' \
    < eswiki_limited.xml > incominglinks.output
The resulting file is:
thunk:cjb~ % wc -l incominglinks.output
45395 incominglinks.output
Can anyone think of possible reasons for the discrepancy in number of articles asked for and received? I've looked briefly at character set and namespaces, and neither seems responsible. Here is an example of articles present in the input and not the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %
Here are complete versions of the two files above:
http://dev.laptop.org/~cjb/incominglinks.names.sorted
http://dev.laptop.org/~cjb/incominglinks.output.sorted
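The comparison can be reproduced with something like the following (both files are already sorted; the output filename is just an example) to get the full set of titles that were asked for but never emitted:

comm -23 incominglinks.names.sorted incominglinks.output.sorted > incominglinks.missing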
So, a few questions:
* Any ideas on what could cause many articles to be dropped here?
* Is there any further output I could provide that would be useful?
* Is there a tool other than MWDumper that could do this for me?
Thanks very much for any suggestions!
- Chris.
¹: http://www.mediawiki.org/wiki/MWDumper
Chris Ball wrote:
Can anyone think of possible reasons for the discrepancy in number of articles asked for and received? I've looked briefly at character set and namespaces, and neither seems responsible. Here is an example of articles present in the input and not the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %
1) Have you confirmed that all these pages are present in the dump? The pagelinks table lists links regardless of the existence of their targets.
2) A quick look at the ListFilter class in mwdumper indicates it isn't normalizing "_" to " ", though all the XML processing will be using spaces. Ensure that your list uses spaces rather than underscores (a quick workaround is sketched below).
A quick spot-check indicates that the vast majority of failed matches in your data set contained a "_" in the input list; of those without a "_", the few I tried don't currently exist on es.wikipedia.org.
Ensure also that you're not mixing data from different namespaces.
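As a workaround until ListFilter handles this itself, you could pre-normalize the list and point --filter=list: at the converted file; something like this should do it (the output filename is arbitrary):

tr '_' ' ' < incominglinks.names > incominglinks.names.spaces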
-- brion
Hi,
- A quick look at the ListFilter class in mwdumper indicates it
isn't normalizing "_" to " ", though all the XML processing will be using spaces. Ensure that your list uses spaces rather than underscores.
Thanks, Brion, this was exactly the problem. Enclosed is a tested patch to perform the conversion, and a patch to fix up the gcj Make build.
- Chris.
On Wed, May 7, 2008 at 3:32 AM, Chris Ball cjb@laptop.org wrote:
Hi,
- A quick look at the ListFilter class in mwdumper indicates it
isn't normalizing "_" to " ", though all the XML processing will be using spaces. Ensure that your list uses spaces rather than underscores.
Thanks, Brion, this was exactly the problem. Enclosed is a tested patch to perform the conversion, and a patch to fix up the gcj Make build.
- Chris.
Attachments don't come through on the mailing list. Please use bugzilla: https://bugzilla.wikimedia.org/
Bryan