Re: [Wikitech-l] Article selection with MWDumper.

7 May 2008


      Chris Ball wrote:
...
Can anyone think of possible reasons for the discrepancy in number of
articles asked for and received?  I've looked briefly at character set
and namespaces, and neither seems responsible.  Here is an example of
articles present in the input and not the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
 Abandono_de_la_energía_nuclear
 Accidente_nuclear
 Arma_nuclear
 Armas_nucleares
 Bomba_nuclear
 Central_nuclear
 [59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %

1) Have you confirmed that all these pages are present in the dump? The 
pagelinks table lists links regardless of the existence of their targets.
2) A quick look at the ListFilter class in mwdumper indicates it isn't 
normalizing "_" to " ", though all the XML processing will be using 
spaces. Ensure that your list uses spaces rather than underscores.
A quick spot-check indicates that the vast majority of failed matches in 
your data set contained a "_" in the input list; of those without a "_", 
the few I tried don't currently exist on es.wikipedia.org.
Ensure also that you're not mixing data from different namespaces.
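
For the underscore issue in point 2, a minimal sketch of one way to clean up the list before handing it to mwdumper, assuming the list is a plain text file with one title per line. The helper names (normalize_title, normalize_titles, missing_titles) are hypothetical and are not part of mwdumper; this only rewrites the title list and compares requested against received titles.

```python
def normalize_title(line: str) -> str:
    """Strip surrounding whitespace and turn underscores into spaces."""
    return line.strip().replace("_", " ")


def normalize_titles(lines):
    """Return normalized, non-empty titles, preserving input order."""
    titles = []
    for line in lines:
        title = normalize_title(line)
        if title:
            titles.append(title)
    return titles


def missing_titles(requested, received):
    """Titles that were requested but are absent from the output list."""
    received_set = set(normalize_titles(received))
    return [t for t in normalize_titles(requested) if t not in received_set]


if __name__ == "__main__":
    # Sample titles taken from the grep output quoted above.
    requested = [" Accidente_nuclear", " Arma_nuclear", " Bomba_nuclear"]
    for title in normalize_titles(requested):
        print(title)
```

Running missing_titles over the normalized input and output lists should then leave mostly titles that do not exist in the dump, which is the case point 1 describes.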
-- brion
