Chris Ball wrote:
Can anyone think of possible reasons for the discrepancy in number of articles asked for and received? I've looked briefly at character set and namespaces, and neither seems responsible. Here is an example of articles present in the input and not the output:
thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]
thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %
1) Have you confirmed that all these pages are present in the dump? The pagelinks table lists links regardless of the existence of their targets.
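One way to check point 1 is to collect the titles actually present in the dump and compare them against your list. The following is only a rough sketch: the helper name titles_in_dump, the inline sample XML, and the namespace URI in it are made up for illustration, and it assumes the dump is a standard MediaWiki XML export with <page> and <title> elements.

```python
import io
import xml.etree.ElementTree as ET


def titles_in_dump(source):
    """Yield the text of every <title> element in a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags carry the export namespace ("{uri}title"); compare the local name only.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "title":
            yield elem.text or ""
        elif local == "page":
            # Drop the finished page's children so a large dump is not held in memory.
            elem.clear()


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b"<page><title>Arma nuclear</title></page>"
    b"<page><title>Central nuclear</title></page>"
    b"</mediawiki>"
)

present = set(titles_in_dump(io.BytesIO(SAMPLE)))
wanted = ["Arma nuclear", "Bomba nuclear"]
# Titles requested but absent from the dump itself.
missing = [t for t in wanted if t not in present]
```

For a real dump you would pass an open binary file handle instead of the BytesIO sample. Anything left in missing cannot come out of a filter, whatever the filter does.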
2) A quick look at the ListFilter class in mwdumper indicates it isn't normalizing "_" to " ", though all the XML processing will be using spaces. Ensure that your list uses spaces rather than underscores.
A quick spot-check indicates that the vast majority of failed matches in your data set contained a "_" in the input list; of those without a "_", the few I tried don't currently exist on es.wikipedia.org.
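If you would rather fix the list than patch the filter, converting underscores to spaces before handing the list over should be enough. A minimal sketch, assuming one title per line in a UTF-8 file; the function names and the output filename in the usage note are my own, not anything from mwdumper:

```python
def normalize_title(title):
    """Convert a database-style title (underscores) to the display form used in the XML."""
    return title.strip().replace("_", " ")


def rewrite_list(src_path, dst_path):
    """Write a copy of the title list with every title normalized, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(normalize_title(line) + "\n")
```

For example, rewrite_list("incominglinks.names.sorted", "incominglinks.names.spaces") would produce a list containing "Arma nuclear" where the original had "Arma_nuclear".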
Ensure also that you're not mixing data from different namespaces.
-- brion