[pywikibot] Replace.py: very slow reading from XML dump

16 Sep 2018

Hi folks,

I still use trunk/compat for many reasons, but judging from the new code at
https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the
core version must suffer from the same problem.

If we use -namespace for namespace filtering, class
XmlDumpReplacePageGenerator will go through ALL pages, and only THEN is the
result filtered by a namespace generator. This may MULTIPLY the running time
in some cases, and it may cost hours or even days for a fix with complicated,
slow regexes.
I have just checked that the dump does contain namespace information. So why
don't we filter during the scan?
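
A minimal sketch of what filtering during the scan could look like. This is
not pywikibot code: it uses only the standard library, the helper names
(local, matching_titles) and the tiny SAMPLE_DUMP are made up for
illustration, and it assumes each <page> carries an <ns> element, as the
message above says the dump does. The point is only that the regex is never
run on pages outside the wanted namespaces.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical three-page sample in the shape of a MediaWiki export.
SAMPLE_DUMP = """<mediawiki>
  <page><title>Foo</title><ns>0</ns>
    <revision><text>Growth was 5 %.</text></revision></page>
  <page><title>Talk:Foo</title><ns>1</ns>
    <revision><text>I think 7 % is wrong.</text></revision></page>
  <page><title>Bar</title><ns>0</ns>
    <revision><text>No numbers here.</text></revision></page>
</mediawiki>"""

PATTERN = re.compile(r'(\d)\s*%')


def local(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit('}', 1)[-1]


def matching_titles(stream, namespaces):
    """Return titles whose text matches PATTERN, skipping unwanted namespaces.

    Also returns counters showing how many pages were scanned and on how
    many the regex was actually run.
    """
    stats = {'scanned': 0, 'regex_tested': 0}
    titles = []
    for _event, elem in ET.iterparse(stream, events=('end',)):
        if local(elem.tag) != 'page':
            continue
        stats['scanned'] += 1
        fields = {local(child.tag): child for child in elem}
        # Cheap integer check first; the slow regex runs only if it passes.
        if int(fields['ns'].text) in namespaces:
            stats['regex_tested'] += 1
            text = ''
            for node in fields['revision']:
                if local(node.tag) == 'text':
                    text = node.text or ''
            if PATTERN.search(text):
                titles.append(fields['title'].text)
        elem.clear()  # free the page's children once it has been handled
    return titles, stats


titles, stats = matching_titles(io.BytesIO(SAMPLE_DUMP.encode('utf-8')), {0})
print(titles, stats)
```

With namespaces={0} the talk page is counted as scanned but the regex is
never applied to it, which is where the time would be saved on a real dump.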

I ran an experiment. I modified my copy to display the count of articles and
the count of matching pages. The replacement was:
(ur'(\d)\s*%', ur'\1%'),
which seems pretty slow. :-(
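
For readers on Python 3, where the ur'' prefix no longer exists, the same
replacement can be tried out with plain raw strings. The sample sentence is
invented, and the stricter pattern with \s+ is only a hypothetical variant for
comparison, not something proposed in this message. It shows that the original
pattern also matches text that is already correct.

```python
import re

# The replacement from the message, written with Python 3 raw strings.
loose = re.compile(r'(\d)\s*%')
# Hypothetical stricter variant: matches only when there is whitespace
# to remove.
strict = re.compile(r'(\d)\s+%')

sample = 'Growth was 5 % and 7%.'
fixed = loose.sub(r'\1%', sample)
print(fixed)

# The loose pattern matches '7%' although the substitution changes nothing.
print(bool(loose.search('7%')), bool(strict.search('7%')))
```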
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole
dump, I used -xmlstart.) It went through 820 thousand pages and found 240+
matches (I displayed every 10th match).
Then the bot worked for a further 30-40 minutes to check the actual pages
from the live wiki, this time with namespace filtering on. (I don't replace
in this phase, just save the list, so no human interaction is involved in
this time.)
Guess the result! 62 out of 240 remained, so roughly three quarters of the
matches were in namespaces I did not want. This means that the bigger part
of these 14 hours went into /dev/null.
Now I realize how much time I have wasted in the past 10 years. :-(

I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth it.

-- 
Bináris
