There is another issue with that generator: it always checks for replacements but does not
apply them which means replacements are always done twice which might slow down the run
too. I think we should open a Phabricator task for it.
Best
Xqt
Am 16.09.2018 um 22:03 schrieb Bináris
<wikiposta(a)gmail.com>om>:
Hi folks,
I still use trunk/compat for many reasons, but as I see the new code at
https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the core version
must suffer from the same problem.
If we use -namespace for namespace filtering, class XmlDumpReplacePageGenerator will go
through ALL pages, THEN the result is filtered by a namespace generator. This may MULTIPLY
the running time in some cases and this may cost hours or even days for a fix of
complicated, slow regexes.
I have just checked, that dump does contain namespace informátion. So why don't we
filter during the scan?
I made an experiment. I modified my copy to display count of articles and count of
matching pages. The replacement was:
(ur'(\d)\s*%', ur'\1%'),
which seems pretty slow. :-(
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole dump, I used
-xmlstart.) It went through 820 thousand pages and found 240+ matches (I displayed every
10th match).
Then the bot worked further 30-40 minutes to check the actual pages from live wiki, this
time with namespace filtering on. (I don't replace in this phase, just save the list,
so no human interaction is implied in this time.)
Guess the result! 62 out of 240 remained. This means that the bigger part of these 14
hours went into /dev/null.
Now I realize how much time I wasted in the past 10 years. :-(
I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth.
--
Bináris
_______________________________________________
pywikibot mailing list
pywikibot(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikibot