2009/4/16 <cosoleto(a)svn.wikimedia.org>:
Revision: 6608
Author: cosoleto
Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
-----------
Correction for a CPU overload problem introduced with the recent changes in the
PageGenerators module, which now always uses the DuplicateFilterPageGenerator (probably a
bad idea). This filter was using a 'list' object to check for duplicated
'Page' objects and was storing the 'Page' objects themselves, making the comparison
process much more complicated...
'set' looks more appropriate here, as it is hashed; and storing the title and
the interwiki link for comparison should be enough. This also reduces allocated memory a
lot compared with the previous revision (an estimated 60-65% with a fixed title length of 14
chars).
This commit reduces CPU usage for such a simple task on my five- or six-year-old system from
99% to 30%.
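
To make the change described in the log message concrete, here is a minimal sketch of a duplicate filter that keeps a 'set' of small keys rather than a 'list' of whole objects. It is not the code from revision 6608: the Page stand-in, its field names, and the function name duplicate_filter are assumptions made up for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the real 'Page' class; only the two fields
# used as the comparison key are modelled here.
Page = namedtuple("Page", ["site", "title"])


def duplicate_filter(pages):
    """Yield each page once, remembering only a small hashable key."""
    seen = set()
    for page in pages:
        key = (page.site, page.title)
        if key not in seen:  # hash lookup, instead of scanning a list
            seen.add(key)
            yield page


pages = [
    Page("en", "Foo"),
    Page("it", "Foo"),
    Page("en", "Foo"),  # duplicate of the first entry
    Page("en", "Bar"),
]
unique = list(duplicate_filter(pages))
```

A membership test on a list compares against the stored items one by one, so the cost grows with the number of pages already seen; a set lookup hashes the key instead.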
Good, very nice catch =)
A small note here: the built-in set is mutable, and add() updates it in
place (only frozenset is immutable), so building one incrementally with
.add() is fine. Sets also give you set operations (union, intersection).
When I need performance for this kind of lookup, I usually reach for a
simple dictionary, which relies on the same hashing; it could be worth
timing one here too :)
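
The set-versus-dictionary question is easy to check on a given interpreter. Below is a rough sketch (the function names and sizes are made up for illustration) that tests whether add() keeps the same set object and times incremental fills of both containers. No timings are quoted here, since they depend on the machine and the Python version.

```python
import timeit

# Does add() keep working on the same object?
seen = set()
before = id(seen)
for i in range(1000):
    seen.add("Title %d" % i)
same_object = id(seen) == before


def fill_set(n=10000):
    """Build a set one element at a time."""
    s = set()
    for i in range(n):
        s.add(i)
    return s


def fill_dict(n=10000):
    """Build a dict one key at a time, using the keys as the 'set'."""
    d = {}
    for i in range(n):
        d[i] = True
    return d


if __name__ == "__main__":
    print("same object after add():", same_object)
    print("set fill :", timeit.timeit(fill_set, number=100))
    print("dict fill:", timeit.timeit(fill_dict, number=100))
```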
--
Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ]