2009/4/16 cosoleto@svn.wikimedia.org:
Revision: 6608
Author: cosoleto
Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
Fix for a CPU overload problem introduced by the recent changes to the PageGenerators module, which now always apply the DuplicateFilterPageGenerator (probably a bad idea). This filter was using a 'list' object to check for duplicate 'Page' objects and was storing the 'Page' objects themselves, which made the comparison process much more complicated...
A 'set' looks more appropriate here, as it is hashed, and storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory a lot compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
On my five- or six-year-old system, this commit reduces CPU usage for such a simple task from 99% to 30%.
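The idea described in the commit message can be sketched roughly as follows. This is only an illustration, not the actual DuplicateFilterPageGenerator code: the 'Page' class and the 'duplicate_filter' function are hypothetical stand-ins, and the key is assumed to be a (site, title) pair.

```python
class Page:
    """Hypothetical stand-in for the framework's Page class."""

    def __init__(self, site, title):
        self.site = site    # assumed: an identifier for the wiki, e.g. "en"
        self.title = title  # the page title


def duplicate_filter(pages):
    """Yield each page once, skipping later pages with the same (site, title).

    Only a small hashable key is stored per page, not the Page object,
    so each membership test is a hash lookup instead of a list scan.
    """
    seen = set()
    for page in pages:
        key = (page.site, page.title)
        if key not in seen:
            seen.add(key)
            yield page


if __name__ == "__main__":
    pages = [Page("en", "A"), Page("en", "B"), Page("en", "A"), Page("de", "A")]
    for page in duplicate_filter(pages):
        print(page.site, page.title)
```

With a list, every membership test scans all the entries stored so far; with a set of small keys, the test is a single hash lookup on average.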
Good, very nice catch =)
A small note here: set is not really meant to be used with incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each call. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to constructing them incrementally. When I need performance for this kind of lookup, a simple dictionary is usually way faster than a set :) I would suggest using a dictionary here :)
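The concern above is easy to check before switching data structures. A minimal sketch, assuming a current CPython: there the built-in 'set' is mutable and add() updates it in place, while the immutable variant is the separate 'frozenset' type, which has no add() method. The helper names below are hypothetical, and the timing function makes no claim about which container is faster on a given machine.

```python
import timeit


def set_add_is_in_place(n=1000):
    """Return True if n calls to add() leave us with the same set object."""
    seen = set()
    before = id(seen)
    for i in range(n):
        seen.add(i)
    return id(seen) == before and len(seen) == n


def build_set(n):
    """Build a set incrementally, one add() per element."""
    seen = set()
    for i in range(n):
        seen.add(i)
    return seen


def build_dict(n):
    """Build a dict incrementally, using the keys as the 'seen' markers."""
    seen = {}
    for i in range(n):
        seen[i] = True
    return seen


def time_builds(n=100_000, number=5):
    """Time both incremental builds; results depend on machine and version."""
    return {
        "set": timeit.timeit(lambda: build_set(n), number=number),
        "dict": timeit.timeit(lambda: build_dict(n), number=number),
    }


if __name__ == "__main__":
    print("add() mutates in place:", set_add_is_in_place())
    print("frozenset has add():", hasattr(frozenset, "add"))
    print("timings (seconds):", time_builds())
```

Running the timing on the target interpreter gives a direct answer for that setup.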