Revision: 6608 Author: cosoleto Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
-----------
Fix a CPU overload problem introduced by the recent changes in the pagegenerators module, which made DuplicateFilterPageGenerator used unconditionally (probably a bad idea). The filter used a 'list' object to check for duplicates and stored whole 'Page' objects, making the comparison process much more expensive.
A 'set' looks more appropriate here, as it is hashed; storing only the title and the interwiki prefix for comparison is enough. This also greatly reduces allocated memory compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
This commit reduces CPU usage for such a simple task on my five/six-year-old system from 99% to 30%.
Modified Paths:
--------------
    trunk/pywikipedia/pagegenerators.py
Modified: trunk/pywikipedia/pagegenerators.py
===================================================================
--- trunk/pywikipedia/pagegenerators.py	2009-04-15 08:28:21 UTC (rev 6607)
+++ trunk/pywikipedia/pagegenerators.py	2009-04-15 17:53:59 UTC (rev 6608)
@@ -705,10 +705,11 @@
     Wraps around another generator. Yields all pages, but prevents
     duplicates.
     """
-    seenPages = []
+    seenPages = set()
     for page in generator:
-        if page not in seenPages:
-            seenPages.append(page)
+        _page = page.aslink(forceInterwiki = True)[2:-2]
+        if _page not in seenPages:
+            seenPages.add(_page)
             yield page

 def RegexFilterPageGenerator(generator, regex):
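The change above can be exercised in isolation. The sketch below is a minimal standalone version of the set-based filter from the diff; the `Page` class here is a hypothetical stand-in for pywikipedia's real `Page` object, assuming only that `aslink(forceInterwiki=True)` returns a `[[lang:Title]]` string (the `[2:-2]` slice strips the surrounding brackets):

```python
class Page:
    """Hypothetical stand-in for pywikipedia's Page object."""
    def __init__(self, lang, title):
        self.lang = lang
        self.title = title

    def aslink(self, forceInterwiki=False):
        # Assumed behaviour: '[[lang:Title]]' when forceInterwiki is set.
        if forceInterwiki:
            return '[[%s:%s]]' % (self.lang, self.title)
        return '[[%s]]' % self.title


def DuplicateFilterPageGenerator(generator):
    """Wraps around another generator. Yields all pages, but prevents
    duplicates.

    Membership tests on a set are O(1) on average versus O(n) for a
    list, and storing short 'lang:Title' strings instead of whole Page
    objects keeps memory usage down.
    """
    seenPages = set()
    for page in generator:
        _page = page.aslink(forceInterwiki=True)[2:-2]
        if _page not in seenPages:
            seenPages.add(_page)
            yield page


pages = [Page('en', 'Foo'), Page('en', 'Bar'),
         Page('en', 'Foo'), Page('de', 'Foo')]
unique = list(DuplicateFilterPageGenerator(iter(pages)))
# The second en:Foo is dropped; de:Foo survives because the
# interwiki prefix differs.
print([p.aslink(forceInterwiki=True) for p in unique])
# prints ['[[en:Foo]]', '[[en:Bar]]', '[[de:Foo]]']
```

Note that deduplicating on the `lang:Title` string rather than the `Page` object also sidesteps whatever `__eq__`/`__hash__` semantics `Page` defines, which is where the previous list-based version paid its cost.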