[Pywikipedia-l] SVN: [6608] trunk/pywikipedia/pagegenerators.py

cosoleto at svn.wikimedia.org
Wed Apr 15 17:54:00 UTC 2009


Revision: 6608
Author:   cosoleto
Date:     2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)

Log Message:
-----------
Fix for a CPU overload problem introduced by the recent changes to the PageGenerators module, which now always apply DuplicateFilterPageGenerator (probably a bad idea). This filter used a 'list' object to check for duplicate 'Page' objects, so every membership test had to scan the list, and it stored the 'Page' objects themselves, which made each comparison much more expensive.

A 'set' looks more appropriate here, as it is hashed, and storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory considerably compared with the previous revision (an estimated 60-65% with a fixed title length of 14 characters).

On my five- or six-year-old system, this commit reduces CPU usage for such a simple task from 99% to 30%.
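
The idea described in the log message can be sketched in isolation as follows. This is only an illustration, not the code in pagegenerators.py: `duplicate_filter`, `FakePage` and `link_key` are hypothetical names, and `link_key` merely stands in for the string that the commit derives from `page.aslink(forceInterwiki = True)[2:-2]`. The filter remembers one small hashable key per item in a set and still yields the original objects.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def duplicate_filter(items: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield each item once, keeping only a small hashable key per item."""
    seen = set()  # hashed container: membership tests do not scan all entries
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item  # the original object is yielded, only the key is stored


class FakePage:
    """Hypothetical stand-in for a Page object; not the pywikipedia class."""

    def __init__(self, site: str, title: str) -> None:
        self.site = site
        self.title = title

    def link_key(self) -> str:
        # Assumed shape of the comparison key: interwiki prefix plus title.
        return f"{self.site}:{self.title}"


pages = [
    FakePage("en", "Foo"),
    FakePage("it", "Foo"),
    FakePage("en", "Foo"),  # duplicate of the first entry
    FakePage("en", "Bar"),
]
unique = list(duplicate_filter(pages, key=FakePage.link_key))
unique_keys = [p.link_key() for p in unique]
```

Because only short strings are kept in the set, the filter does not hold a reference to every page it has seen, which is in line with the memory reduction mentioned above.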

Modified Paths:
--------------
    trunk/pywikipedia/pagegenerators.py

Modified: trunk/pywikipedia/pagegenerators.py
===================================================================
--- trunk/pywikipedia/pagegenerators.py	2009-04-15 08:28:21 UTC (rev 6607)
+++ trunk/pywikipedia/pagegenerators.py	2009-04-15 17:53:59 UTC (rev 6608)
@@ -705,10 +705,11 @@
     Wraps around another generator. Yields all pages, but prevents
     duplicates.
     """
-    seenPages = []
+    seenPages = set()
     for page in generator:
-        if page not in seenPages:
-            seenPages.append(page)
+        _page = page.aslink(forceInterwiki = True)[2:-2]
+        if _page not in seenPages:
+            seenPages.add(_page)
             yield page
 
 def RegexFilterPageGenerator(generator, regex):




