Revision: 6608 Author: cosoleto Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
-----------
Correction for a CPU overload problem introduced by the recent changes in the PageGenerators module, which made DuplicateFilterPageGenerator run always (probably a bad idea). This filter was using a 'list' object to check for duplicate 'Page' objects and was storing whole 'Page' objects, making the comparison process much more expensive than necessary...
A 'set' looks more appropriate here, as it is hash-based; storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory considerably compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
This commit reduces CPU usage for such a simple task from 99% to 30% on my five- or six-year-old system.
Modified Paths:
--------------
trunk/pywikipedia/pagegenerators.py
Modified: trunk/pywikipedia/pagegenerators.py
===================================================================
--- trunk/pywikipedia/pagegenerators.py 2009-04-15 08:28:21 UTC (rev 6607)
+++ trunk/pywikipedia/pagegenerators.py 2009-04-15 17:53:59 UTC (rev 6608)
@@ -705,10 +705,11 @@
     Wraps around another generator. Yields all pages, but prevents duplicates.
     """
-    seenPages = []
+    seenPages = set()
     for page in generator:
-        if page not in seenPages:
-            seenPages.append(page)
+        _page = page.aslink(forceInterwiki = True)[2:-2]
+        if _page not in seenPages:
+            seenPages.add(_page)
             yield page
 
 def RegexFilterPageGenerator(generator, regex):
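To illustrate why the change matters: 'page not in seenPages' on a list scans and compares every stored element (O(n) per lookup), while the same test on a set is a single hash lookup. A rough standalone timing sketch in the Python 2 style of this thread (the title strings and sizes are made up; this is not pywikipedia code):

from timeit import Timer

titles = ['Title%06d' % i for i in range(5000)]
asList = list(titles)
asSet = set(titles)

def scanList():
    # membership test against a list: linear scan, __eq__ on each element
    for t in titles:
        t in asList

def scanSet():
    # membership test against a set: a hash lookup
    for t in titles:
        t in asSet

print Timer('scanList()', 'from __main__ import scanList').timeit(number=1), \
    'seconds for 5000 lookups in a list'
print Timer('scanSet()', 'from __main__ import scanSet').timeit(number=1), \
    'seconds for 5000 lookups in a set'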
2009/4/16 cosoleto@svn.wikimedia.org:
Revision: 6608 Author: cosoleto Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
Correction for a CPU overload problem introduced by the recent changes in the PageGenerators module, which made DuplicateFilterPageGenerator run always (probably a bad idea). This filter was using a 'list' object to check for duplicate 'Page' objects and was storing whole 'Page' objects, making the comparison process much more expensive than necessary...
A 'set' looks more appropriate here, as it is hash-based; storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory considerably compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
This commit reduces CPU usage for such a simple task from 99% to 30% on my five- or six-year-old system.
Good, very nice catch =)
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
"Nicolas Dumazet" nicdumz@gmail.com said:
Good, very nice catch =)
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
A dict works fine, but to be pedantic about it, sets *are* mutable. If you want an immutable set, use the frozenset() type. http://docs.python.org/library/stdtypes.html#set-types-set-frozenset
Russ
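A minimal sketch of the distinction Russ points out, using only plain builtins (nothing pywikipedia-specific):

# set objects are mutable: add() modifies the set in place, it does not
# build a new one.
s = set(['a'])
before = id(s)
s.add('b')
assert id(s) == before            # same object, mutated in place
assert s == set(['a', 'b'])

# frozenset is the immutable variant: hashable, but it has no add().
fs = frozenset(['a'])
assert not hasattr(fs, 'add')
d = {fs: 'usable as a dict key'}  # works; a plain (mutable) set could not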
Nicolas Dumazet wrote:
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
'set' is implemented like a 'dict'; with the add() method you are just adding a new key, so using a 'dict' directly should theoretically be faster. Unfortunately, I cannot see this in a test I have run (1.23% slower). I'll try to run it again later. Maybe I am missing something.
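For clarity, the two membership-tracking idioms being compared boil down to this (a tiny sketch with made-up titles):

# Tracking seen titles in a set vs. in a dict whose values are dummies
# (only the keys matter for the 'in' test).
titles = ['Foo', 'Bar', 'Foo']

seenSet = set()
seenDict = {}

for title in titles:
    if title not in seenSet:
        seenSet.add(title)          # set idiom
    if title not in seenDict:
        seenDict[title] = True      # dict idiom: the value is irrelevant

assert sorted(seenSet) == sorted(seenDict.keys()) == ['Bar', 'Foo']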
2009/4/16 Francesco Cosoleto cosoleto@free.fr
Nicolas Dumazet wrote:
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
'set' is implemented like a 'dict'; with the add() method you are just adding a new key, so using a 'dict' directly should theoretically be faster. Unfortunately, I cannot see this in a test I have run (1.23% slower). I'll try to run it again later. Maybe I am missing something.
Yes, after Russell's message, I double-checked set.add vs dict.__setitem__ performance, and to my surprise, the runtime differences were not as significant as I first thought. One important parameter, however, seems to be the length of the keys.
I am currently running more systematic tests, I will let you know the results =)
2009/4/16 Nicolas Dumazet nicdumz@gmail.com
I am currently running more systematic tests, I will let you know the results =)
Oh well. I don't know where I got that key-length dependence idea. Performance is absolutely not related to the length of the keys, from what I saw today.
From this test, dict.__setitem__ seems to be 25% faster than set.add on Python 2.5:
from timeit import Timer
import string, random

input = []
testSet = set()
testDict = dict()

def initAdd():
    for i in range(1000):
        input.append("".join(random.sample(string.letters, 40)))

def setAdd():
    for e in input:
        testSet.add(e)

def dictAdd():
    for e in input:
        testDict[e] = True

initAdd()
s = Timer('setAdd()', 'from __main__ import setAdd').timeit()
print int(s), 'seconds for set.add'
d = Timer('dictAdd()', 'from __main__ import dictAdd').timeit()
print int(d), 'seconds for dict.__setitem__'

notIn = []
def initContains():
    for e in input:
        notIn.append('0' + e[1:])

def setContains():
    for e in input:
        assert e in testSet
    for e in notIn:
        assert e not in testSet

def dictContains():
    for e in input:
        assert e in testDict
    for e in notIn:
        assert e not in testDict

initContains()
s = Timer('setContains()', 'from __main__ import setContains').timeit()
print int(s), 'seconds for set.__contains__'
d = Timer('dictContains()', 'from __main__ import dictContains').timeit()
print int(d), 'seconds for dict.__contains__'

# 312 seconds for set.add
# 224 seconds for dict.__setitem__

# 359 seconds for set.__contains__
# 362 seconds for dict.__contains__
As you can see, __contains__ performance is the same for set and dict.
I tried on three different machines: different absolute values, same relative differences.
Asking around about that "slow set" rumor, I was told that set might have performed quite badly in Python 2.4. Well, I tried, and 2.4 is not significantly slower than 2.5 on those operations (did the set improvements get backported to 2.5?)
Anyway, bottom line:
* "bite your tongue a thousand times..."
* set is mutable, and has never been immutable in Python's history
* dict.__setitem__ looks 25% faster than set.add
* key length (strings, from 5 to 100 characters long) does not appear to affect run time
* Python 2.4 & 2.5 perform equally well wrt set & dict operations
Nicolas Dumazet wrote:
From this test, dict.__setitem__ seems to be 25% faster than set.add on Python 2.5:
[...]
# 312 seconds for set.add
# 224 seconds for dict.__setitem__
# 359 seconds for set.__contains__
# 362 seconds for dict.__contains__
As you can see, __contains__ performance is the same for set and dict.
I tried on three different machines: different absolute values, same relative differences.
I can confirm these values on my Linux system. On Cygwin the __contains__ part looks slower:
400 seconds for set.add
291 seconds for dict.__setitem__
505 seconds for set.__contains__
550 seconds for dict.__contains__
Asking around about that "slow set" rumor, I was told that set might have performed quite badly in Python 2.4. Well, I tried, and 2.4 is not significantly slower than 2.5 on those operations (did the set improvements get backported to 2.5?)
Probably it was 2.3 or an earlier version instead.
Anyway, bottom line:
- "bite your tongue a thousand times..."
- set is mutable, and has never been immutable in Python's history
- dict.__setitem__ looks 25% faster than set.add
- key length (strings, from 5 to 100 characters long) does not appear to affect run time
- Python 2.4 & 2.5 perform equally well wrt set & dict operations
OK, I think we can replace 'set' with 'dict', but there are some other nearby functions that are extremely slow and should be optimized first, so that the performance change is more noticeable. I am going to do that.
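For reference, the dict-based variant being discussed would read roughly like this (a sketch derived from the r6608 code above, not the actual follow-up commit):

def DuplicateFilterPageGenerator(generator):
    """
    Wraps around another generator. Yields all pages, but prevents duplicates.
    """
    # Same idea as r6608, but tracking seen titles as dict keys
    # (the stored value is irrelevant) instead of as set members.
    seenPages = {}
    for page in generator:
        _page = page.aslink(forceInterwiki = True)[2:-2]
        if _page not in seenPages:
            seenPages[_page] = True
            yield page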