Revision: 6608 Author: cosoleto Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
-----------
Correction for a CPU overload problem introduced by the recent changes in the PageGenerators module, which made DuplicateFilterPageGenerator run always (probably a bad idea). This filter was using a 'list' object to check for duplicate 'Page' objects and was storing whole 'Page' objects, making the comparison process much more expensive than necessary...
A 'set' looks more appropriate here, as it is hash-based; storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory considerably compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
This commit reduces CPU usage for such a simple task from 99% to 30% on my five- or six-year-old system.
Modified Paths:
--------------
trunk/pywikipedia/pagegenerators.py
Modified: trunk/pywikipedia/pagegenerators.py
===================================================================
--- trunk/pywikipedia/pagegenerators.py 2009-04-15 08:28:21 UTC (rev 6607)
+++ trunk/pywikipedia/pagegenerators.py 2009-04-15 17:53:59 UTC (rev 6608)
@@ -705,10 +705,11 @@
     Wraps around another generator. Yields all pages, but prevents duplicates.
     """
-    seenPages = []
+    seenPages = set()
     for page in generator:
-        if page not in seenPages:
-            seenPages.append(page)
+        _page = page.aslink(forceInterwiki = True)[2:-2]
+        if _page not in seenPages:
+            seenPages.add(_page)
             yield page
 
 def RegexFilterPageGenerator(generator, regex):
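To illustrate why the change matters: 'page not in seenPages' on a list scans and compares every stored element (O(n) per lookup), while the same test on a set is a single hash lookup. A rough standalone timing sketch in the Python 2 style of this thread (the title strings and sizes are made up; this is not pywikipedia code):

from timeit import Timer

titles = ['Title%06d' % i for i in range(5000)]
asList = list(titles)
asSet = set(titles)

def scanList():
    # membership test against a list: linear scan, __eq__ on each element
    for t in titles:
        t in asList

def scanSet():
    # membership test against a set: a hash lookup
    for t in titles:
        t in asSet

print Timer('scanList()', 'from __main__ import scanList').timeit(number=1), \
    'seconds for 5000 lookups in a list'
print Timer('scanSet()', 'from __main__ import scanSet').timeit(number=1), \
    'seconds for 5000 lookups in a set'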
2009/4/16 cosoleto@svn.wikimedia.org:
Revision: 6608 Author: cosoleto Date: 2009-04-15 17:53:59 +0000 (Wed, 15 Apr 2009)
Log Message:
Correction for a CPU overload problem introduced by the recent changes in the PageGenerators module, which made DuplicateFilterPageGenerator run always (probably a bad idea). This filter was using a 'list' object to check for duplicate 'Page' objects and was storing whole 'Page' objects, making the comparison process much more expensive than necessary...
A 'set' looks more appropriate here, as it is hash-based; storing only the title and the interwiki link for comparison should be enough. This also reduces allocated memory considerably compared with the previous revision (an estimated 60-65% with a fixed title length of 14 chars).
This commit reduces CPU usage for such a simple task from 99% to 30% on my five- or six-year-old system.
Good, very nice catch =)
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
"Nicolas Dumazet" nicdumz@gmail.com said:
Good, very nice catch =)
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
A dict works fine, but to be pedantic about it, sets *are* mutable. If you want an immutable set, use the frozenset() type. http://docs.python.org/library/stdtypes.html#set-types-set-frozenset
Russ
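A minimal sketch of the distinction Russ points out, using only plain builtins (nothing pywikipedia-specific):

# set objects are mutable: add() modifies the set in place, it does not
# build a new one.
s = set(['a'])
before = id(s)
s.add('b')
assert id(s) == before            # same object, mutated in place
assert s == set(['a', 'b'])

# frozenset is the immutable variant: hashable, but it has no add().
fs = frozenset(['a'])
assert not hasattr(fs, 'add')
d = {fs: 'usable as a dict key'}  # works; a plain (mutable) set could not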
Nicolas Dumazet wrote:
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
'set' is implemented like a 'dict'; with the add() method you are just adding a new key, so using a 'dict' directly should theoretically be faster. Unfortunately, I cannot see this in a test I have run (1.23% slower). I'll try to run it again later. Maybe I am missing something.
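For clarity, the two membership-tracking idioms being compared boil down to this (a tiny sketch with made-up titles):

# Tracking seen titles in a set vs. in a dict whose values are dummies
# (only the keys matter for the 'in' test).
titles = ['Foo', 'Bar', 'Foo']

seenSet = set()
seenDict = {}

for title in titles:
    if title not in seenSet:
        seenSet.add(title)          # set idiom
    if title not in seenDict:
        seenDict[title] = True      # dict idiom: the value is irrelevant

assert sorted(seenSet) == sorted(seenDict.keys()) == ['Bar', 'Foo']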
2009/4/16 Francesco Cosoleto cosoleto@free.fr
Nicolas Dumazet wrote:
A small note here: set is not really meant to be used for incremental .add() calls, because sets are frozen (not mutable), and add() instantiates a new set on each .add() action. Sets are useful for set operations (union, intersection), but are not really helpful when it comes to incrementally constructing them. When I need performance for this kind of lookup, a simple dictionary is usually way faster than sets :) I would suggest using a dictionary here :)
'set' is implemented like a 'dict'; with the add() method you are just adding a new key, so using a 'dict' directly should theoretically be faster. Unfortunately, I cannot see this in a test I have run (1.23% slower). I'll try to run it again later. Maybe I am missing something.
Yes, after Russell's message, I double-checked set.add vs dict.__setitem__ performance, and to my surprise, the runtime differences were not as significant as I first thought. One important parameter, however, seems to be the length of the keys.
I am currently running more systematic tests, I will let you know the results =)
2009/4/16 Nicolas Dumazet nicdumz@gmail.com
I am currently running more systematic tests, I will let you know the results =)
Oh well. I don't know where I got that key-length dependence idea. Performance is absolutely not related to the length of the keys, from what I saw today.
From this test, dict.__setitem__ seems to be 25% faster than set.add on Python 2.5:
from timeit import Timer
import string, random

input = []
testSet = set()
testDict = dict()

def initAdd():
    for i in range(1000):
        input.append("".join(random.sample(string.letters, 40)))

def setAdd():
    for e in input:
        testSet.add(e)

def dictAdd():
    for e in input:
        testDict[e] = True

initAdd()
s = Timer('setAdd()', 'from __main__ import setAdd').timeit()
print int(s), 'seconds for set.add'
d = Timer('dictAdd()', 'from __main__ import dictAdd').timeit()
print int(d), 'seconds for dict.__setitem__'

notIn = []
def initContains():
    for e in input:
        notIn.append('0' + e[1:])

def setContains():
    for e in input:
        assert e in testSet
    for e in notIn:
        assert e not in testSet

def dictContains():
    for e in input:
        assert e in testDict
    for e in notIn:
        assert e not in testDict

initContains()
s = Timer('setContains()', 'from __main__ import setContains').timeit()
print int(s), 'seconds for set.__contains__'
d = Timer('dictContains()', 'from __main__ import dictContains').timeit()
print int(d), 'seconds for dict.__contains__'

# 312 seconds for set.add
# 224 seconds for dict.__setitem__

# 359 seconds for set.__contains__
# 362 seconds for dict.__contains__
As you can see, __contains__ performance is the same for set and dict.
I tried on three different machines: different absolute values, same relative differences.
Asking around about that "slow set" rumor, I was told that set might have performed quite badly in Python 2.4. Well, I tried, and 2.4 is not significantly slower than 2.5 on those operations (did the set improvements get backported to 2.5?)
Anyway, bottom line:
* "bite your tongue a thousand times..."
* set is mutable, and has never been immutable in Python's history
* dict.__setitem__ looks 25% faster than set.add
* key length (strings, from 5 to 100 characters long) does not appear to affect run time
* Python 2.4 & 2.5 perform equally well wrt set & dict operations
Nicolas Dumazet wrote:
From this test, dict.__setitem__ seems to be 25% faster than set.add on Python 2.5:
[...]
# 312 seconds for set.add
# 224 seconds for dict.__setitem__
# 359 seconds for set.__contains__
# 362 seconds for dict.__contains__
As you can see, __contains__ performance is the same for set and dict.
I tried on three different machines: different absolute values, same relative differences.
I can confirm these values on my Linux system. On Cygwin the __contains__ part looks slower:
400 seconds for set.add
291 seconds for dict.__setitem__
505 seconds for set.__contains__
550 seconds for dict.__contains__
Asking around about that "slow set" rumor, I was told that set might have performed quite badly in Python 2.4. Well, I tried, and 2.4 is not significantly slower than 2.5 on those operations (did the set improvements get backported to 2.5?)
Probably it was 2.3 or an earlier version instead.
Anyway, bottom line:
- "bite your tongue a thousand times..."
- set is mutable, and has never been immutable in Python's history
- dict.__setitem__ looks 25% faster than set.add
- key length (strings, from 5 to 100 characters long) does not appear to affect run time
- Python 2.4 & 2.5 perform equally well wrt set & dict operations
OK, I think we can replace 'set' with 'dict', but there are some other nearby functions that are extremely slow and should be optimized first, so that the performance change is more noticeable. I am going to do that.
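For reference, the dict-based variant being discussed would read roughly like this (a sketch derived from the r6608 code above, not the actual follow-up commit):

def DuplicateFilterPageGenerator(generator):
    """
    Wraps around another generator. Yields all pages, but prevents duplicates.
    """
    # Same idea as r6608, but tracking seen titles as dict keys
    # (the stored value is irrelevant) instead of as set members.
    seenPages = {}
    for page in generator:
        _page = page.aslink(forceInterwiki = True)[2:-2]
        if _page not in seenPages:
            seenPages[_page] = True
            yield page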