[Pywikipedia-l] [ pywikipediabot-Patches-1843787 ] catlib _getContentsAndSupercats performance issue

SourceForge.net noreply at sourceforge.net
Sun Dec 9 13:37:53 UTC 2007


Patches item #1843787, was opened at 2007-12-04 04:18
Message generated for change (Comment added) made by rotemliss
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1843787&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Pietro Battiston (toobaz)
Assigned to: Nobody/Anonymous (nobody)
Summary: catlib _getContentsAndSupercats performance issue

Initial Comment:
catlib.py's _getContentsAndSupercats method has a performance issue that in
some cases can greatly slow down recursively downloading all pages
or subcategories of a category.

See this example (chosen just because it's short to report, not because
it's so pathological):

###########ipython output###############
In [1]: import catlib
Checked for running processes. 1 processes currently running, including the
current process.

In [2]: len(catlib.Category('it', 'Categoria:Geometria descrittiva').articlesList(recurse=True))
Getting [[Categoria:Geometria descrittiva]]...
Getting [[Categoria:Coperture a volta]]...
Getting [[Categoria:Corrispondenza biunivoca (geometria descrittiva)]]...
Getting [[Categoria:Curve piane]]...
Getting [[Categoria:Curve tridimensionali]]...
Getting [[Categoria:Glossario (geometria descrittiva)]]...
Getting [[Categoria:Metodi di rappresentazione]]...
Getting [[Categoria:Modellazione geometrica]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Poliedri]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Problemi di misura]]...
Getting [[Categoria:Stub geometria descrittiva]]...
Getting [[Categoria:Superfici]]...
Getting [[Categoria:Sviluppo di solidi]]...
Getting [[Categoria:Tangenza]]...
Out[2]: 393
###########end ipython output###############

As you can see, [[Categoria:Tassellazioni]] is downloaded twice. I
can assure you that there are many much worse cases.

Anyway, I'm attaching a patch. After the patch, here are the same
commands:

###########ipython output###############
In [1]: import catlib
Checked for running processes. 1 processes currently running, including the
current process.

In [2]: len(catlib.Category('it', 'Categoria:Geometria descrittiva').articlesList(recurse=True))
Getting [[Categoria:Geometria descrittiva]]...
Getting [[Categoria:Coperture a volta]]...
Getting [[Categoria:Corrispondenza biunivoca (geometria descrittiva)]]...
Getting [[Categoria:Curve piane]]...
Getting [[Categoria:Curve tridimensionali]]...
Getting [[Categoria:Glossario (geometria descrittiva)]]...
Getting [[Categoria:Metodi di rappresentazione]]...
Getting [[Categoria:Modellazione geometrica]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Poliedri]]...
Getting [[Categoria:Problemi di misura]]...
Getting [[Categoria:Stub geometria descrittiva]]...
Getting [[Categoria:Superfici]]...
Getting [[Categoria:Sviluppo di solidi]]...
Getting [[Categoria:Tangenza]]...
Out[2]: 393
###########end ipython output###############

Note that this patch also solves the problem of possible loops in categories:
catlib won't loop.
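
To make the idea concrete, here is a minimal sketch of the technique (it is
not the attached patch). The GRAPH dictionary, fetch() and walk() are
hypothetical stand-ins for the wiki and for catlib's download step, and the
loop in the toy graph is invented for illustration. Remembering which
categories were already visited avoids both the repeated download and an
endless loop:

```python
# Illustrative sketch only: not the attached patch. GRAPH and fetch() are
# hypothetical stand-ins for the wiki and for catlib's page download.
GRAPH = {
    "Geometria descrittiva": ["Modellazione geometrica", "Poliedri"],
    "Modellazione geometrica": ["Tassellazioni"],
    "Poliedri": ["Tassellazioni"],
    "Tassellazioni": ["Geometria descrittiva"],  # deliberate, invented loop
}

fetched = []  # records every simulated download, in order


def fetch(name):
    """Simulate downloading a category page; return its subcategories."""
    fetched.append(name)
    return GRAPH.get(name, [])


def walk(name, seen=None):
    """Visit name and all its subcategories, fetching each one only once."""
    if seen is None:
        seen = set()
    if name in seen:
        # Already downloaded: skip it. This also breaks category loops.
        return seen
    seen.add(name)
    for sub in fetch(name):
        walk(sub, seen)
    return seen


walk("Geometria descrittiva")
print(fetched)
```

In this toy run "Tassellazioni" is fetched once even though two categories
contain it, and the walk terminates despite the loop.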

----------------------------------------------------------------------

Comment By: Rotem Liss (rotemliss)
Date: 2007-12-09 15:37

Message:
Logged In: YES 
user_id=1327030
Originator: NO

I changed the patch so that it also solves the problem of duplicate
returns and adds the subcategory to its cache, and applied it in
r4656.
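
For illustration, a rough sketch of what avoiding duplicate returns can look
like (this is not the code committed in r4656; CONTENTS and articles() are
hypothetical names, and the toy data is invented). Each category is visited
once, and a page that belongs to several categories is yielded only once:

```python
# Illustrative sketch only, not the code committed in r4656.
# CONTENTS is a hypothetical toy wiki: category -> (articles, subcategories).
CONTENTS = {
    "A": (["page1", "page2"], ["B", "C"]),
    "B": (["page2", "page3"], ["C"]),
    "C": (["page3", "page4"], []),
}


def articles(root):
    """Yield every article under root once, visiting each category once."""
    seen_cats = {root}
    seen_articles = set()
    stack = [root]
    while stack:
        cat = stack.pop()
        pages, subcats = CONTENTS[cat]
        for page in pages:
            if page not in seen_articles:  # skip pages already returned
                seen_articles.add(page)
                yield page
        for sub in subcats:
            if sub not in seen_cats:  # skip categories already queued
                seen_cats.add(sub)
                stack.append(sub)


result = list(articles("A"))
print(sorted(result))
```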

----------------------------------------------------------------------
