Patches item #1843787, was opened at 2007-12-03 23:18
Message generated for change (Settings changed) made by leogregianin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1843787...
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: Pietro Battiston (toobaz)
Assigned to: Nobody/Anonymous (nobody)
Summary: catlib _getContentsAndSupercats performance issue
Initial Comment:
catlib.py's _getContentsAndSupercats method has a performance issue that, in some cases, can greatly slow down the process of recursively downloading all pages or subcategories of a category.
See this example (chosen because it is short enough to report, not because it is especially pathological):
###########ipython output###############
In [1]: import catlib
Checked for running processes. 1 processes currently running, including the current process.

In [2]: len(catlib.Category('it', 'Categoria:Geometria descrittiva').articlesList(recurse=True))
Getting [[Categoria:Geometria descrittiva]]...
Getting [[Categoria:Coperture a volta]]...
Getting [[Categoria:Corrispondenza biunivoca (geometria descrittiva)]]...
Getting [[Categoria:Curve piane]]...
Getting [[Categoria:Curve tridimensionali]]...
Getting [[Categoria:Glossario (geometria descrittiva)]]...
Getting [[Categoria:Metodi di rappresentazione]]...
Getting [[Categoria:Modellazione geometrica]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Poliedri]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Problemi di misura]]...
Getting [[Categoria:Stub geometria descrittiva]]...
Getting [[Categoria:Superfici]]...
Getting [[Categoria:Sviluppo di solidi]]...
Getting [[Categoria:Tangenza]]...
Out[2]: 393
###########end ipython output###############
As you can see, [[Categoria:Tassellazioni]] is downloaded twice. I can assure you that there are many much worse cases.
Anyway, I'm attaching a patch. With the patch applied, the same commands give:
###########ipython output###############
In [1]: import catlib
Checked for running processes. 1 processes currently running, including the current process.

In [2]: len(catlib.Category('it', 'Categoria:Geometria descrittiva').articlesList(recurse=True))
Getting [[Categoria:Geometria descrittiva]]...
Getting [[Categoria:Coperture a volta]]...
Getting [[Categoria:Corrispondenza biunivoca (geometria descrittiva)]]...
Getting [[Categoria:Curve piane]]...
Getting [[Categoria:Curve tridimensionali]]...
Getting [[Categoria:Glossario (geometria descrittiva)]]...
Getting [[Categoria:Metodi di rappresentazione]]...
Getting [[Categoria:Modellazione geometrica]]...
Getting [[Categoria:Tassellazioni]]...
Getting [[Categoria:Poliedri]]...
Getting [[Categoria:Problemi di misura]]...
Getting [[Categoria:Stub geometria descrittiva]]...
Getting [[Categoria:Superfici]]...
Getting [[Categoria:Sviluppo di solidi]]...
Getting [[Categoria:Tangenza]]...
Out[2]: 393
###########end ipython output###############
Note that this patch also solves the problem of possible loops in the category tree: if a category is (directly or indirectly) a subcategory of itself, catlib won't loop.
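
The idea can be illustrated with a minimal sketch. This is not the attached patch and does not use catlib's real internals; the GRAPH dictionary and the fetch and articles_recursive names are hypothetical stand-ins chosen for illustration. The traversal remembers which categories it has already fetched, so a category reachable by two paths is downloaded once and a loop in the category tree terminates.

```python
# Hypothetical in-memory category graph standing in for the wiki.
# Maps a category name to (articles, subcategories).
GRAPH = {
    "A": (["a1", "a2"], ["B", "C"]),
    "B": (["b1"], ["D"]),
    "C": (["c1"], ["D", "A"]),  # "D" is reachable twice; "A" closes a loop
    "D": (["d1"], []),
}

fetch_log = []  # records every simulated download


def fetch(name):
    """Stand-in for the network request that loads one category page."""
    fetch_log.append(name)
    return GRAPH[name]


def articles_recursive(root):
    """Collect articles under root, fetching each category at most once."""
    visited = set()
    articles = []
    stack = [root]
    while stack:
        name = stack.pop()
        if name in visited:
            continue  # already downloaded: skip duplicates and loops
        visited.add(name)
        arts, subcats = fetch(name)
        articles.extend(arts)
        # Push in reverse so subcategories are visited in listed order.
        stack.extend(reversed(subcats))
    return articles


result = articles_recursive("A")
```

In this toy graph "D" is listed under both "B" and "C", yet fetch_log contains it only once, and the back-reference from "C" to "A" does not cause an endless traversal.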
----------------------------------------------------------------------
Comment By: Rotem Liss (rotemliss)
Date: 2007-12-09 10:37

Message:
Logged In: YES
user_id=1327030
Originator: NO
I changed the patch so that it also solves the problem of duplicate returns and also adds the subcategory to its cache, and applied it in r4656.
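
Assuming "duplicate returns" means that a page belonging to several subcategories is returned more than once, the fix can be sketched as an order-preserving filter over the returned items. The unique_in_order helper below is hypothetical and is not the code committed in r4656.

```python
def unique_in_order(items):
    """Yield each item once, keeping the order of first appearance."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# A page listed in two subcategories appears twice in the raw results.
raw = ["Cubo", "Sfera", "Cubo", "Cono", "Sfera"]
deduplicated = list(unique_in_order(raw))
```

Here deduplicated keeps only the first occurrence of each title, so the caller sees every page exactly once.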
----------------------------------------------------------------------
pywikipedia-l@lists.wikimedia.org