[Pywikipedia-l] [ pywikipediabot-Bugs-2193942 ] reading category: memory leak and slow down

SourceForge.net noreply at sourceforge.net
Sat Oct 25 11:10:52 UTC 2008


Bugs item #2193942, was opened at 2008-10-25 13:10
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2193942&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: category
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Simone Malacarne (smalacarne)
Assigned to: Nobody/Anonymous (nobody)
Summary: reading category: memory leak and slow down

Initial Comment:
I need to read a very big category (80.000+ articles).
So i just do:
site = wikipedia.getSite()
cat = catlib.Category(site,'category name')
gen = pagegenerators.PreloadingGenerator(cat.articles(), pageNumber=100)
for page in gen:
       do_something

problem is that the program start using more and more memory (at the end near 2giga ram). Even cpu time increase over time, if first 10.000 articles are processed in 10 min, second 10.000 double that time and so on... it takes about 20 hours to read all the articles.

If i use: 
gen = pagegenerators.CategorizedPageGenerator(cat , recurse=False, start=u'')
instead of PreloadingGenerator i dont have mem or cpu leaks but it's slow as hell to read and articles at the time (more than 24 hours to finish).

Pywikipedia [http] trunk/pywikipedia (r6015, Oct 24 2008, 18:29:39)
Python 2.5.2 (r252:60911, Oct  5 2008, 19:29:17) 
[GCC 4.3.2]

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2193942&group_id=93107



More information about the Pywikipedia-l mailing list