Bugs item #2136828, was opened at 2008-09-29 21:10
Message generated for change (Comment added) made by spacebirdy
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2136828&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: ulana merops (spacebirdy)
Assigned to: Nobody/Anonymous (nobody)
Summary: wiktionary_family.py - wrong sort order for fr.wikt.
Initial Comment:
Please see http://fr.wiktionary.org/wiki/Discussion_Wiktionnaire:Structure_des_article…
and remove the entry
'fr': self.alphabetic,
on line 416.
Syntax on fr.wikt:
http://fr.wiktionary.org/wiki/Wiktionnaire:Structure_des_articles#Liens_int…
I don't know who added this, but it seems wrong. Thanks.
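For illustration, the block in question presumably looks something like
this (the 'en' neighbor entry is a guess; only the 'fr' line, at line 416,
is the one this report asks to delete):

# Sketch of the interwiki sort-order table in families/wiktionary_family.py.
# Deleting the 'fr' entry should make interwiki.py fall back to its default
# ordering for fr.wiktionary.
self.interwiki_putfirst = {
    'en': self.alphabetic,  # illustrative neighbor entry (assumption)
    'fr': self.alphabetic,  # line 416: the entry this report asks to remove
}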
----------------------------------------------------------------------
>Comment By: ulana merops (spacebirdy)
Date: 2008-10-29 18:25
Message:
There will be no response to that, ever...
I suggest removing it, since a fr.wikt user notified me and pointed me to
this page:
http://fr.wiktionary.org/wiki/Wiktionnaire:Structure_des_articles#Liens_int…
It is quite annoying that this issue has not been resolved by the fr.wikt
community; I am unsure which sort order to use.
Thanks for your time.
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-10-20 04:16
Message:
I really don't understand this request; that line has always been there,
and it is used as a convention by all pywikipedia bots running on fr.wikt.
I left a note, in French, on the project page.
----------------------------------------------------------------------
Comment By: ulana merops (spacebirdy)
Date: 2008-10-11 14:02
Message:
Please, I would like to be able to update the bot normally without having
to remove that line every time. Thanks in advance.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2136828&group_…
Revision: 6041
Author: nicdumz
Date: 2008-10-29 04:24:40 +0000 (Wed, 29 Oct 2008)
Log Message:
-----------
yes, query-continue can return ints ... but sometimes also some non-ascii characters :)
Modified Paths:
--------------
branches/rewrite/pywikibot/data/api.py
Modified: branches/rewrite/pywikibot/data/api.py
===================================================================
--- branches/rewrite/pywikibot/data/api.py 2008-10-28 22:15:22 UTC (rev 6040)
+++ branches/rewrite/pywikibot/data/api.py 2008-10-29 04:24:40 UTC (rev 6041)
@@ -409,8 +409,11 @@
                 raise Error("Missing '%s' key in ['query-continue'] value."
                             % self.module)
             update = self.data["query-continue"][self.module]
-            for key in update:
-                self.request[key] = str(update[key])
+            for key, value in update.iteritems():
+                # query-continue can return ints
+                if isinstance(value, int):
+                    value = str(value)
+                self.request[key] = value

     def result(self, data):
         """Process result data as needed for particular subclass."""
Bugs item #2200214, was opened at 2008-10-27 09:12
Message generated for change (Comment added) made by pathoschild
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2200214&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: General
Group: v1.0 (example)
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jesse PW (pathoschild)
Assigned to: Nobody/Anonymous (nobody)
Summary: yo.wikibooks incorrectly listed as obsolete
Initial Comment:
Yo.wikibooks is listed as obsolete, but is open. This can be fixed by removing it from the "self.obsolete" dictionary in pywikipedia/families/wikibooks_family.py on line 318.
Version at time of report:
Pywikipedia [http] trunk/pywikipedia (r6019, Oct 25 2008, 16:16:12)
Python 2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]
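For illustration, a sketch of the dictionary in question (the neighboring
entry is a guess; only the 'yo' line, at line 318, is from this report):

# Sketch of the closed-wikis table in pywikipedia/families/wikibooks_family.py.
# A key mapping to None marks the wiki as closed; deleting the 'yo' entry
# re-enables yo.wikibooks.
self.obsolete = {
    'aa': None,  # illustrative neighbor entry (assumption)
    'yo': None,  # line 318: the entry to remove, since yo.wikibooks is open
}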
----------------------------------------------------------------------
>Comment By: Jesse PW (pathoschild)
Date: 2008-10-28 15:33
Message:
There was a discussion to close it, but it was never done. The message on
the main page is the default for all new wikis.
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-10-28 11:08
Message:
Mmmm, what about
http://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Yo…
then?
And what about
http://yo.wikibooks.org/wiki/Oj%C3%BAew%C3%A9_%C3%80k%E1%BB%8D%CC%81k%E1%BB…
which reads: "This subdomain is reserved for the creation of a Wikibooks
in the Yoruba language. There are currently 4 pages in this Wikibooks."?
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2200214&group_…
Bugs item #2193942, was opened at 2008-10-25 13:10
Message generated for change (Settings changed) made by nicdumz
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2193942&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: category
Group: None
>Status: Pending
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Simone Malacarne (smalacarne)
Assigned to: Nobody/Anonymous (nobody)
Summary: reading category: memory leak and slow down
Initial Comment:
I need to read a very big category (80,000+ articles).
So I just do:

import wikipedia, catlib, pagegenerators

site = wikipedia.getSite()
cat = catlib.Category(site, 'category name')
gen = pagegenerators.PreloadingGenerator(cat.articles(), pageNumber=100)
for page in gen:
    do_something(page)  # placeholder for the actual processing

The problem is that the program uses more and more memory (nearly 2 GB of
RAM by the end). CPU time also grows over time: if the first 10,000
articles are processed in 10 minutes, the second 10,000 take double that
time, and so on; it takes about 20 hours to read all the articles.
If I use:

gen = pagegenerators.CategorizedPageGenerator(cat, recurse=False, start=u'')

instead of PreloadingGenerator, I don't have memory or CPU leaks, but it
reads one article at a time and is very slow (more than 24 hours to finish).
Pywikipedia [http] trunk/pywikipedia (r6015, Oct 24 2008, 18:29:39)
Python 2.5.2 (r252:60911, Oct 5 2008, 19:29:17)
[GCC 4.3.2]
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-10-28 11:30
Message:
Well, guess what? I have no idea why we would need to cache the contents of
a category... I guess someone assumed users would iterate through a
category several times. Does anyone have a serious use case for such
behavior? I might be wrong, but I think you can always reorganize your code
so the generator function is only called once.
Since r6038, the default generator uses a naive content getter which does
not cache anything.
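For instance, a caller that really needs several passes can do the caching
itself by materializing the generator once; a rough sketch (variable names
assumed from the report above):

# Caller-side caching (sketch): crawl the category once, keep the pages
# in a plain list, and reuse that list for every later pass.
pages = list(cat.articles())
for page in pages:
    pass  # first pass over the members
for page in pages:
    pass  # second pass, no new network traffic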
----------------------------------------------------------------------
Comment By: Simone Malacarne (smalacarne)
Date: 2008-10-26 21:43
Message:
I tracked the problem to catlib, in the Category._getContents function.
The function caches something, but with a lot of pages the memory and CPU
use is massive.
I tried commenting out 2 lines in this part:

else:
    print ('not Cached')
    for tag, page in self._parseCategory(purge, startFrom):
        if tag == ARTICLE:
            #self.articleCache.append(page)
            if not page in cache:
                #cache.append(page)
                yield ARTICLE, page

and now everything is fine: memory use stays fixed at about 20-30 MB and
CPU usage is normal.
I don't know what that cache is used for, but it caused me a lot of trouble.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2193942&group_…
Revision: 6038
Author: nicdumz
Date: 2008-10-28 10:23:45 +0000 (Tue, 28 Oct 2008)
Log Message:
-----------
Fix for [ 2193942 ] reading category: memory leak and slow down:
Let's not assume users want to crawl a category several times; make the default behavior NON-CACHING. Why would a user iterate over a category's contents several times anyway?
Modified Paths:
--------------
trunk/pywikipedia/catlib.py
Modified: trunk/pywikipedia/catlib.py
===================================================================
--- trunk/pywikipedia/catlib.py 2008-10-27 20:51:31 UTC (rev 6037)
+++ trunk/pywikipedia/catlib.py 2008-10-28 10:23:45 UTC (rev 6038)
@@ -93,7 +93,7 @@
         else:
             return '[[%s]]' % titleWithSortKey

-    def _getContents(self, recurse=False, purge=False, startFrom=None, cache=None):
+    def _getAndCacheContents(self, recurse=False, purge=False, startFrom=None, cache=None):
         """
         Cache results of _parseCategory for a second call.

@@ -129,7 +129,7 @@
                     # contents of subcategory are cached by calling
                     # this method recursively; therefore, do not cache
                     # them again
-                    for item in subcat._getContents(newrecurse, purge, cache=cache):
+                    for item in subcat._getAndCacheContents(newrecurse, purge, cache=cache):
                         yield item
         else:
             for tag, page in self._parseCategory(purge, startFrom):

@@ -147,11 +147,22 @@
                         # contents of subcategory are cached by calling
                         # this method recursively; therefore, do not cache
                         # them again
-                        for item in page._getContents(newrecurse, purge, cache=cache):
+                        for item in page._getAndCacheContents(newrecurse, purge, cache=cache):
                             yield item
         if not startFrom:
             self.completelyCached = True

+    def _getContentsNaive(self, recurse=False, startFrom=None):
+        """
+        Simple category content yielder. Naive: does not attempt to
+        cache anything.
+        """
+        for tag, page in self._parseCategory(startFrom=startFrom):
+            yield tag, page
+            if tag == SUBCATEGORY and recurse:
+                for item in page._getContentsNaive(recurse=True):
+                    yield item
+
     def _parseCategory(self, purge=False, startFrom=None):
         """
         Yields all articles and subcategories that are in this category.

@@ -259,7 +270,7 @@
             else:
                 break

-    def subcategories(self, recurse=False, startFrom=None):
+    def subcategories(self, recurse=False, startFrom=None, cacheResults=False):
         """
         Yields all subcategories of the current category.

@@ -269,9 +280,18 @@
           equivalent to recurse = False, recurse = 1 gives first-level
           subcategories of subcategories but no deeper, etcetera).

+        cacheResults - cache the category contents: useful if you need to
+          do several passes on the category members list. The simple cache
+          system is *not* meant to be memory- or CPU-efficient for large
+          categories.
+
         Results are sorted (as sorted by MediaWiki), but need not be unique.
         """
-        for tag, subcat in self._getContents(recurse, startFrom=startFrom):
+        if cacheResults:
+            gen = self._getAndCacheContents
+        else:
+            gen = self._getContentsNaive
+        for tag, subcat in gen(recurse=recurse, startFrom=startFrom):
             if tag == SUBCATEGORY:
                 yield subcat

@@ -289,7 +309,7 @@
             subcats.append(cat)
         return unique(subcats)

-    def articles(self, recurse=False, startFrom=None):
+    def articles(self, recurse=False, startFrom=None, cacheResults=False):
         """
         Yields all articles of the current category.

@@ -297,10 +317,19 @@
         Recurse can be a number to restrict the depth at which subcategories
         are included.

+        cacheResults - cache the category contents: useful if you need to
+          do several passes on the category members list. The simple cache
+          system is *not* meant to be memory- or CPU-efficient for large
+          categories.
+
         Results are unsorted (except as sorted by MediaWiki), and need not
         be unique.
         """
-        for tag, page in self._getContents(recurse, startFrom=startFrom):
+        if cacheResults:
+            gen = self._getAndCacheContents
+        else:
+            gen = self._getContentsNaive
+        for tag, page in gen(recurse=recurse, startFrom=startFrom):
             if tag == ARTICLE:
                 yield page

@@ -342,7 +371,7 @@
     def isEmpty(self):
         # TODO: rename; naming conflict with Page.isEmpty
-        for tag, title in self._getContents(purge = True):
+        for tag, title in self._parseCategory():
             return False
         return True
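A brief usage sketch of the two modes introduced by this commit (the
category name is illustrative):

import wikipedia, catlib

site = wikipedia.getSite()
cat = catlib.Category(site, 'Example category')

# Default after r6038: naive, constant-memory, single-pass iteration.
for page in cat.articles(recurse=True):
    pass  # process each page once

# Opt back into caching when several passes over the members are needed;
# per the docstring, not meant to be efficient for very large categories.
for page in cat.articles(recurse=True, cacheResults=True):
    pass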
Patches item #2192349, was opened at 2008-10-24 19:20
Message generated for change (Settings changed) made by wikipedian
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=2192349&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Translations
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Private: No
Submitted By: Christoffer (roboticuskhan)
Assigned to: Nobody/Anonymous (nobody)
Summary: Faroese (fo) translations for interwiki.py
Initial Comment:
I have submitted a patch file containing the Faroese translations for interwiki.py provided by User:Quackor (http://fo.wikipedia.org/wiki/Br%C3%BAkari:Quackor), a native speaker of Faroese. Thanks.
----------------------------------------------------------------------
Comment By: Christoffer (roboticuskhan)
Date: 2008-10-24 19:23
Message:
I forgot; I got the translations here:
http://fo.wikipedia.org/wiki/Wikipedia_kjak:%C3%81heitan_um_bott_st%C3%B8%C…
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=2192349&group_…