nicdumz@svn.wikimedia.org wrote:
Revision: 6756
Author: nicdumz
Date: 2009-04-30 01:47:36 +0000 (Thu, 30 Apr 2009)
Log Message:
Adding an experimental contents_on_disk feature: save the Page contents on disk, in a python shelf, and load them only when needed, instead of loading the contents in RAM.
Activating this option might slow the whole interwiki process down a bit: fetching an entry from disk is slower than simply reading the attribute from RAM. However, this should greatly reduce memory consumption.
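Roughly, the idea is the following (a minimal sketch only; the class and method names below are invented for illustration and do not match the committed code):

import shelve

class DiskBackedPage(object):
    # One shared shelf keeps the wikitext of every page on disk.
    _store = shelve.open('pagestore')

    def __init__(self, title):
        # Only the (small) title stays in RAM.
        self.title = title

    def set_contents(self, text):
        # Write the possibly large wikitext to the shelf instead of keeping
        # it as an attribute. Shelve keys must be plain strings in Python 2,
        # hence the encode().
        self._store[self.title.encode('utf-8')] = text

    def get_contents(self):
        # Load from disk only when the text is actually needed.
        return self._store[self.title.encode('utf-8')]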
[...]
Modified: trunk/pywikipedia/interwiki.py
[...]
 # (C) Rob W.W. Hooft, 2003
 # (C) Daniel Herding, 2004
 # (C) Yuri Astrakhan, 2005-2006
+# (C) Pywikipedia bot team, 2007-2009
I think you should put your name instead of a generic "Pywikipedia bot team" copyright statement. A comment from the original authors would be preferable, though.
index = 1
while True:
    path = config.datafilepath('cache', 'pagestore' + str(index))
    if not os.path.exists(path): break
    index += 1
At least, this approach looks nice for the diskcache module too, so we could easily get rid of the imported random module and the ugly '*-abfdexjwi'-like filenames.
It's also not necessary to set these lines as a Subject destructor:
2009/5/1 Francesco Cosoleto cosoleto@gmail.com:
I think you should put your name instead of a generic "Pywikipedia bot team" copyright statement. A comment from the original authors would be preferable, though.
Well, I just wanted to update the date, and I thought a generic statement was better: in fact... why would I put my name, knowing that purodha made some important fixes to the file during those years?
Note that I'm very flexible on those attribution sections. Any suggestion is welcome, and is likely to be fine with me.
index = 1
while True:
    path = config.datafilepath('cache', 'pagestore' + str(index))
    if not os.path.exists(path): break
    index += 1
At least, this approach looks nice for the diskcache module too, so we could easily get rid of the imported random module and the ugly '*-abfdexjwi'-like filenames.
Thinking again about this: those files are temporary, and are only accessed from one specific entry point. A tempfile would be even cleaner, right? ( http://docs.python.org/library/tempfile.html , standard since 2.3 ) I think I could do this for both diskcache and interwiki, and remove the cache/ directory. Comments?
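Something along these lines, for instance (only a sketch of the tempfile idea, not tested against the current code):

import os
import shelve
import tempfile

# Let tempfile pick a unique directory and let shelve create its own
# file(s) inside it; no need to probe cache/pagestoreN or to build
# random suffixes by hand.
tmpdir = tempfile.mkdtemp(prefix='pywikipedia-pagestore-')
store = shelve.open(os.path.join(tmpdir, 'pagestore'))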
Speaking of diskcache: I wondered if a simple Shelf ( http://docs.python.org/library/shelve.html ) wouldn't be faster than diskcache. Shelf is built on low-level dbm backends, with a different implementation for each system family. Naturally I would expect Shelf to be faster and more appropriate than our custom-made module, but it might be too generic and introduce unnecessary overhead?
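If we want numbers, a rough micro-benchmark of the shelve side could look like the following (the diskcache side would need its own timing against the module's real interface, which I am not reproducing here):

import os
import shelve
import tempfile
import time

db = shelve.open(os.path.join(tempfile.mkdtemp(), 'bench'))
payload = 'x' * 2000          # roughly the size of a cached message

start = time.time()
for i in range(10000):
    db['key%d' % i] = payload
db.sync()
print 'write: %.3fs' % (time.time() - start)

start = time.time()
for i in range(10000):
    dummy = db['key%d' % i]
print 'read: %.3fs' % (time.time() - start)
db.close()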
It's also not necessary to set these lines as a Subject destructor:
fixed, thanks :)
Nicolas Dumazet wrote:
Well, I just wanted to update the date, and I thought a generic statement was better: in fact... why would I put my name, knowing that purodha made some important fixes to the file during those years?
Note that I'm very flexible on those attribution sections. Any suggestion is welcome, and is likely to be fine with me.
Forget it. I cannot talk about changes I haven't seen.
index = 1
while True:
    path = config.datafilepath('cache', 'pagestore' + str(index))
    if not os.path.exists(path): break
    index += 1
At least, this approach looks nice for the diskcache module too, so we could easily get rid of the imported random module and the ugly '*-abfdexjwi'-like filenames.
Thinking again about this: those files are temporary, and are only accessed from one specific entry point. A tempfile would be even cleaner, right? ( http://docs.python.org/library/tempfile.html , standard since 2.3 ) I think I could do this for both diskcache and interwiki, and remove the cache/ directory. Comments?
It would be preferable to create a single file, instead of adding a new file for each separate but identical Site and repeating the same download within a relatively short time... working similarly to a web browser cache.
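Roughly what I have in mind (only a sketch; the key layout and the site.family.name / site.lang attributes are assumptions on my part):

import shelve

class SharedMessageCache(object):
    # One shelf shared by every Site, keyed by family, language and
    # message name, so identical sites reuse the same downloaded entry.
    def __init__(self, path):
        self._db = shelve.open(path)

    def _key(self, site, name):
        return '%s:%s:%s' % (site.family.name, site.lang, name)

    def get(self, site, name, fetch):
        # 'fetch' is a caller-supplied function that downloads the message.
        key = self._key(site, name)
        if key not in self._db:
            # Download once, then serve from the cache afterwards.
            self._db[key] = fetch(name)
        return self._db[key]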
You can use tempfile in the current implementation, but the "cache" directory is also used by featured.py, instead of a "featured" one (r5536). Maybe it's better to keep it, as it's a common name. For example, some of my external scripts use it, and maybe more scripts will do so in the future.
Speaking of diskcache: I wondered if a simple Shelf ( http://docs.python.org/library/shelve.html ) wouldn't be faster than diskcache. Shelf is built on low-level dbm backends, with a different implementation for each system family. Naturally I would expect Shelf to be faster and more appropriate than our custom-made module, but it might be too generic and introduce unnecessary overhead?
I am not sure it is worth replacing it with shelve here; probably not, if your aim is to speed up the code.
I have always asked myself why we adopted this solution, because I doubt that the mediawiki messages the bot actually uses require that much RAM. I think a list of items not to discard would have been simpler, although I have really appreciated this more sophisticated solution.
Francesco Cosoleto wrote:
I have always asked myself why we adopted this solution, because I doubt that the mediawiki messages the bot actually uses require that much RAM. I think a list of items not to discard would have been simpler, although I have really appreciated this more sophisticated solution.
It requires about 1-2 KB per site on the wikipedia family, and this family has a total of 255 sites, so roughly 255-510 KB in total. That looks like acceptable memory usage to me (recently I reduced the memory requested by the wikipedia module by about 60 KB with r6751, if I remember rightly...). And with diskcache enabled, 50 MB or more of disk space are wasted (software should use temporary files only if really needed).
Simple test script:
grep --exclude-dir=.svn -rohP "\.mediawiki_message\s*\(\s*['\"][^)]+\)" ./ | sort | uniq | sed -e "s/^/ sum += len(site/" -e "s/$/)/" -e 1i"import wikipedia\nsum = 0\nfor lang in wikipedia.Site('en', 'wikipedia').languages():\n site = wikipedia.getSite(lang, 'wikipedia')" -e '$a\ print sum' > mwmsg_length.py
(I had to disable the 'sp-contributions-older' line to run it, as it raises an exception on wikipedia:gv.)
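For reference, the generated mwmsg_length.py ends up looking roughly like this; the two message names are only examples of what the grep collects, the one-space indentation comes from the sed expressions, and 'sum' shadows the builtin, which is harmless here:

import wikipedia
sum = 0
for lang in wikipedia.Site('en', 'wikipedia').languages():
 site = wikipedia.getSite(lang, 'wikipedia')
 sum += len(site.mediawiki_message('disambiguationspage'))
 sum += len(site.mediawiki_message('history_short'))
 print sum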