I was taking a look at our dumps in user-store and none of them are compressed, and I was shocked by that. I know a lot of people use pywikipedia to parse the dumps, and I know it can handle bz2 files. Is there any reason we don't just make them all bz2?
John
Yep, if we can use compressed dumps, we would use far fewer resources than we do now (currently >800GB).
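For what it's worth, a tool does not need the uncompressed file on disk to work with a bz2 dump. Below is a minimal sketch in plain Python, using only the standard `bz2` module, of reading a compressed dump line by line. The names `count_pages` and `make_sample_dump` and the tiny sample file are hypothetical and only there for illustration; this is not how pywikipedia does it internally.

```python
import bz2
import os
import tempfile


def count_pages(dump_path):
    """Count lines containing '<page>' in a bz2-compressed XML dump.

    The file is decompressed on the fly while iterating, so no
    uncompressed copy is ever written to disk.
    """
    pages = 0
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            if "<page>" in line:
                pages += 1
    return pages


def make_sample_dump(directory):
    """Write a tiny made-up dump (3 pages) so the sketch can be run."""
    path = os.path.join(directory, "sample-dump.xml.bz2")
    page = "  <page>\n    <title>Example</title>\n  </page>\n"
    xml = "<mediawiki>\n" + page * 3 + "</mediawiki>\n"
    with bz2.open(path, mode="wt", encoding="utf-8") as out:
        out.write(xml)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = make_sample_dump(tmp)
        print(count_pages(sample))
```

Counting `<page>` lines is only a stand-in for real parsing; the point is that the reader sees ordinary text lines while the file on disk stays compressed.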
On Sun, May 6, 2012 at 12:24 AM, John phoenixoverride@gmail.com wrote:
> [...] Is there any reason we don't just make them all bz2?
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
We don't only use PWB. Danny_B did some work recently so that all dumps are in the same place, etc.; duplicates were what consumed the most space. This was supposed to be documented at https://wiki.toolserver.org/view/User-store but it never was. Most of the space is taken by view stats, whose compression is discussed here: https://wiki.toolserver.org/view/Talk:User-store
Nemo
There is some basic documentation in "readme" files in /mnt/usr-store[/dumps[/...]] though.
However, thanks for reminding me about the wiki documentation; I'll try to put it up ASAP.
Danny B.