Hi!
I'd like to do some statistics based on the database dumps, so I'd need a lot more disk quota (50GB would be fine, or at least 2GB for the cur-dumps :-). It's better to mirror the dump files in only one place, so I created the directory
/u01/u/voj/dumps
that is also accessible via
http://tools.wikimedia.de/~voj/dumps/
Unfortunately I have not been able to add a header and footer to the directory listing - I made .htaccess and the readme readable and writable for the whole users group - maybe someone can help? We can also move the dumps directory to another location.
Since analysing the dumps can generate a huge amount of data, I wrote a lot of scripts that pipe the data from one task to another. Here is a first example:
/u01/u/voj/dumps/tools/catlinks.pl wp eo 20051009 | wc -l
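Assuming catlinks.pl prints one tab-separated category link per line (just a guess based on the wc -l above - adjust the cut field to the real output), further tools can be chained on without writing intermediate files, for example to list the most frequently used categories:
/u01/u/voj/dumps/tools/catlinks.pl wp eo 20051009 | cut -f2 | sort | uniq -c | sort -rn | head -20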
By the way, it's much faster to bulk-import tab-separated data into a MySQL database than to use INSERT statements.
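For example (the database, table and file names below are only placeholders), a tab-separated file can be loaded in one go with mysqlimport or LOAD DATA, both of which expect tab-separated columns by default:
mysqlimport --local mydb catlinks.txt
mysql mydb -e "LOAD DATA LOCAL INFILE 'catlinks.txt' INTO TABLE catlinks"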
Greetings, Jakob
On Fri, Oct 14, 2005 at 02:14:59PM +0200, Jakob Voss wrote:
I'd like to do some statistics based on the database dumps, so I'd need a lot more disk quota (50GB would be fine, or at least 2GB for the cur-dumps :-)
i've increased your quota to 52428800 blocks (50Gbytes).
It's better to mirror the dump files in only one place, so I created the directory
/u01/u/voj/dumps
yes, it would make sense to only have one copy of the dumps. since yours is already there, would you like to become the official dump copier? ;-)
Unfortunately I have not been able to add a header and footer to the directory listing - I made .htaccess and the readme readable and writable for the whole users group - maybe someone can help? We can also move the dumps directory to another location.
what particular problem did you have with it? you can see my .htaccess at /u01/u/kate/public_html/.htaccess which seems to work okay.
you probably do need it to be world-readable as well - Apache isn't in the users group when it's generating directory listings.
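for reference, the usual mod_autoindex way to add a header and footer to the listing (HEADER.html and README.html are just example names, the directives can point at any file):
HeaderName HEADER.html
ReadmeName README.html
and to make sure Apache can read the files:
chmod o+r .htaccess HEADER.html README.html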
btw, i was using /u01/wikipedia/dumps/ before. i've made that writable by users, if you feel like moving them.
kate.
Kate Turner wrote:
I'd like to do some statistics based on the database dumps, so I'd need a lot more disk quota (50GB would be fine, or at least 2GB for the cur-dumps :-)
i've increased your quota to 52428800 blocks (50Gbytes).
It's better to mirror the dump files in only one place, so I created the directory
/u01/u/voj/dumps
yes, it would make sense to only have one copy of the dumps. since yours is already there, would you like to become the official dump copier? ;-)
At the moment I don't have time to write and test scripts to copy dumps on demand or via a cron job. For now it's probably enough to collect all copied dumps in a consistent layout in the same place, so if you want a dump you just change to the corresponding directory and wget it.
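For example (the path below is only an illustration - see the directory listing for the actual file names):
wget http://tools.wikimedia.de/~voj/dumps/<project>/<dumpfile>.sql.gz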
what particular problem did you have with it? you can see my .htaccess at /u01/u/kate/public_html/.htaccess which seems to work okay.
Thanks - now it works. And you can browse http://tools.wikimedia.de/~voj/dumps/
Please don't uncompress the files if you don't have to. For instance, you can read file.gz like this:
gzip -dc file.gz |
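Any script that reads from standard input then works directly on the compressed file, for example to count the lines of a dump without writing an uncompressed copy to disk:
gzip -dc file.gz | wc -l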
btw, i was using /u01/wikipedia/dumps/ before. i've made that writable by users, if you feel like moving them.
I renamed the files and sorted them into /u01/u/voj/dumps/. If you want, you can also move everything somewhere else and make it accessible via
http://tools.wikimedia.de/dumps/
The filenames in .htaccess need to be changed when moving.
It's a pity that I did not have this server two months ago. I have almost finished my master's thesis with Wikipedia statistics on smaller portions of the dumps, and now I could analyse all of it!
Greetings, Jakob