I have written a Perl script that parses the SQL dumps (cur & old) and generates an HTML file containing, for each month since the project started:
1 = number of Wikipedians who contributed at least 10 edits
2 = increase in (1) in the past month
3 = Wikipedians who contributed > 5 edits in the past month, the same for > 100
4 = total number of articles according to the new (link) counting system
5 = mean number of revisions per article
6 = mean size of an article in bytes
7 = total edits in the past month
8 = combined size of all articles
9 = total number of links
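For illustration only, here is a stripped-down sketch of the monthly tallying, not the real script: it assumes the revision timestamps appear in the dump as quoted 14-digit YYYYMMDDHHMMSS strings (the MediaWiki timestamp format) and simply counts them per month:

#!/usr/bin/perl
# Sketch: tally revisions per month from an uncompressed SQL dump.
# Every quoted 14-digit timestamp in an INSERT statement counts as one edit;
# the real script parses the full row layout and computes many more figures.
use strict;
use warnings;

my %edits_per_month;

while (my $line = <>) {
    while ($line =~ /'(\d{4})(\d{2})\d{8}'/g) {    # 'YYYYMMDDHHMMSS'
        $edits_per_month{"$1-$2"}++;
    }
}

for my $month (sort keys %edits_per_month) {
    print "$month: $edits_per_month{$month} edits\n";
}

Run it as e.g. "perl tally.pl old.sql".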
The reports also list the most active Wikipedians, ordered by number of edits.
Please note that the script produces historical growth figures per Wikipedia based on the >> new (link) counting system << right from the first month.
See results for de:, fr:, nl:, and eo: (and the Perl script itself) at http://members.chello.nl/epzachte/Wikipedia/Statistics
A. Feedback is appreciated
B. I propose to run this script weekly on the new SQL dumps for all Wikipedias and put the resulting HTML files in a public folder.
C. I'd like to test the script with the huge English SQL dumps, but I can't download a 1600 MB file without transmission errors. Could someone please split the file into 50 MB chunks, generate MD5 checksums, put it all in a (public!) temp folder, and inform me? Something along the lines of the sketch below would do. Thanks! xxx@chello.nl (xxx=epzachte)
D. Open issues: Unicode support, possibly further optimization for the English version (e.g. the sort)
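To be concrete about the splitting in C: something along these lines would do (just a sketch; the file names are examples, and it uses Perl's core Digest::MD5 module, but any split + md5sum approach is fine):

#!/usr/bin/perl
# Sketch: cut a big dump into 50 MB chunks and print an MD5 sum per chunk,
# so every piece can be verified after download. File names are examples.
use strict;
use warnings;
use Digest::MD5;

my $infile     = shift or die "usage: $0 dump.sql\n";
my $chunk_size = 50 * 1024 * 1024;                  # 50 MB per chunk

open my $in, '<', $infile or die "cannot open $infile: $!";
binmode $in;

my $part = 0;
while (read($in, my $buf, $chunk_size)) {
    my $outfile = sprintf "%s.part%03d", $infile, $part++;
    open my $out, '>', $outfile or die "cannot write $outfile: $!";
    binmode $out;
    print $out $buf;
    close $out;
    # same output format as md5sum, so the chunks can be checked with "md5sum -c"
    printf "%s  %s\n", Digest::MD5->new->add($buf)->hexdigest, $outfile;
}
close $in;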
Erik Zachte
----
Ad B. The English version will run for a while, but at least it will not tie up the live database. I expect it will take under an hour; the German files (650 MB) were processed in 4.5 minutes on my 1.2 GHz PC.
Ad C. As a test I downloaded the same 100 MB TomeRaider file eight times; the checksum failed four times. FTP is too unreliable for huge files.
Erik Z.
Erik Zachte wrote:
I have written a Perl script that parses the SQL dumps (cur & old) and generates an HTML file containing, for each month since the project started:
Nice work, and all without using an actual database. Impressive. I have split up the English OLD tarball into 50 MB files and put them up at
http://pliny.wikipedia.org/tarballs/tmp/
together with the md5sums. Please let me know when you have downloaded them so I can remove the files.
However, I find it a bit strange that you get these transmission errors. Normally TCP/IP should take care of that, no? Have you experienced this before?
Regards,
Erik
Erik Zachte wrote:
I'd like to test the script with the huge English SQL dumps, but I can't download a 1600 MB file without transmission errors.
You should use a download manager that can resume your download in case of an error. Try, for example, http://www.getright.com/ (but that's just an example; I don't favour any particular one).
Timwi
Erik Zachte wrote:
I have written a Perl script that parses the SQL dumps (cur & old) and generates an HTML file containing, for each month since the project started.
Thanks a lot! I prefer developing features like this outside the server anyway.
I wrote a German translation; feel free to modify it :-) http://meta.wikipedia.org/wiki/Statistics_script
Jakob