[1] Thanks to superb work by Erik Garrison, we now have an efficient, C-based parser that extracts header data from WMF XML dumps into CSV files readable by standard statistical software packages.
* Source for this parser will soon be web-available; stay tuned.
* The CSV files will also be available online, either from download.wikimedia.org (if the parser can be run on the WMF servers) or from a webserver on karma or at NBER (see below).
* If you just can't wait, let us know and we'll offer express service :)
* The CSV files consist of these variables with these types:
  names: title, articleid, revid, date, time, anon, editor, editorid, minor
  types: str, int, int, str, str, [0/1], str, int, [0/1]
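For anyone who wants to poke at the files right away, here is a minimal Python sketch of reading one of them, assuming a header row with the column names above; the filename enwiki_revisions.csv and the guess that editorid may be blank for anonymous edits are placeholders, not guarantees:

    import csv

    def read_revisions(path):
        # Columns, per the list above:
        # title, articleid, revid, date, time, anon, editor, editorid, minor
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                row['articleid'] = int(row['articleid'])
                row['revid'] = int(row['revid'])
                # Assumption: editorid may be blank for anonymous edits.
                row['editorid'] = int(row['editorid']) if row['editorid'] else None
                row['anon'] = row['anon'] == '1'    # [0/1] flag
                row['minor'] = row['minor'] == '1'  # [0/1] flag
                yield row

    if __name__ == '__main__':
        for rev in read_revisions('enwiki_revisions.csv'):
            print(rev['title'], rev['date'], rev['time'], rev['editor'])
            break  # just show the first row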
[2] We have begun to use these CSV files to produce weekly sets of statistics. See last week's work here:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Stats2006...
This week we will finish out that set of stats. Next week's list needs your creative suggestions; please edit it directly:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Stats2006...
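To make the suggestion box above a little more concrete, here is a rough Python sketch of the kind of weekly statistic these CSV files support: edits, distinct registered editors, and the anonymous share per ISO week. The YYYY-MM-DD date format and the filename are assumptions:

    import csv
    from collections import defaultdict
    from datetime import date

    def weekly_stats(path):
        edits = defaultdict(int)
        anon = defaultdict(int)
        editors = defaultdict(set)
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                # Assumption: date is YYYY-MM-DD.
                y, m, d = (int(x) for x in row['date'].split('-'))
                week = date(y, m, d).isocalendar()[:2]  # (ISO year, ISO week)
                edits[week] += 1
                if row['anon'] == '1':
                    anon[week] += 1
                else:
                    editors[week].add(row['editorid'])
        for week in sorted(edits):
            print(week, edits[week], len(editors[week]),
                  round(anon[week] / edits[week], 3))

    if __name__ == '__main__':
        weekly_stats('enwiki_revisions.csv')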
[3] NBER has set us up with a pretty good Linux box, wikiq.nber.org, running Fedora Core 5. Within two weeks, we hope to have Xen instances available for researchers interested in doing statistical analysis on the CSV files.
[4] WMF readership data continues to be irretrievably lost. What can we do to begin saving at least some of it as soon as possible? If we were to save only the articleid for one of every hundred squid requests, and write a day-boundary marker to the file at the end of each day, privacy concerns and the computational burden would be minimal, and this would still be a great start. How can we make this happen?
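One possible shape for that, sketched in Python as a filter over squid log lines on stdin: keep roughly one request in a hundred, record only the page identifier, and write a marker line whenever the day rolls over. The field positions assume squid's native log format, and the raw URL stands in for the articleid because the title-to-id lookup is omitted; both are assumptions to be adjusted against the real setup:

    import random
    import sys
    import time

    SAMPLE_RATE = 100  # keep roughly one request in a hundred

    def day_of(logline):
        # Assumption: squid's native format starts each line with a unix timestamp.
        return time.strftime('%Y-%m-%d', time.gmtime(float(logline.split(None, 1)[0])))

    def page_of(logline):
        # Assumption: the requested URL is the 7th field of the native format.
        # The URL stands in for the articleid; mapping it to an articleid
        # needs a title-to-id lookup that this sketch omits.
        fields = logline.split()
        return fields[6] if len(fields) > 6 else None

    def main():
        current_day = None
        for line in sys.stdin:
            try:
                day = day_of(line)
            except (ValueError, IndexError):
                continue  # skip malformed lines
            if current_day is not None and day != current_day:
                sys.stdout.write('# end of day %s\n' % current_day)  # daily marker
            current_day = day
            if random.randrange(SAMPLE_RATE) != 0:
                continue  # drop 99 of every 100 requests
            page = page_of(line)
            if page is not None:
                sys.stdout.write(page + '\n')

    if __name__ == '__main__':
        main()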
Best,
Jeremy