Thanks to superb work by Erik Garrison, we now have an efficient,
C-based parser that extracts header data from WMF xml dumps into csv files
readable by standard statistical software packages.
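To make the extraction step concrete, here is a minimal sketch of the kind of work such a parser does: stream through a dump, pick out per-page header fields, and write one csv row per page. This is illustrative only — the actual parser is Erik's C program, and the inline XML and field names (`title`, `id`, `timestamp`) are simplified stand-ins for the real dump schema.

```python
# Illustrative sketch only: the production parser is written in C.
# The XML below is a simplified stand-in for a WMF dump; the field
# names are assumptions for illustration, not the real schema.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <id>42</id>
    <revision>
      <id>1001</id>
      <timestamp>2006-09-01T12:00:00Z</timestamp>
    </revision>
  </page>
</mediawiki>"""

def dump_to_csv(xml_stream, csv_stream):
    """Stream-parse page headers from a dump and emit csv rows."""
    writer = csv.writer(csv_stream)
    writer.writerow(["page_id", "title", "rev_id", "rev_timestamp"])
    for _, elem in ET.iterparse(xml_stream):
        if elem.tag != "page":
            continue
        rev = elem.find("revision")
        writer.writerow([
            elem.findtext("id"),
            elem.findtext("title"),
            rev.findtext("id"),
            rev.findtext("timestamp"),
        ])
        elem.clear()  # release processed pages -- real dumps are huge

out = io.StringIO()
dump_to_csv(io.StringIO(SAMPLE_DUMP), out)
print(out.getvalue())
```

Streaming (rather than loading the whole tree) matters because full dumps are far too large to hold in memory; the C parser presumably takes the same incremental approach.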
* Source for this parser will soon be web-available; stay tuned.
* The csv files will also be available online, either from the WMF
servers (if the parser can be run there) or from a webserver on karma
or at NBER (see below).
* If you just can't wait, let us know and we'll offer express service :)
* The csv files consist of these variables with these types:
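Once the variable list is settled, loading a file into an analysis session is straightforward. The column names and types below are placeholders — the actual variable list is not reproduced in this note, so substitute the real schema when using this sketch.

```python
# Sketch of loading one of the csv files for analysis.
# Column names and types here are placeholders, not the real schema.
import csv
import io

SAMPLE = "page_id,title,rev_timestamp\n42,Example,2006-09-01T12:00:00Z\n"

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    row["page_id"] = int(row["page_id"])  # cast numeric fields explicitly
    rows.append(row)

print(rows[0]["page_id"], rows[0]["title"])
```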
 We have begun to use these csv files to produce weekly sets of statistics.
See last week's work here:
This week we will finish out that set of stats.
Next week's list needs your creative suggestions: Please edit directly!
 NBER has set us up with a pretty good Linux box, wikiq.nber.org, running
Fedora Core 5. We hope to have Xen instances available within two weeks for
researchers interested in doing statistical analysis on the csv files.
 WMF readership data continues to be irretrievably lost. What can we do
to begin saving at least some of it as soon as possible? If we were to save
only the articleid for one of every hundred squid requests, and include some
indicator in the file at the end of each day, privacy concerns and
computational burdens would both be minimized, and the result would still be
a great resource for research. How can we make this happen?
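The 1-in-100 scheme proposed above can be sketched in a few lines. How requests and articleids would actually reach a logger on the squid side is exactly the open question, so the request stream here is a placeholder; the sketch just shows the sampling plus the proposed end-of-day indicator. It keeps every hundredth request deterministically — random sampling at rate 1/100 would work equally well.

```python
# Sketch of the proposed sampling: keep one of every hundred squid
# requests (articleid only) and mark the end of each day in the file.
# The request stream is a placeholder -- the squid-side hookup is the
# open question posed above.
SAMPLE_RATE = 100

def log_requests(articleids, out):
    """Append one of every SAMPLE_RATE articleids to the log."""
    for i, articleid in enumerate(articleids):
        if i % SAMPLE_RATE == 0:
            out.append(str(articleid))

def end_of_day(out, date):
    # the proposed per-day indicator line
    out.append(f"#EOD {date}")

log = []
log_requests(range(10000), log)   # stand-in for a day's requests
end_of_day(log, "2006-09-01")
print(len(log), "lines; last:", log[-1])
```

At this rate a day's log stays small (here, 100 sampled lines plus one marker for 10,000 requests), which is what keeps both the privacy exposure and the computational burden low.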