Thanks to superb work by Erik Garrison, we now have an efficient,
C-based parser that extracts header data from WMF xml dumps into csv files
readable by standard statistical software packages.
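To make the extraction step concrete, here is a minimal sketch of the kind of work such a parser does: stream through a dump, pick out per-page header fields, and write one csv row per page. This is illustrative only — the actual parser is Erik's C program, and the inline XML and field names (`title`, `id`, `timestamp`) are simplified stand-ins for the real dump schema.

```python
# Illustrative sketch only: the production parser is written in C.
# The XML below is a simplified stand-in for a WMF dump; the field
# names are assumptions for illustration, not the real schema.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <id>42</id>
    <revision>
      <id>1001</id>
      <timestamp>2006-09-01T12:00:00Z</timestamp>
    </revision>
  </page>
</mediawiki>"""

def dump_to_csv(xml_stream, csv_stream):
    """Stream-parse page headers from a dump and emit csv rows."""
    writer = csv.writer(csv_stream)
    writer.writerow(["page_id", "title", "rev_id", "rev_timestamp"])
    for _, elem in ET.iterparse(xml_stream):
        if elem.tag != "page":
            continue
        rev = elem.find("revision")
        writer.writerow([
            elem.findtext("id"),
            elem.findtext("title"),
            rev.findtext("id"),
            rev.findtext("timestamp"),
        ])
        elem.clear()  # release processed pages -- real dumps are huge

out = io.StringIO()
dump_to_csv(io.StringIO(SAMPLE_DUMP), out)
print(out.getvalue())
```

Streaming (rather than loading the whole tree) matters because full dumps are far too large to hold in memory; the C parser presumably takes the same incremental approach.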
* Source for this parser will soon be web-available; stay tuned.
* The csv files will also be available online, either from the WMF
servers (if the parser can be run there) or from a webserver on karma
or at NBER (see below).
* If you just can't wait, let us know and we'll offer express service :)
* The csv files consist of these variables with these types:
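Once the variable list is settled, loading a file into an analysis session is straightforward. The column names and types below are placeholders — the actual variable list is not reproduced in this note, so substitute the real schema when using this sketch.

```python
# Sketch of loading one of the csv files for analysis.
# Column names and types here are placeholders, not the real schema.
import csv
import io

SAMPLE = "page_id,title,rev_timestamp\n42,Example,2006-09-01T12:00:00Z\n"

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    row["page_id"] = int(row["page_id"])  # cast numeric fields explicitly
    rows.append(row)

print(rows[0]["page_id"], rows[0]["title"])
```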
 We have begun to use these csv files to produce weekly sets of statistics.
See last week's work here:
This week we will finish out that set of stats.
Next week's list needs your creative suggestions: Please edit directly!
 NBER has set us up with a pretty good Linux box, wikiq.nber.org, running
Fedora Core 5. We hope to have Xen instances available within two weeks for
researchers interested in doing statistical analysis on the csv files.
 WMF readership data continues to be irretrievably lost. What can we do
to begin saving at least some of it as soon as possible? If we were to save
only the articleid for one of every hundred squid requests, and include some
indicator in the file at the end of each day, privacy concerns and
computational burdens would both be minimized, and the result would still be
a great resource for research. How can we make this happen?
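The 1-in-100 scheme proposed above can be sketched in a few lines. How requests and articleids would actually reach a logger on the squid side is exactly the open question, so the request stream here is a placeholder; the sketch just shows the sampling plus the proposed end-of-day indicator. It keeps every hundredth request deterministically — random sampling at rate 1/100 would work equally well.

```python
# Sketch of the proposed sampling: keep one of every hundred squid
# requests (articleid only) and mark the end of each day in the file.
# The request stream is a placeholder -- the squid-side hookup is the
# open question posed above.
SAMPLE_RATE = 100

def log_requests(articleids, out):
    """Append one of every SAMPLE_RATE articleids to the log."""
    for i, articleid in enumerate(articleids):
        if i % SAMPLE_RATE == 0:
            out.append(str(articleid))

def end_of_day(out, date):
    # the proposed per-day indicator line
    out.append(f"#EOD {date}")

log = []
log_requests(range(10000), log)   # stand-in for a day's requests
end_of_day(log, "2006-09-01")
print(len(log), "lines; last:", log[-1])
```

At this rate a day's log stays small (here, 100 sampled lines plus one marker for 10,000 requests), which is what keeps both the privacy exposure and the computational burden low.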