This mail (including pictures) was sent to attendees of Wikimania 2006 and some others who recently showed active interest in quantitative research. Crossposting here. I hope you will find at least something in this mail that is to your liking.
Wikimania 2006 was, like its predecessor in Frankfurt, a source of inspiration. Several official and impromptu meetings were held related to research and quantitative analysis. At a conference with six parallel sessions one has to make difficult choices, and for me it was impossible to attend several highly interesting research meetings.
----------------------------------------------------------------------
Wikimedia Research
I am very much looking forward to a transcript, or at least speaker notes and/or personal observations, of several presentations. Foremost among them is James' 'Research about Wikimedia: A workshop' [1].
I also hope that James, as Chief Research Officer, could give us a sense of direction and timing: the mission of the Wikimedia Research Network [2] is lofty and the number of Wikimedians who subscribed is large, but the current status for most activities seems to be 'idle' [3] [4]? Also, is there any coordination with external research groups, like those mentioned on [5] and elsewhere [6]?
Would it be useful to divide Wikimedia Research Network activities into (A) Quantitative Analysis, (B) Social Research Collaborations [7] and (C) Other Activities, and coordinate these separately?
C would still cover more than 50% of the WRN mission statement, e.g.: identify the needs of the individual Wikimedia projects, make recommendations for targeted development, guide and motivate outside developers, assist in the study of new project proposals.
I expect most social science sessions at Wikimania [8] presented relevant material and either used or added to quantitative research. So there is synergy between A and B.
[1] http://wikimania2006.wikimedia.org/wiki/Proceedings:JF1 [2] http://meta.wikimedia.org/wiki/Wikimedia_Research_Network [3] http://meta.wikimedia.org/wiki/Category:Research_Team [4] http://meta.wikimedia.org/wiki/Research/Research_Projects [5] http://meta.wikimedia.org/wiki/Research [6] http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia [7] http://meta.wikimedia.org/wiki/Research/Social_Research_Collaborations [8] http://wikimania2006.wikimedia.org/wiki/Category:Wiki_Social_Science
----------------------------------------------------------------------
Communication
There was no IRC meeting of the Research Team after December 2005. There are pretty active Wikimedia researchers outside the team though. For me Wikimania 2006 confirmed that more exchange of ideas would be helpful.
I'm not sure more IRC discussions are a panacea. Personally I prefer discussion via wiki and mailing list: it is less spontaneous, but one can more easily formulate a coherent proposal or comment on it in a thoughtful manner, and, no less important, it is much easier to follow for others who read the discussion later.
Part of the information flow is now on meta, some of it on the research mailing list [8] (which is largely dormant [9], though recent posts are very useful), and some of it on the freelogy list [10] and probably elsewhere.
What about making the Wikimedia research list the central forum for all broad and conceptual discussions and linking from there to meta for detailed discussions? I will post this mail there anyway, of course without the images.
[8] http://mail.wikipedia.org/pipermail/wiki-research-l/ [9] http://www.infodisiac.com/Wikipedia/ScanMail/Wiki-research-l.html [10] http://karma.med.harvard.edu/mailman/listinfo/freelogy-discuss
----------------------------------------------------------------------
Visualisation
I personally enjoyed the session Can Visualization Help? [11] very much.
= IBM researcher Fernanda Viégas [12] talked about the famous Wikipedia History Flow tool [13], which was recently extended, announced a free edition, and said that Tim Starling had pledged to reinstate the relevant export function so that we can use the tool on our projects.
= IBM researcher Martin Wattenberg [14] showed his newest toy, where one can see all contributions of one single Wikimedia editor, presented as an association cloud (titles grouped per namespace and sorted by number of edits, font size varied per title to express the relative number of edits). It is somewhat scary though: I feel a quantitative improvement (exposing data that are already online in a much more efficient manner) can lead to a qualitative setback (exposing one's character and interests in a way that was never expected). People may after all regret that they edited under their real name. Although I personally will happily continue to do so, it is a matter of responsibility towards the community to at least discuss whether we should actively promote such a tool. I know I'm partially guilty in this respect myself with the mailing list stats, but I feel those did not cross the line.
= Visualization guru Ben Shneiderman [15] made a case for more advanced data visualisation tools to spice up wikistats. I am a long-time admirer of several of his UI inventions and happy to take up the challenge.
[11] http://wikimania2006.wikimedia.org/wiki/Proceedings:FV1 [12] http://alumni.media.mit.edu/~fviegas/ [13] http://www.research.ibm.com/visual/projects/history_flow/ [14] http://www.bewitched.com/ [15] http://www.cs.umd.edu/~ben/
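As an aside, the core scaling trick behind such an association cloud can be sketched in a few lines. This is my own guess at the approach, with hypothetical titles and counts, not the actual algorithm of Martin's tool:

```python
import math

def font_sizes(edit_counts, min_pt=8, max_pt=36):
    """Map per-title edit counts to font sizes for an association cloud.

    Log-scale the counts so a few heavily edited titles do not dwarf
    everything else, then interpolate linearly between min_pt and max_pt.
    """
    logs = {title: math.log(n + 1) for title, n in edit_counts.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all counts are equal
    return {title: round(min_pt + (v - lo) / span * (max_pt - min_pt))
            for title, v in logs.items()}

# Hypothetical editor profile: title -> number of edits
sizes = font_sizes({"Main:Zebra": 120, "Talk:Zebra": 12, "Main:Okapi": 3})
```

The most-edited title gets the maximum font size, the least-edited one the minimum, which is exactly what makes a profile so recognisable at a glance.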
---------------------------------------------------------
General User Survey
One promising but sleeping WRN project, which I initiated myself, is the 'General User Survey' [21]. A few Wikimania participants interested in wikistats gathered ad hoc at lunch time on Saturday (others interested in the project, like Cormaggio and Piotrus, were at the conference, but not in the vicinity at that moment). Kevin Gamble, associate director of 75 Land-Grant Universities, expressed his continued interest and said he might be able to offer programming support.
A project definition plus rationale [21] and a mockup questionnaire form [22] have been created and discussed for more than a year. I started the transition towards technical design [23], and with Kevin's support and resources coding might follow later this year. Once we have a proof of concept in e.g. English and German (at least two languages, to show the multilingual aspects) I'm sure more people will start to take notice, and help to discuss and fine-tune the questionnaire. At a later stage, before going live with a multilingual golden edition, we will probably have to discuss matters with the board (Anthere already stated her support) in order to make this an official survey, hopefully with coverage on the project pages themselves (banner announcement?). Mind you, the implementation is not exactly trivial: there are lots of issues involved that require critical discussion, code and coordination. I invite everyone to comment on the tech notes, especially of course Kevin, and I hope to learn from him whether coding this project fits within his budget.
[21] http://meta.wikimedia.org/wiki/General_User_Survey [22] http://meta.wikimedia.org/wiki/General_User_Survey/Questionnaire [23] http://meta.wikimedia.org/wiki/General_User_Survey/Implementation_Issues
--------------------------------------------------------
Quantitative Analysis
On Saturday I met Jeremy Tobacman. We had a long and very interesting discussion, mainly on new initiatives centered around the freelogy servers. Jeremy proposed to hold an impromptu lunch meeting on Sunday and gathered a room full of people.
[pictures removed]
Several mails have already been written about this, but to a smaller audience. So here are a few highlights.
Issues that were discussed:
1 Hardware
The two tool servers [32] are very crowded and insufficient for all the stats jobs we might want to run. The tool servers run a mirror of the live database, so well-behaved SQL queries are possible. Well-behaved meaning they should not try to emulate the XML dump process, where extracting the English Wikipedia (all revisions) already takes a full week.
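To make 'well behaved' a bit more concrete, here is a sketch using Python's built-in sqlite3 as a stand-in for the (MySQL) replica; the schema and numbers are invented for illustration:

```python
import sqlite3

# Stand-in for the toolserver replica; this simplified schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rev_page INTEGER, rev_timestamp TEXT)")
db.execute("CREATE INDEX idx_ts ON revision(rev_timestamp)")
db.executemany("INSERT INTO revision VALUES (?, ?)",
               [(i % 50, f"2006-08-{i % 28 + 1:02d}") for i in range(1000)])

# Well behaved: bounded by an indexed range and a LIMIT, returns quickly.
recent = db.execute(
    "SELECT rev_page, COUNT(*) FROM revision "
    "WHERE rev_timestamp >= '2006-08-20' "
    "GROUP BY rev_page LIMIT 10").fetchall()

# Not well behaved: pulling every revision row is effectively a private
# dump, which is what the weekly XML dump process is for.
# all_rows = db.execute("SELECT * FROM revision").fetchall()
```

The point is simply: let the database aggregate over an indexed range instead of dragging the whole revision history across the wire.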
Alexander Wait (Sasha) has access to huge hardware resources, enough to calculate how many parallel universes it takes to find at least one zebra couple where a black-and-white mother and a white-and-black father have exactly mirrored patterns and thus produce offspring that is either all black or all white (mind you, albinos are false positives).
Since in reality Sasha is merely interested in unraveling the secrets of DNA, he has some CPU cycles to spare. Upon request virtual machines can be catered for. The freelogy-discuss mailing list archives have information about hardware availability [33].
By the way, Jeremy and Erik Tobacman have a server at The National Bureau of Economic Research (NBER) for quantitative research on Wikipedia.
Also, I am urged by the Communications Subcommittee to spend more of my time on publishable stats (in time spent, the TomeRaider offline edition of Wikipedia easily dominated, but the time for offline browsing is nearly over), and they want me to have a dedicated server. I would like it to be well utilised, but of course it should produce timely wikistats in the first place, as that is what it is offered for. To be discussed.
2 Real time data collection / Performance / Storage
It would be useful to learn when a page is being slashdotted or otherwise in the news, at the moment of the actual event, so that vandal patrols can be summoned in time and article improvement can commence right away.
Major performance issues need to be addressed.
Do we gather and keep every page hit? Hardly practicable. Wikimedia visitor stats were not disabled for no reason. It seems we are getting switches that can log accesses stochastically (e.g. every 100th access, plus all hits for a selected subset of IP addresses, to monitor navigation patterns). There might be a need to store data in aggregated (condensed) form, as volumes will be huge. At least tapping from the switches directly puts no burden on the squids (= web proxies/caches).
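A minimal sketch of such a sampling rule (the rate and the monitored IPs are made-up values, not anything the switches actually use):

```python
MONITORED_IPS = {"192.0.2.17", "192.0.2.99"}  # hypothetical subset to monitor

def should_log(hit_number, ip, rate=100):
    """Keep every `rate`-th access (a 1-in-100 systematic sample), plus
    every hit from the monitored IPs, whose complete click streams
    reveal navigation patterns."""
    return hit_number % rate == 0 or ip in MONITORED_IPS
```

Counts from the 1-in-100 sample would then be multiplied by 100 to estimate totals, which is where the aggregated (condensed) storage comes in.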
Brion will be asked to drop bz2 compression on the XML dump job, as it is so much slower and compresses so much less well than 7zip. Brion had to develop a distributed version of bzip2 to get it working at all on the 800 GB enwiki dump file. The bz2 format is however supported on more platforms, so Brion may not comply.
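For a feel of the trade-off (a toy illustration, not a benchmark of the real dumps), Python's standard library exposes both bzip2 and LZMA, the algorithm family behind 7zip:

```python
import bz2
import lzma

# Toy stand-in for dump content; real dumps are far less repetitive.
sample = b"<revision><text>some wiki markup here</text></revision>\n" * 2000

bz2_size = len(bz2.compress(sample))    # bz2: more widely supported format
lzma_size = len(lzma.compress(sample))  # LZMA: the algorithm used by 7zip
```

On the real multi-hundred-GB dumps the relative sizes and runtimes are what matter; this snippet only shows that both codecs are a one-liner to compare on any sample you care about.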
Specifically about wikistats: I explained why I always process the full historic dump instead of doing incremental steps: new functionality in wikistats means processing it all anyway, and data for older months are not really static, due to frequent deletions and moves. Could I speed up the counts section of wikistats by splitting the job over several servers? I'll have to look into it.
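Splitting the counts job could look like the sketch below: a single-machine toy with a thread pool standing in for several servers, and made-up row data. The key property is that partial counts from the slices add up to the full count:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dump rows: (article title, editor) pairs.
rows = [("Zebra", "editor%d" % (i % 7)) for i in range(1000)]

def count_edits(chunk):
    """Count edits in one slice; each slice could run on its own server."""
    return len(chunk)

slices = [rows[i::4] for i in range(4)]  # interleaved 4-way split
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(count_edits, slices))
```

Counts aggregate cleanly like this; metrics that need the whole history of an article in one place (because of deletions and moves) would force a split per article rather than per row.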
3 Data publishing
We should be careful not to publish very granular data for outside inspection. It is a well-known fact that China wants complete control over its citizens. Less known is that they have the latest technology (mainly bought in the US) and lots of it, and about 30,000 IT professionals (estimate by Reporters without Borders/Reporters sans Frontières) working on concealment of internet resources, redirection of internet requests and spying on internet usage patterns in general. They would love to see our raw access logs. Cathy, will you attend the Chinese Wikimania? [34] If you happen to hear about these things, I hope you will blog about it. See also [35].
See also the well-timed scoop [36] about the AOL privacy disaster.
4 Measuring quality quantitatively
It may be impossible to define quality, let alone measure it, but it will be fun to zoom in on it and see how far we can come. Spurred by Jimbo's excellent Wikimania kick-off speech, in which he stressed that we will need more attention to quality, I started a project to extend wikistats. Brian offered lots of ideas and will hopefully prove me wrong in my belief that adding spelling, grammar and readability assessments is not to be taken too lightly in a multilingual environment [37] [38].
[31] http://wikimania2006.wikimedia.org/wiki/Proceedings:CM1 (mp3 audio available) [32] http://meta.wikimedia.org/wiki/Toolserver [33] http://karma.med.harvard.edu/mailman/private/freelogy-discuss/2006-May/000002.html (registration needed: http://karma.med.harvard.edu/mailman/listinfo/freelogy-discuss) [34] http://en.wikinews.org/wiki/Chinese_Wikimania_2006_to_be_held_in_Hong_Kong [35] http://wikimania2006.wikimedia.org/wiki/User:Roadrunner (I wonder if he is the person who gave a smashing full-hour speech on this at 20C3 Berlin) [36] http://www.siliconbeat.com/entries/2006/08/06/aol_research_exposes_data_weve_got_a_little_sick_feeling.html (data were anonymized, but some users had searched for their own name several times and were easily recognized; lots of very embarrassing stuff was uncovered) [37] http://meta.wikimedia.org/wiki/Wikistats/Measuring_Article_Quality (conceptual overview) [38] http://meta.wikimedia.org/wiki/Wikistats/Measuring_Article_Quality/Operationalisation_for_wikistats
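To illustrate why I think readability assessment is hard to do multilingually: even a trivial, language-naive proxy like the toy below (my own sketch, not code from [37] or [38]) only captures surface features, while established formulas such as Flesch-Kincaid weight syllable counts and are tuned to a single language:

```python
import re

def crude_readability(text):
    """Language-naive proxy: average sentence length and word length.

    Real readability formulas weight syllables and vocabulary, which
    differ per language; this only scratches the surface.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\w']+", text)
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),
    }

scores = crude_readability("The zebra ran. It was very fast.")
```

Even this crude version already needs per-language care: sentence splitting, word boundaries and average word length behave very differently in, say, German or Chinese.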
---------------------------------------------------------
Ongoing
By the way, Angela Beesley and Jakob Voss will give a workshop on Wikipedia research at WikiSym 2006 [41] [42].
[41] http://ws2006.wikisym.org/space/Workshop%3E%3EWikipedia+Research [42] http://meta.wikimedia.org/wiki/Workshop_on_Wikipedia_Research%2C_WikiSym_2006
Regards, Erik Zachte
wiki-research-l@lists.wikimedia.org