Phoebe Ayers and I are leading a workshop at WikiSym this year,
"WikiLit: Collecting the Wiki and Wikipedia Literature". We would love
to have your participation!
This workshop has three key goals. First, we will examine existing and
proposed systems for collecting and analyzing the research literature
about wikis. Second, we will discuss the challenges in building such a
system and will engage participants to design a sustainable
collaborative system to achieve this goal. Finally, we will provide a
forum to build upon ongoing wiki community discussions about problems
and opportunities in finding and sharing the wiki research literature.
For more details, please see:
Please do not hesitate to ask questions, either by replying here on the
list or by contacting me or Phoebe (psayers(a)ucdavis.edu) directly.
Looking forward to seeing you at WikiSym!
I've been looking to experiment with node.js lately and created a
little toy webapp that displays updates from the major language
Wikipedias in real time:
Perhaps like you, I've often tried to convey to folks in the GLAM
sector (Galleries, Libraries, Archives and Museums) just how much
Wikipedia is actively edited. GLAM institutions are increasingly
interested in "digital curation" and I've sometimes displayed the IRC
activity at workshops to demonstrate the sheer number of people (and
bots) actively engaged in improving the content there, with the hope
of making the Wikipedia platform part of their curation.
Anyhow, I'd be interested in any feedback you might have about wikistream.
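Under the hood, apps like this typically read the live recent-changes feed on irc.wikimedia.org and parse each message before rendering it. As a rough illustration of what that parsing involves, here is a Python sketch; the field layout (and the mIRC control codes used to delimit fields) is an assumption about the feed format, not taken from wikistream's source:

```python
import re

# Strip mIRC formatting: color codes (\x03 plus optional digits) and
# bold/reset/etc. control characters.
CONTROL = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1f]")

# Assumed layout of a recent-changes message after stripping controls:
#   [[Title]] flags url * user * (+123) comment
LINE = re.compile(
    r"\[\[(?P<title>.+?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S+)\s+"
    r"\*\s+(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s*(?P<comment>.*)"
)

def parse_rc_line(raw):
    """Parse one recent-changes IRC message into a dict, or None."""
    m = LINE.match(CONTROL.sub("", raw))
    if not m:
        return None
    fields = m.groupdict()
    fields["delta"] = int(fields["delta"])  # size change in bytes
    return fields
```

A real client would feed lines from an IRC connection into `parse_rc_line` and push the resulting records to the browser (e.g. over a websocket, as a node.js app would).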
Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
Fabian Kaelin (who are all Summer of Research fellows) have worked hard
on a customized stream-based InputFormatReader that allows parsing of both
bz2-compressed and uncompressed files of the full Wikipedia dump (the dump
files with the complete edit histories) using Hadoop. Prior to WikiHadoop and
the accompanying InputFormatReader, it was not possible to use Hadoop to
analyze the full Wikipedia dump files (see the detailed tutorial / background
for an explanation of why that was not possible).
1) We can now harness Hadoop's distributed computing capabilities in
analyzing the full dump files.
2) You can send either one or two revisions to a single mapper, so it's
possible to diff two revisions and see what content has been added or
removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use
this InputFormatReader from different languages such as Java, Python, and
Ruby.
The source code is available at: https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at:
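Because the revisions arrive through Hadoop's Streaming interface, a mapper can be an ordinary script that reads records from standard input. Here is a minimal Python sketch, assuming each record is one serialized <revision> element in the standard dump schema (how revisions are actually delimited and paired per mapper depends on the WikiHadoop configuration):

```python
import sys
import xml.etree.ElementTree as ET

def revision_stats(revision_xml):
    """Extract (contributor, text length) from one <revision> element
    following the standard Wikipedia dump schema."""
    rev = ET.fromstring(revision_xml)
    contrib = rev.find("contributor")
    user = None
    if contrib is not None:
        # Registered editors carry <username>, anonymous ones <ip>.
        for tag in ("username", "ip"):
            node = contrib.find(tag)
            if node is not None:
                user = node.text
                break
    text = rev.find("text")
    length = len(text.text) if text is not None and text.text else 0
    return user, length

if __name__ == "__main__" and not sys.stdin.isatty():
    # Hadoop Streaming feeds records on stdin; emit tab-separated
    # key/value pairs for the reducer.
    for record in sys.stdin:
        user, length = revision_stats(record)
        print("%s\t%d" % (user, length))
```

A reducer could then aggregate revision counts or byte totals per contributor in the usual Streaming fashion.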
(Apologies for cross-posting to wikitech-l and wiki-research-l)
We are glad to announce the inaugural issue of the Wikimedia Research Newsletter, a new monthly survey of recent scholarly research about Wikimedia projects.
This is a joint project of the Signpost and the Wikimedia Research Committee and follows the publication of two research updates in the Signpost; see also last month's announcement on this list.
The first issue (which is simultaneously posted as a section of the Signpost and as a stand-alone article in the Wikimedia Research Index) includes five in-depth reviews of papers published over the last few
months and a number of shorter notes, for a total of 15 publications, covering both peer-reviewed research and results published in research blogs. It also includes a report from the Wikipedia research workshop
at OKCon 2011 and highlights from the Wikimedia Summer of Research program.
The following is the TOC of issue #1:
• 1 Edit wars and conflict metrics
• 2 The anatomy of a Wikipedia talk page
• 3 Wikipedians as "Janitors of Knowledge"
• 4 Use of Wikipedia among law students: a survey
• 5 Miscellaneous
• 6 Wikipedia research at OKCon 2011
• 7 Wikimedia Summer of Research
• 7.1 How New English Wikipedians Ask for Help
• 7.2 Who Edits Trending Articles on the English Wikipedia
• 7.3 The Workload of New Page Patrollers & Vandalfighters
• 8 References
We are planning to make the newsletter easy to syndicate and subscribe to. If you wish your research to be featured, a CFP or an event you organized to be highlighted, or to join the team of contributors, head over to this page to find out how:  We hope to make this newsletter a favorite read for our research community, and we look forward to your feedback and contributions.
Dario Taraborelli, Tilman Bayer (HaeB)
on behalf of the WRN contributors
Dario Taraborelli, PhD
Senior Research Analyst
(* apologies for cross-posting *)
The second issue of the monthly Wikimedia Research Newsletter is out:
In this issue:
• Effective collaboration leads to earlier article promotion
• Deleted revisions in the English Wikipedia
• Wikipedia and open-access repositories
• Quality of featured articles doesn't always impress readers
• In swine flu outbreak, Wikipedia reading preceded blogging and newspaper writing
• Extensive analysis of gender gap in Wikipedia to be presented at WikiSym 2011
• "Bandwagon effect" spurs wiki adoption among Chinese-speaking users
• In brief
You can post suggestions and contributions for the next issue at:
Dario Taraborelli, PhD
Senior Research Analyst
WikiSym 2011, the International Symposium on Wikis and Open
Collaboration, taking place October 3-5, 2011 in Mountain View,
California, has early-bird registration that ends August 29.
Finn Årup Nielsen, DTU Informatics
http://www.imm.dtu.dk/~fn/ +45 45 25 39 21.
Wikimedia UK is pleased to announce that we are offering two full scholarships to enable UK researchers to attend WikiSym this year. You can find the full information, and details of how to apply, at:
Please let me know if you have any questions, and please feel free to pass this on to anyone you think might be interested.
I would like to take a sample from the English Wikipedia based on page
ratings. This requires extracting all page ratings and then picking the
best according to specific feedback labels. There does not seem to be an
API call to filter pages directly by their rating, so my approach would
be to get all page ratings and then apply my filter criteria. The API
calls to access feedback only seem to allow one page per call, which
results in a lot of calls ;-). Example:
Does anybody know a more polite way to get this information? I have
checked the dumps but could not find a suitable archive (at least judging
by the names).
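In the absence of a rating-based filter in the API, the filtering has to happen client-side once all per-page ratings have been collected. A minimal Python sketch of that second step; the rating-record shape (page id mapped to per-label averages) is hypothetical, not the actual API response format:

```python
def top_rated(ratings, label, n=10):
    """Given per-page rating averages in a hypothetical shape, e.g.
        {25041: {"Trustworthy": 4.2, "Well-written": 3.9}, ...}
    return the n page ids with the highest average for `label`.
    Pages lacking that label are skipped."""
    scored = [(pid, r[label]) for pid, r in ratings.items() if label in r]
    # Sort by average rating, highest first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored[:n]]
```

The expensive part remains collecting `ratings` one API call per page; throttling those requests (and caching responses) would be the polite approach until a bulk dump or a filtered API call becomes available.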
> I'm doing some analysis on the wikipedia image metadata and seeing some
> missing image rows in the sql dumps.
> I downloaded
> enwiki-latest-image.sql, enwiki-latest-imagelinks.sql,
> and enwiki-latest-oldimage.sql from
> I picked a page, 25041,
> I get 39 links from
> "select il_to from imagelinks where il_from = 25041"
> When I query the image table for these, only 8 of the 39 appear.
> Some of the missing files are 050218-F-1234P-076.jpg, 020930-O-9999G-017.jpg
> I grepped the original mysql file for these and get nothing.
> I can see the original file here though:
> I did a select count and got a total of 849,801 rows. That seems low for
> the total number of Wikipedia images.
> Any ideas why I'm getting missing data?
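The discrepancy check described above can be reproduced on toy data; here is a minimal sqlite3 sketch (table and column names follow the MediaWiki schema used in the quoted queries, but the rows are made up for illustration):

```python
import sqlite3

# Toy reconstruction of the check: find imagelinks targets for a page
# that have no matching row in the local image table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
    CREATE TABLE image (img_name TEXT);
    INSERT INTO imagelinks VALUES
        (25041, 'Local.jpg'),
        (25041, '050218-F-1234P-076.jpg');
    INSERT INTO image VALUES ('Local.jpg');
""")

missing = [row[0] for row in con.execute("""
    SELECT il_to FROM imagelinks
    WHERE il_from = 25041
      AND il_to NOT IN (SELECT img_name FROM image)
""")]
print(missing)  # files linked from the page but absent from image
```

One common reason for such gaps is that files hosted on Wikimedia Commons appear in a wiki's imagelinks table but not in its local image table, which would also keep the local row count well below the total number of images used on Wikipedia.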