Analytics April 2015

analytics@lists.wikimedia.org

51 participants
46 discussions

Wikipedia aggregate clickstream data released

by Dario Taraborelli

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:

• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release:

https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
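As a rough illustration of the uses listed in the announcement above, the sketch below ranks links into and out of an article and normalizes counts into Markov-chain transition probabilities. The row layout (referer, article, count), the sample values, and the function names are assumptions made for this example; they are not taken from the released file, whose actual column names may differ.

```python
from collections import defaultdict

# Hypothetical sample rows in the form (referer, article, count).
# The titles and counts are invented for illustration only.
rows = [
    ("A", "B", 30),
    ("A", "C", 10),
    ("B", "C", 5),
    ("external-search", "A", 60),
]

def top_outgoing(rows, source, n=3):
    """Most frequently clicked targets when `source` is the referer."""
    pairs = [(article, count) for referer, article, count in rows if referer == source]
    return sorted(pairs, key=lambda p: -p[1])[:n]

def top_incoming(rows, target, n=3):
    """Most common referers that led readers to `target`."""
    pairs = [(referer, count) for referer, article, count in rows if article == target]
    return sorted(pairs, key=lambda p: -p[1])[:n]

def transition_probabilities(rows):
    """Normalize counts per referer into transition probabilities."""
    totals = defaultdict(int)
    for referer, _article, count in rows:
        totals[referer] += count
    probs = defaultdict(dict)
    for referer, article, count in rows:
        probs[referer][article] = count / totals[referer]
    return dict(probs)

if __name__ == "__main__":
    print(top_outgoing(rows, "A"))
    print(top_incoming(rows, "C"))
    print(transition_probabilities(rows)["A"])
```

Note that this normalizes only over observed (referer, article) pairs, so it ignores readers who leave the site from a page; a fuller model would need an explicit exit state.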

6 years, 3 months

Relevant Content Availability

by Abdel Samad, Rawia

Hello,

I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one's primary language.

We have a few questions related to this analysis prior to publishing it:

* We are currently using the article count by language based on the Wikimedia Foundation's public link (source: http://meta.wikimedia.org/wiki/List_of_Wikipedias). Is this a reliable source for article count, and does it include stubs?
* Is it possible to get historic data for article count? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter language code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?

Many Thanks,
Rawia

Strategy& (formerly Booz & Company)
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com

8 years, 6 months

Monthly compressed traffic delay

by Michael Hale

Hello, I'm inquiring about the delay for publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta. I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet though. I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer? Thanks, Michael

8 years, 10 months

Contribute

by Ron Baasland

Hello, My username is rbaasland and I would like to contribute to the analytics project. I was wondering if I could have access to the project, or how I go about contributing to this project? Thank you very much, Ron Baasland

8 years, 10 months

"Maybe Analytics" project in Phabricator

by Andre Klapper

Today somebody on IRC pointed out the existence of https://phabricator.wikimedia.org/tag/maybe_analytics/ which seems to be entirely unused (created in Feb 2015). Its description implies that its intended use is more or less the same as the #Blocked-on-Analytics project (created in Dec 2014). So can this project be archived? If not, how do you plan to actually use it? Generally speaking: I'm not aware of a task where the creation of this project was proposed / discussed. For future reference, please respect https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects andre -- Andre Klapper | Wikimedia Bugwrangler http://blogs.gnome.org/aklapper/

8 years, 10 months

Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

by Dario Taraborelli

I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team, aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location. [2] This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or the project talk page on Meta. [3]

Dario

[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…

8 years, 11 months

analytics-store heads up

by Sean Pringle

Hi,

analytics-store tmp space filled up today with many large temporary tables (it was ~32G) from many slow research queries. Those had to be killed, the database process restarted, and tmp space expanded. It's back up now.

Sean
--
DBA @ WMF

8 years, 11 months

April 2015 research showcase: remix and reuse in collaborative communities; the oral citations debate

by Dario Taraborelli

I am thrilled to announce our speaker lineup for this month’s research showcase <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase#April_2…>.

Jeff Nickerson (Stevens Institute of Technology) will talk about remix and reuse in collaborative communities; Heather Ford (Oxford Internet Institute) will present an overview of the oral citations debate in the English Wikipedia.

The showcase will be recorded and publicly streamed at 11.30 PT on Thursday, April 30 (livestream link will follow). We’ll hold a discussion and take questions from remote attendees via the Wikimedia Research IRC channel (#wikimedia-research <http://webchat.freenode.net/?channels=wikimedia-research> on freenode) as usual.

Looking forward to seeing you there.

Dario

Creating, remixing, and planning in open online communities
Jeff Nickerson

Paradoxically, users in remixing communities don’t remix very much. But an analysis of one remix community, Thingiverse, shows that those who actively remix end up producing work that is in turn more likely to be remixed. What does this suggest about Wikipedia editing? Wikipedia allows more types of contribution, because creating and editing pages are done in a planning context: plans are discussed on particular loci, including project talk pages. Plans on project talk pages lead to both creation and editing; some editors specialize in making article changes and others, who tend to have more experience, focus on planning rather than acting. Contributions can happen at the level of the article and also at a series of meta levels. Some patterns of behavior – with respect to creating versus editing and acting versus planning – are likely to lead to more sustained engagement and to higher quality work. Experiments are proposed to test these conjectures.

Authority, power and culture on Wikipedia: The oral citations debate
Heather Ford

In 2011, Wikimedia Foundation Advisory Board member Achal Prabhala was funded by the WMF to run a project called 'People are knowledge' or the Oral citations project <https://meta.wikimedia.org/wiki/Research:Oral_Citations>. The goal of the project was to respond to the dearth of published material about topics of relevance to communities in the developing world and, although the majority of articles in languages other than English remain intact, the English editions of these articles have had their oral citations removed. I ask why this happened, what the policy implications are for oral citations generally, and what steps can be taken in the future to respond to the problem that this project (and more recent versions of it <https://meta.wikimedia.org/wiki/Research:Indigenous_Knowledge>) set out to solve. This talk comes out of an ethnographic project in which I have interviewed some of the actors involved in the original oral citations project, including the majority of editors of the surr <https://en.wikipedia.org/wiki/surr> article that I trace in a chapter of my PhD [1] <http://www.oii.ox.ac.uk/people/?id=286>.

8 years, 11 months

article creation stuck in February

by Amir E. Aharoni

Hi,

The article creation tables were last updated for February: http://stats.wikimedia.org/EN/TablesArticlesNewPerDay.htm

When can we expect newer data, at least for March? It's pretty important for ContentTranslation metrics.

Thanks!

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces, I want to live in peace.” – T. Moore

8 years, 12 months

Event Logging outage

by Dan Andreescu

Event Logging was down for 2 hours yesterday. The incident report [1] mentions that we cannot backfill the data at this time (if this means you lost critical data, please let us know offline).

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150428-EventLo…

8 years, 12 months
