Analytics November 2013

analytics@lists.wikimedia.org

34 participants
20 discussions

Re: [Analytics] Visualizing Indic Wikipedia projects.
by sumandro 13 Mar '14

Erik,

Thanks a lot for the appreciation.

As Sajjad mentioned, we have already obtained an edit-per-location dataset from Evan (Rosen) that has the following column structure:

*language,country,city,start,end,fraction,ts*

*start* and *end* denote the beginning and ending dates for counting the number of edits, and *ts* is the timestamp.

The *fraction*, however, gives a national ratio of edit activity; that is, it gives 'total edits from that city for that language Wikipedia project' divided by 'total edits from that country for that language Wikipedia project'. Hence, it cannot be used to understand global edit contributions to a Wikipedia project (for a time period).

It seems that the original data (from which this dataset was extracted) should also have the global fractions: total edits from a city divided by total edits from the whole world, for a project, for a time period.

Would you know if the global fractions can also be derived from the XML dumps? Or, even better, is the relevant raw data available in CSV form somewhere else?

Bests,
sumandro
-------------
sumandro
ajantriks.net

On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org wrote:
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'" <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Message-ID: <016f01ce50ca$0fe736b0$2fb5a410$(a)wikimedia.org>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Awesome work! I like the flexibility of the charts: it is easy to switch metrics and presentation mode.
>
> 1. WMF has never captured ip->geo data at city level, but afaik this is going to change with Kraken.
>
> 2. Total edits per article per year can be derived from the XML dumps. I may have some CSV data that could come in handy.
>
> For edit wars you need to track reverts on a per-article basis, right? That can also be derived from the dumps.
>
> For a long history you need the full archive dumps and need to calculate a checksum per revision text. (Stub dumps have checksums, but only for the last year or two.)
>
> Erik Zachte
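
The difference between the national *fraction* in the dataset and the global fraction asked for in this message can be sketched in Python. This is only an illustration under stated assumptions: it presumes raw per-city edit counts are available (the dataset described in the message carries only the national ratio), and the function name `edit_fractions`, the row layout, and the sample numbers are hypothetical. Since global fraction = national fraction × (country total / world total), the national ratio alone is not enough to recover it.

```python
from collections import defaultdict


def edit_fractions(rows):
    """Compute national and global edit fractions per city.

    rows: iterable of (language, country, city, edits) tuples, all for
    the same time window. This layout is an assumption for illustration;
    it is not the column structure of the dataset in the message.
    """
    rows = list(rows)
    country_totals = defaultdict(int)  # edits per (language, country)
    world_totals = defaultdict(int)    # edits per language, all countries

    for language, country, _city, edits in rows:
        country_totals[(language, country)] += edits
        world_totals[language] += edits

    fractions = {}
    for language, country, city, edits in rows:
        fractions[(language, country, city)] = {
            # city edits / country edits: what the dataset's "fraction" holds
            "national": edits / country_totals[(language, country)],
            # city edits / worldwide edits: the ratio asked for in the message
            "global": edits / world_totals[language],
        }
    return fractions


# Hypothetical counts, invented purely to show the arithmetic.
sample_rows = [
    ("bn", "IN", "Kolkata", 60),
    ("bn", "IN", "Delhi", 20),
    ("bn", "BD", "Dhaka", 120),
]
sample_fractions = edit_fractions(sample_rows)
```

With these invented counts, Kolkata's national fraction is 60/80 = 0.75, while its global fraction is 60/200 = 0.30. The sketch assumes every language and country has at least one edit in the window; otherwise the divisions would need a guard against zero totals.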

8 10

the use of the templates: comparison between different wikipedias
by Yury Katkov 11 Mar '14

Hi everyone!

Has anyone tried to observe how different Wikipedias use templates: how often, what the average depth of template calls is, etc.?

-----
Yury Katkov, WikiVote

5 7

Re: [Analytics] State of mobile limn dashboard
by Arthur Richards 06 Dec '13

+analytics

On Wed, Nov 13, 2013 at 5:09 PM, Jon Robson <jrobson(a)wikimedia.org> wrote:
> Thanks so much Juliusz for exploring this and great work fixing the schema (apologies for me not predicting that might be an issue) and sorry for all the pain this must have caused you.
>
> We can't be the only team using Limn in the Foundation. It might be worth pulling everyone together. Am I right in thinking that Limn is a child of the analytics team? Maybe we should at least spend some time with them getting our use case resolved. I guess this is why we have an analytics department? I can raise this issue in the next Scrum of Scrums if it is not resolved by then.
>
> On Wed, Nov 13, 2013 at 3:54 PM, Juliusz Gonera <jgonera(a)wikimedia.org> wrote:
> > For the past few days (or more) the graphs at http://mobile-reportcard.wmflabs.org/ have not been updating. The dashboard consists of two parts: Limn, which displays the data, and backend scripts that generate the graph data based on Event Logging data. The issue was caused by two independent problems in the second component:
> >
> > 1. A change of the MobileWebEditing schema was incorrectly addressed in the scripts' config and caused the script to throw an exception.
> > 2. Backend scripts are stupid and not optimized at all.
> >
> > The first thing is fixed. To work around the second thing I had to disable updates of the "Editors registered on mobile who made 5+ edits on enwiki (mobile+desktop)" graph [1] for now (the query was timing out and causing an exception too) and removed the performance graph, since we'll be using ganglia (and soon graphite) for that [2]. Graphs should get updated soon.
> >
> > So why are those backend scripts stupid? Because they run every hour and recalculate _all_ the values for every single graph. For example, even though total unique editors for June 2013 will never change, they are still recalculated every hour. This was a quick and easy solution for generating graphs, but as Event Logging tables keep growing, we add more graphs and those graphs show more and more data, it no longer performs well.
> >
> > I discussed this briefly with Ori and I think we agree on the general direction. We should definitely schedule some time for working on this. We could start with a spike investigating whether there is a framework for aggregating the sums that we could use, and asking what other teams in the Foundation use for generating their graph data. The results of this spike and possible following work could be useful not only for the mobile team.
> >
> > [1] https://gerrit.wikimedia.org/r/#/c/95298/
> > [2] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&tab=v&vn=Mobile+Web&hi…
> >
> > --
> > Juliusz

--
Arthur Richards
Software Engineer, Mobile
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687
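
The point in the quoted message about recalculating closed periods every hour can be illustrated with a small sketch. It is not the mobile team's actual scripts and not a framework recommendation; the names `update_monthly_series`, `month_key`, and the `compute` callback are hypothetical. The assumption is that a value for a finished month (such as total unique editors for June 2013) never changes, so only the current month and months missing from the cache need a query.

```python
from datetime import date
from typing import Callable, Dict, Iterable


def month_key(d: date) -> str:
    """Return a zero-padded 'YYYY-MM' key, so string order equals date order."""
    return f"{d.year:04d}-{d.month:02d}"


def update_monthly_series(
    months: Iterable[str],
    cache: Dict[str, int],
    compute: Callable[[str], int],
    today: date,
) -> Dict[str, int]:
    """Refresh a monthly series, recomputing only what can still change.

    months:  'YYYY-MM' keys the graph should show
    cache:   previously computed values (updated in place)
    compute: hypothetical callback that runs the expensive query for one month
    """
    current = month_key(today)
    for month in months:
        # Closed months are reused from the cache. The current month is
        # still open, and uncached months have never been computed.
        if month >= current or month not in cache:
            cache[month] = compute(month)
    return cache
```

On an hourly run, a series with many past months then issues one query (for the open month) instead of one per month. A real implementation would also need a way to invalidate cached months when a schema or query definition changes, which this sketch leaves out.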

10 51

Really strange malformed requests since yesterday
by Faidon Liambotis 30 Nov '13

Hi,

We've been having spikes in our 5xx error logs since yesterday. There are definitely multiple distinct causes for those, incl. esams network issues, random people trying to DoS us, MediaWiki bugs that got backported yesterday, etc.

One of the most peculiar causes of errors, though, is requests of this form:

    GET \\nki/Random_article HTTP/1.1
    Host: en.wikipedia.org
    ...

That's GET, space, backslash, newline, ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "\" and that "ki/Random_article HTTP/1.1" is some random malformed header, so it responds with a 503 (and not a 400; that's a bug of its own).

The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).

I've verified that those strange requests are coming directly to our frontends; they are not passing through our SSL terminators or special proxies like Opera Mini. You can see e.g. a sample filtered pcap at fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has private data, do not share). The packets' TCP checksum is obviously correct.

Those requests are always for en.wikipedia.org articles, no other languages or projects. They come from all user-agents & operating systems (so, probably not malware). They have all kinds of Referers, including internal links. About 3/4 are coming from Google, but this isn't irregular. Some of them have proper Cookies, including session tokens and such (so, probably not just spoofed UAs).

There are 83,010 requests, coming from 21,193 unique IPs in 121 different countries. The distribution by country is the most interesting part; the top 5 by unique IPs reads:

    18152 IN
      271 PH
      268 AE
      228 MY
      207 US

i.e. 85% come from India (but not from a particular ISP), in a >24h period.

The distribution of hits per datacenter is:

    78938 eqiad (incl. 72516 for India)
     4072 esams

I've been on this for some time and I'm currently out of ideas. At this point, the only theory that I have is that some popular CPE device or, alternatively, a state surveillance device (e.g. BlueCoat) has gone haywire and is corrupting HTTP requests (paranoia about state surveillance was one of the reasons I kept digging). Some parts don't fit in either theory (traffic is distributed across both DCs & multiple countries for state surveillance; requests are too targeted to enwiki for CPEs).

Other thoughts? Am I missing something completely obvious?

Regards,
Faidon
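
The misparse described in this message can be illustrated with a toy line-oriented parser. This is not Varnish's parser, only a minimal sketch of why a newline inside the request line leaves "\" as the URL and pushes the rest into the header area. The function name `parse_request_head` is hypothetical, and the sample bytes assume the stray newline is a bare LF, which the message does not state.

```python
def parse_request_head(raw: bytes):
    """Split a raw HTTP request head the way a naive line-based parser would.

    Returns (method, url, version, remaining_lines). version is None when
    the first line carries no third token.
    """
    lines = raw.split(b"\n")
    request_line = lines[0].rstrip(b"\r")
    parts = request_line.split(b" ")

    method = parts[0]
    url = parts[1] if len(parts) > 1 else b""
    version = parts[2] if len(parts) > 2 else None

    # Everything after the first line is treated as header lines.
    remaining = [line.rstrip(b"\r") for line in lines[1:] if line.strip()]
    return method, url, version, remaining


# The malformed form from the message: GET, space, backslash, newline, ...
malformed = b"GET \\\nki/Random_article HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"
# A well-formed request for comparison (the path here is illustrative).
well_formed = b"GET /wiki/Random_article HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"

bad = parse_request_head(malformed)
good = parse_request_head(well_formed)
```

For the malformed bytes the URL comes out as a single backslash with no HTTP version, and `ki/Random_article HTTP/1.1` is the first "header" line, which has no colon and so cannot be a valid header field.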

3 4

Updates on the Analytics team
by Toby Negrin 27 Nov '13

There are some changes in the Analytics team I'd like to let you know about.

First of all, I'd like to congratulate Dario Taraborelli on his promotion to Senior Research Scientist, Research and Data team lead. Dario has made significant contributions to the Foundation and Community's use of data and analysis over the past several years and has been instrumental in establishing the charter of the new Research and Data team in Analytics. I'm really excited about Dario's increased scope and visibility in his new role both in and outside of the Foundation.

In addition, Aaron Halfaker has been given the new title of Research Scientist. This new role recognizes his expertise and scientific contributions as a researcher and now as a full-time member of the Research and Data team at the Foundation.

Please join me in congratulating Dario and Aaron!

I also want to announce that Diederik van Liere will be leaving the Analytics team and is exploring other roles within the Foundation. Currently, he is helping Rob Lanphier with the planning of the Architecture summit. I will be handling the Product Management duties pending the hiring of a new Product Owner.

Diederik played a significant role in introducing both Hadoop and Kafka to the Foundation's technology stack, and in introducing Scrum to the Analytics team. I would like to thank Diederik for his 2 years of product leadership in Analytics and his contributions to the Foundation and the Community.

-Toby

11 10

Fwd: [Wikimedia-l] Increase in page views for the last 3 months
by Matthew Flaschen 23 Nov '13

It's not clear if this is a bug or true organic growth, but it seems to be occurring across multiple Wikipedias (see the rest of the thread).

Matt Flaschen

-------- Original Message --------
Subject: [Wikimedia-l] Increase in page views for the last 3 months
Date: Fri, 22 Nov 2013 23:41:09 +0200
From: Strainu <strainu10(a)gmail.com>
Reply-To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>

Hi,

Looking at the summary reports per language, I've noticed a significant, linear increase in pageviews for many European-language Wikipedias (ro, bg, hu, fr) in the last 3 months. This is not happening for Asian languages or Russian and is not obvious from the report card.

Has anything changed in the reporting or the visit patterns for these Wikipedias? It looks pretty weird to have a 100% increase for Romanian in just 3 months [1].

Thanks,
Strainu

[1] http://stats.wikimedia.org/EN/SummaryRO.htm

1 0

Notifications dashboard
by Jan Ainali 23 Nov '13

What happened with the Notifications statistics on Swedish WP on Nov 8?

http://ee-dashboard.wmflabs.org/dashboards/svwiki-features

No thanks, mentions, talk, reverts or review notifications since then, only link and system.

*Best regards,
Jan Ainali*
CEO, Wikimedia Sverige <http://se.wikimedia.org/wiki/Huvudsida>

5 8

Re: [Analytics] [eventlogging-alerts] Brief EventLogging outage coming right up
by Ori Livneh 21 Nov '13

On Thu, Nov 21, 2013 at 10:43 AM, EventLogging Alerts <eventlogging-alerts(a)lists.wikimedia.org> wrote:
> Chris needs to replace a failed disk on vanadium, which should only take 5-10 minutes. I will follow up to indicate the exact start time and duration of the outage.

Maintenance complete; outage was 8 minutes: Thu Nov 21 18:48 UTC to 18:56 UTC.

1 0

Brief EventLogging outage coming right up
by Ori Livneh 21 Nov '13

Chris needs to replace a failed disk on vanadium, which should only take 5-10 minutes. I will follow up to indicate the exact start time and duration of the outage.

---
Ori Livneh
ori(a)wikimedia.org

1 0

Tracking progress on community data/research requests
by Dario Taraborelli 21 Nov '13

The Research & Data team is currently experimenting with a tool called Trello for tracking progress and simplifying monthly reporting [1].

We don’t have a good solution for tracking progress on research/data support requests originating from the community or from non-WMF researchers. Using the same board for these requests is not going to work:

- the board is currently set up as read-only for non-WMF users
- it mostly reflects work prioritized by the team as part of our quarterly planning [2] and it’s not designed as a generic inbox for data requests
- repurposing the board as a generic backlog would set the wrong expectation that the team has the bandwidth or a mandate to support these requests as they come in

What if we set up a public (read/write accessible) board where anyone (including volunteers) can create, pick up, execute and complete requests? The purpose of this would be purely to categorize, track and (self-)assign or reassign tasks: the actual requirements and the output of a request would be hosted on Meta (for example in the Research Index or the Labs2 portal) and/or in a public data repository.

How do people feel about this? We also have a bugzilla component for generic analytics requests that people have been using for a while [3], but I don’t think it has been particularly successful because BZ is mostly focused on development and bug reports or feature requests for analytics infrastructure.

The bottom line is that I don’t want to create more work for WMF researchers (we are a small team of 2.5 FTE staffers supporting the whole organization, if we exclude WMF analysts that are not part of Analytics), but rather to test whether a lightweight tool like Trello can be used to distribute tasks and track progress on a body of research and data requests.

Dario

[1] https://trello.com/b/k5N0ivoM/research-and-data
[2] https://www.mediawiki.org/wiki/File:Analytics_Quarterly_Review_Q2_2013_(Res…
[3] https://bugzilla.wikimedia.org/buglist.cgi?list_id=251983&resolution=---&re…

3 7
