Analytics December 2014

analytics@lists.wikimedia.org

37 participants
37 discussions

Adventures in Clusterland 2014-12-08--2014-12-14
by Christian Aistleitner 17 Dec '14


Hi,

in the week from 2014-12-08–2014-12-14 Andrew and I worked on the following items around the Analytics Cluster and Analytics related Ops:

* Stat1001 behind misc-web
* Compression analysis for storing xmldumps in cluster
* EventLogging replication lag

(details below)

Have fun,
Christian

* Stat1001 behind misc-web

stat1001 (which handles stats.wikimedia.org and datasets.wikimedia.org) got moved behind misc-web. This makes stat1001 use the WMF standard SSL setup and removes certificate issues (like T74805 [1]). So URLs like https://datasets.wikimedia.org/public-datasets/ (note the s in https) should finally work without warnings/errors.

* Compression analysis for making xmldumps available in cluster

More research around making xmldumps available in the cluster has been done. The numbers can be found on https://wikitech.wikimedia.org/wiki/Analytics/Cluster/xmldumps#Results

* EventLogging replication lag

EventLogging replication got stuck, but only for some tables. This was caused by a combination of EventLogging being liberal in what characters are allowed in table names and the replication being very defensive. Sean made the blocked replication behave again (thanks!), and replication caught up. Restrictions on table naming got set up and are still getting tuned a bit.

[1] https://phabricator.wikimedia.org/T74805

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3, 4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
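The message above does not say which table-naming restrictions were put in place for EventLogging. As a rough illustration only, a conservative check could look like the sketch below. The pattern and the function name `is_safe_table_name` are assumptions made for this example and are not the actual EventLogging rule; the 64-character cap mirrors MySQL's limit on table name length.

```python
import re

# Hypothetical conservative rule (not the real EventLogging restriction):
# start with an ASCII letter, continue with ASCII letters, digits, or
# underscores, and stay within 64 characters in total.
SAFE_TABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,63}")


def is_safe_table_name(name):
    """Return True if `name` matches the conservative pattern above."""
    return SAFE_TABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["ExampleSchema_12345", "Bad-Name", "with space"]:
        print(candidate, is_safe_table_name(candidate))
```

Rejecting names at the point where tables are created keeps a liberal producer from handing a defensive consumer, such as replication, a name it will not accept.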

1 0

Adventures in Clusterland 2014-12-01--2014-12-07
by Christian Aistleitner 17 Dec '14


Hi,

in the week from 2014-12-01–2014-12-07 Andrew and I worked on the following items around the Analytics Cluster and Analytics related Ops:

* Change in SSL setup causing pagecounts-raw to be off ... temporary
* Preparing for vlan move of stats machines
* Ganglia -> Graphite -> Grafana
* Wikipedia Zero graph comparability

(details below)

Have fun,
Christian

* Change in SSL setup causing pagecounts-raw to be off ... temporary

Ops changed the SSL setup from dedicated SSL terminators to cache-local SSL terminators for eqiad and esams. This change came as a bit of a surprise to us, and (as expected) made webstatscollector's C implementation (pagecounts-raw) overcount HTTPS traffic. We adjusted webstatscollector's C implementation accordingly.

Some weeks back, that would have been the end of the story and we'd just have been left with a few days of broken data. But we now have the data in the cluster, and have a Hive implementation too. So we could effectively backfill pagecounts-raw for the affected days. To my knowledge, this is the first time we could cover/mitigate an issue with webstatscollector on the udp2log pipeline through the cluster. And pagecounts-raw has good data again for the affected period :-)

* Preparing for vlan move of stats machines

To develop infrastructure and research pipelines, devs and researchers need some more basic development tools (e.g. Maven, Virtualenv) on stat100[123] that Ops would prefer us not to use in the machines' current vlan. Hence, preparations started to move stat100[123] into the separate analytics vlan. This will address the concerns of Ops while still allowing the needed tools to be installed.

* Ganglia -> Graphite -> Grafana

Ops is moving more and more from ganglia to graphite to do checks on numbers. So work has been started to look into graphite a bit more and into how to instrument it to perform checks. The cluster got re-configured so that the key metrics get fed into graphite.

For dashboarding, it seems grafana might give a kibana-like interface. And http://grafana.wikimedia.org/#/dashboard/db/kafka got set up to provide a high-level, realtime view on kafka.

* Wikipedia Zero graph comparability

Following up from the previous week, the Wikipedia Zero team had further concerns about the differences between their new on-wiki graphs and the Analytics team's dashboards. We identified and explained the differences for them.

1 0

EventLogging data QA
by Dario Taraborelli 16 Dec '14


I am kicking off this thread after a good conversation with Nuria and Kaldari on pain points and opportunities we have around data QA for EventLogging.

Kaldari, Leila and I have gone through several rounds of data QA before and after the deployment of new features on Mobile and we haven’t yet found a good solution to catch data quality issues early enough in the deployment cycle.

Data quality issues with EventLogging typically fall under one of these 5 scenarios:

1) events are logged and schema-compliant but don’t capture data correctly (for example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI element is not instrumented)
4) events are missing due to client issues (a specific UI element is not correctly rendered on a given browser/platform and as a result the event is not fired)
5) events are missing due to EventLogging outages

In the early days, Ori and I floated the idea of unit tests for instrumentation to capture constraint violations that are not easily detected via manual testing or the existing client-side validation, but this never happened. When it comes to feature deployments, beta labs is a great starting point for running manual data QA in an environment that is as close as possible to prod. However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).

Having a full-fledged set of unit tests for data would be terrific, but in the short term I’d like to find a better way to at least identify events that fail validation as early as possible.

- the SQL log database has real-time data but only for events that pass client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only rsync’ed from vanadium once a day

Is there a way to inspect invalid events in near real time without having access to vanadium? For example, could we create either a dedicated database to write invalid events only or a logfile for validation errors rsync’ed to stat1003 more frequently than once a day?

Thoughts?

Dario
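Scenario 2 above (events that are logged but not schema-compliant) and the idea of writing invalid events to a dedicated place can be sketched roughly as follows. This is a minimal illustration, not EventLogging's actual validation code: the schema layout, the field names, and the helpers `validation_errors` and `split_events` are all hypothetical.

```python
# Hypothetical schema for illustration: field name -> (expected type, required?)
SCHEMA = {
    "action": (str, True),
    "pageId": (int, True),
    "userAgent": (str, False),
}


def validation_errors(event, schema=SCHEMA):
    """Return a list of human-readable problems found in one event."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in event:
            if required:
                errors.append("missing required field: " + field)
        elif not isinstance(event[field], expected_type):
            errors.append("wrong type for field: " + field)
    for field in event:
        if field not in schema:
            errors.append("unexpected field: " + field)
    return errors


def split_events(events, schema=SCHEMA):
    """Separate events into valid ones and (event, errors) pairs."""
    valid, invalid = [], []
    for event in events:
        errors = validation_errors(event, schema)
        if errors:
            invalid.append((event, errors))
        else:
            valid.append(event)
    return valid, invalid


if __name__ == "__main__":
    sample = [
        {"action": "click", "pageId": 1},
        {"action": "click"},                  # required field missing
        {"action": "click", "pageId": "x"},   # wrong type
    ]
    valid, invalid = split_events(sample)
    print(len(valid), "valid;", len(invalid), "invalid")
    for event, errors in invalid:
        print(event, errors)
```

Keeping the error messages next to each rejected event is what would make a separate invalid-events log or table useful for QA, since the reason for rejection can be read without reproducing the validation step.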

9 17

Switching the R&D team to Phabricator
by Oliver Keyes 15 Dec '14


Hey all,

As people probably(?) know, the WMF has replaced Bugzilla with Phabricator (https://phabricator.wikimedia.org/). This is also taking over from a host of other services, including RT. Analytics Engineering has already switched over, as have a lot of teams, but R&D has not - instead, we use Trello (https://trello.com/b/k5N0ivoM/research-and-data).

I think that if we're going to switch over, we should probably do it reasonably soon (the longer we wait, the more things we have to port). This thread is to have the switch-or-not conversation in. I'll start ;p.

I'd like to strongly advocate that we switch to Phabricator, for several reasons. Even if Phabricator were less good than Trello, there's an inherent advantage in consolidating systems. It means fewer logins to maintain, and a less-distributed workset. By extension, it means a reduced barrier for interacting with other teams, or volunteers, and vice versa.

But actually, Phab isn't worse than Trello: it's better. For one thing, it's better at letting us work with other teams. We're dependent on Analytics Engineering (on Phabricator), and work with the VE team (on Phabricator), Fundraising (on Mingle), Mobile (also on Trello)... the list goes on and on. The Trello model, in which everything is split out into different boards you may or may not have access to, combined with the distribution of teams across platforms, makes it a constant pain to bring people into conversations and work on problems that are both our problem + AnEng's problem, or our problem + customer's problem. People need to cross the streams or juggle multiple logins.

With Phabricator, it's a lot easier to see what everyone is doing, keep abreast of the general gestalt in movement/WMF work, and chip in on tasks that don't officially belong to your team. And because a lot of teams use it, the responsiveness from customers when we ask questions is a lot better.

Phab also seems to, at least for me, naturally fit my work process better. I think of a research project ("find out how long mobile sessions are") as actually being a series of individual tasks - "find out what a session is", "work out how to measure it", "measure it". Trello doesn't really have support for that kind of hierarchical, dependent, chunked work. It has checklists but they don't allow for any actual data segmentation or detail. Alternatively you can write multiple cards and link them together, but this is entirely ad-hoc; there's no structure to it, it's not obvious without reading each card what the relationship is, and you have to do the heavy lifting yourself. Phabricator is designed for precisely this model, because that's how engineering work tends to break down. It's built-in, fully supported, and extracting the tree is easy.

So those are the reasons I have, off the top of my head. Other reasons? Counter-arguments? Post em here.

--
Oliver Keyes
Research Analyst
Wikimedia Foundation

7 11

EventLogging workshop at the Wikimedia Developer Summit (WMDS)
by Kevin Leduc 15 Dec '14


I have updated our team's workshop entry at the WMDS. The Analytics Engineering team wants to lead an EventLogging workshop. If you are interested in attending, please add your name to the list in this section: https://www.mediawiki.org/wiki/MediaWiki_Developer_Summit_2015#Setting_up_E…

The more people add their names to the list, the more likely this will happen!

1 0

Detecting SSL terminators
by Oliver Keyes 15 Dec '14


So, we've had conversations about detecting SSL terminators, for two reasons:

1. It would allow us to know when, particularly, we should trust x_forwarded_for fields for geolocation;
2. More importantly, it would allow us to reliably exclude traffic from internal IP ranges without excluding SSL traffic.

Aaron talked to Ops about this problem (notes at http://etherpad.wikimedia.org/p/ssl_terminators) - in conversation with Ori, though, I found out that this approach won't actually work, because caches != SSL terminators, at least not all the time.

So: what's the right approach? How do we find these things easily and automagically?

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
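To make the two goals above concrete, here is a minimal sketch of what the filtering could look like once a list of SSL terminator addresses exists. It does not answer the open question of how to discover the terminators. The address ranges are placeholders, the function names are hypothetical, and treating the first x_forwarded_for entry as the client address is an assumption of this example.

```python
import ipaddress

# Placeholder ranges for illustration only; these are NOT real WMF ranges.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
# Hypothetical terminator range, assumed to sit inside the internal range.
SSL_TERMINATOR_NETWORKS = [ipaddress.ip_network("10.0.0.0/28")]


def in_any_network(ip, networks):
    """Return True if the address string `ip` falls in any of `networks`."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


def effective_client_ip(remote_ip, x_forwarded_for=""):
    """Trust x_forwarded_for only when the request came via a known terminator.

    Assumption: the first entry in x_forwarded_for is the original client.
    """
    if x_forwarded_for and in_any_network(remote_ip, SSL_TERMINATOR_NETWORKS):
        return x_forwarded_for.split(",")[0].strip()
    return remote_ip


def is_internal_traffic(remote_ip, x_forwarded_for=""):
    """Classify a request as internal based on its effective client address."""
    client = effective_client_ip(remote_ip, x_forwarded_for)
    return in_any_network(client, INTERNAL_NETWORKS)


if __name__ == "__main__":
    # SSL request relayed by a terminator: counted as external traffic.
    print(is_internal_traffic("10.0.0.5", "203.0.113.7"))
    # Internal host, header not trusted: still counted as internal.
    print(is_internal_traffic("10.1.2.3", "203.0.113.7"))
```

With a check of this kind, SSL requests arriving from a terminator are judged by the forwarded client address, while other internal hosts cannot escape the internal filter simply by sending an x_forwarded_for header.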

3 8

Is VisualEditor good for preserving new editors?
by Amir E. Aharoni 15 Dec '14


Hi, One thing I keep wondering about is how good the VisualEditor is at acquiring and preserving new editors. A few months ago I wrote a project proposal here: https://meta.wikimedia.org/wiki/Research:Ideas/How_does_the_availability_of… See also its talk page for a bit of work on the subject by a Hebrew Wikipedia editor: https://meta.wikimedia.org/wiki/Research_talk:Ideas/How_does_the_availabili… I didn't see much more follow-up on my proposal. Is anybody else working on anything like that? -- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬

4 7

browsers: readers vs. editors
by Amir E. Aharoni 15 Dec '14


Hi, On http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm there are statistics for the numbers of people who use different browsers. Are there statistics that show separate numbers for people who read articles and people who edit them? -- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬

3 4

The state of field names in MediaWiki data
by Aaron Halfaker 12 Dec '14


Hey folks,

I was talking to ottomata today about developing a schema for processing revisions in Hadoop. We came across a deep problem with field names that I'd like to discuss because I want people to be aware of the problem. To explain this, I'll use an example.

Let's say you want to get the namespace of this page: https://en.wikipedia.org/wiki/Biology

In javascript, this is represented as the variable *wgNamespaceNumber*.
In the database, this is represented as *page.page_namespace*.
In the XML database dump, this is represented as the value at *<page><ns>* or *<namespaces><namespace.key>* depending where you are.

Right now, ottomata and I are considering the more descriptive name *page_namespace_id* since the value of all of these variables/fields is an identifier -- not a name. I think that this is a *good* name if we consider it in a vacuum, but if we choose it, we'll add yet another name for wiki devs & analysts to be aware of. Given the context of this decision, my instinct is to choose the least surprising name. Since I mostly work with the database, that would mean I'd choose *page_namespace*.

This is just one example of such nonsense. The decisions we make in formats that we produce now can have immeasurable effects on the sanity of others. I hope that the decisions we make today will minimize such pain, but it's hard to know for sure.

-Aaron
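The naming problem described above can be illustrated with a small sketch that renames each source's field to one shared name. The alias table uses the names quoted in the email, while the source labels, the flat record layout, and the helper `normalize_record` are assumptions made for this example. *page_namespace* is used as the shared name only because the email calls it the least surprising choice for database users.

```python
# Names for the same value, as listed in the email above. The keys
# ("javascript", "database", "xml_dump") are labels invented for this sketch.
NAMESPACE_ALIASES = {
    "javascript": "wgNamespaceNumber",
    "database": "page_namespace",
    "xml_dump": "ns",
}

# Shared name used in this sketch; the email leaves the final choice open.
CANONICAL_NAME = "page_namespace"


def normalize_record(record, source):
    """Return a copy of `record` with the namespace field renamed.

    Assumes `record` is a flat dict keyed by the source's own field names.
    """
    alias = NAMESPACE_ALIASES[source]
    normalized = dict(record)
    if alias in normalized and alias != CANONICAL_NAME:
        normalized[CANONICAL_NAME] = normalized.pop(alias)
    return normalized


if __name__ == "__main__":
    print(normalize_record({"wgNamespaceNumber": 0, "title": "Biology"}, "javascript"))
    print(normalize_record({"ns": 0, "title": "Biology"}, "xml_dump"))
```

A table like this has to be maintained for every field and every source, so each additional name, however descriptive, adds one more entry that devs and analysts need to know about.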

5 11

EventLogging and Adblock on Linux/Firefox
by Leila Zia 12 Dec '14


Hi everyone, From some initial tests it appears to me that EventLogging is not logging events from Linux/Firefox when Adblock is enabled. I'm on Ubuntu 14.04, Firefox 34.0, and Adblock Plus 2.6.6. When I disable Adblock, I see event.gif?{...} in Console, when I enable it, I don't. Just to make sure, I've checked the EL tables and my events don't get registered there. I'd be happy to sit with someone to troubleshoot before 4pm (PST), after 5:30pm (PST) or tomorrow. Leila

7 16
