Hi,
The data will come from node-txstatsD, a statsD client ported from
statsD/node-statsD. I had previously been working on using Event
Emitters in conjunction with statsD, but node-txstatsD seems to solve
the task of obtaining timing data and sending it on to a visualization
front-end.
The data to be visualized will be timings (in ms), counts, and maybe
sizes (KB, MB).
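For concreteness, here is a minimal sketch of emitting those three
kinds of metrics with a node-statsd-style client; the host, port, and
metric names are made up, and node-txstatsD's exact constructor and
method names may differ:

// Minimal sketch using a node-statsd-style client API.
// Host, port, and metric names below are illustrative only.
var StatsD = require( 'node-statsd' ).StatsD;
var client = new StatsD( { host: 'localhost', port: 8125 } );

client.timing( 'app.render.duration', 42 ); // timing in ms
client.increment( 'app.render.count' );     // plain counter
client.gauge( 'app.payload.kb', 128 );      // size, e.g. in KB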
Based on this, what front-end would you recommend for visualizing the
information?
> Date: Tue, 30 Dec 2014 07:37:35 -0800
> From: Nuria Ruiz <nuria(a)wikimedia.org>
> To: analytics(a)lists.wikimedia.org
> Subject: Re: [Analytics] Performance Visualization Frontend
>
> Hello,
>
> The most important question is where your data will come from:
> EventLogging? Graphite? Elsewhere? Visualization is secondary to this.
>
> EventLogging is a good solution for structured, somewhat complex
> application data; Graphite is a good solution for plain counters, which is
> well suited to perf data. Let us know if you already have data and we can
> proceed from there.
>
> Thanks,
>
> Nuria
Hi,
Within the Analytics cluster, the pagecounts-all-sites dataset is
still referred to by the legacy name "webstats" in
* the wmf.webstats Hive table, and
* the /wmf/data/archive/webstats HDFS path.
Neither has any known external customers, so renaming them to
"pagecounts-all-sites" should not affect anyone.
But just in case ... if you use either of them, let us know by
2015-01-09 08:00 UTC.
If no one speaks up, I'll remove the webstats Hive table and the
webstats HDFS path.
The new Hive table
wmf.pagecounts_all_sites
and the new HDFS path
/wmf/data/archive/pagecounts-all-sites
are already available and contain both the old and the new data.
Have fun,
Christian
P.S.: This only affects the data in the Analytics cluster. The public
URL stays unaffected:
http://dumps.wikimedia.org/other/pagecounts-all-sites/
No changes there.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Christian Aistleitner           Companies' registry: 360296y in Linz
Kefermarkterstraße 6a/3         Email: christian(a)quelltextlich.at
4293 Gutau, Austria             Phone: +43 7946 / 20 5 81
                                Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
One of the persistent pain-points for teams using EventLogging seems to be
the ease with which schema violation errors can slip through to production
and linger unnoticed. Validation errors are written to the browser console,
where they are easy to miss.
There are several things we could do to make it easier to spot problems in
client-side analytics code. I'd like the schema page on Meta-Wiki to include
live graphs showing the volume of data (valid and invalid) currently being
logged under that schema, and I'd like us to have some automatic alerting on
it. Nuria is also working on making the process of testing analytics code in
labs clearer and more useful.
One thing you can do right now (and that I hope you will do) is to add a
small piece of custom JavaScript to your global.js that will flash
validation errors in an error bar on the page you are viewing. To do that,
edit both
https://meta.wikimedia.org/wiki/User:<your username>/global.js (for production) and
http://meta.wikimedia.beta.wmflabs.org/wiki/User:<your username>/global.js (for labs),
and add the following:
// Show EventLogging validation errors in a dismissible bar pinned to
// the top of the page; click the bar to clear and hide it.
var $el = $( '<pre style="background: yellow; margin: 0; padding: 8px; position: fixed; top: 0; width: 100%; z-index: 99"></pre>' );
$el.click( function () { $el.empty().detach(); } );
mw.trackSubscribe( 'eventlogging.error', function ( topic, err ) {
    // Append each new error on its own line, then (re-)attach the bar.
    $el.text( function ( idx, text ) {
        return ( text && text + '\n' ) + err;
    } ).appendTo( 'body' );
} );
This code won't change your user experience in any way unless there
are schema violations in the JavaScript code loaded by the page. If
there are any errors, they will look like this:
http://i.imgur.com/ReVnfbn.png
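If you want to verify the bar itself works without waiting for a real
violation, you can publish a fake error to the same topic from the
browser console; mw.track is the standard publish call, and the
message text here is obviously made up:

// Exercise the error bar by hand with a fake validation error:
mw.track( 'eventlogging.error', 'Test: not a real validation error' );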
Hi everyone,
I recently took a very close look at client-gathered statistics about image
serving performance from within Media Viewer, focusing on the effect of
thumbnail pre-rendering at upload time (which has been live for a few
months) and thumbnail chaining (which was live for a few weeks and has now
been turned off). The main question I wanted to answer was whether either of
those techniques improved performance as experienced by users.
Chaining, when combined with pre-rendering, had no noticeable effect on the
performance experienced by viewers. This is logical: with pre-rendering, the
thumbnail-generation gains happen only at upload time, so clients requesting
the image later won't be affected. As for the effect on image scaler load,
it was so insignificant that it couldn't be measured. Chaining is probably
still useful for people requesting non-standard thumbnail sizes, which I'm
not measuring since I've only been looking at Media Viewer; but if that's
the only use case where chaining helps, addressing the community concerns
over JPG sharpening in order to redeploy it seems much lower priority to me
now.
The big discovery in my research is that we set out to do pre-rendering
based on a wrong assumption. When looking at performance statistics earlier
last year, we clearly saw that Varnish misses performed a lot worse than
Varnish hits (well, duh), so we set out to pre-render the thumbnail sizes
Media Viewer needs in order to drastically reduce the number of Varnish
misses. The reduction didn't happen.
The wrong assumption was that each Varnish miss is a case where the
requested thumbnail has to be generated on the fly by the backend. The data
I've just discovered shows that this is very rare for the thumbnail sizes
Media Viewer currently uses. The vast majority of Varnish misses merely pull
from Swift a thumbnail that was already rendered at some earlier point in
time and just happens not to have been requested for a while. That Swift
pull + Varnish re-add is what makes the majority of Varnish misses perform
worse than hits, not the need to generate the thumbnail with ImageMagick.
The bottom line is that thumbnail pre-rendering provided insignificant
performance gains for this set of sizes. Infrequently requested thumbnails
are the main problem, not the fact that they are rendered on the fly the
first time they are requested.
It seems like the only way to increase image serving performance in our
current setup is to increase the expiry value in Varnish and/or increase
Varnish capacity. Right now 17% of image requests in Media Viewer are
Varnish misses, and 99.5% of those are pulling an existing thumbnail from
Swift. Varnish misses are twice as slow as hits on average.
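A quick back-of-the-envelope from those figures: 0.17 x 0.005 = 0.00085,
so currently fewer than 0.1% of all Media Viewer image requests end in an
actual ImageMagick render; the experiment described below should show how
much that share grows without pre-rendering.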
I plan to disable pre-rendering next week in order to confirm these
findings and determine for certain what percentage of image requests
pre-rendering helps, for the set of sizes Media Viewer currently uses.
If you want to dig into the data, the relevant tables on the analytics DB
are MultimediaViewerNetworkPerformance* and more specifically the
event_varnish*, event_timestamp and event_lastModified columns.
Hi,
The mobile team is planning to switch WikiGrok on for non-logged-in users
next week (2015-01-12). The widget will be enabled on 166,029 article pages
on enwiki. There are two EventLogging schemas that may collect data heavily,
and we want to make sure EL can handle the influx of data.
The two schemas collecting data are:
https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrok
https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrokError
and the list of pages affected is in:
wgq_page in enwiki.wikigrok_questions.
It would be great if someone from the dev side could let us know whether we
will need sampling.
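If sampling does turn out to be necessary, one crude option is a
client-side gate in front of the logging call. A minimal sketch,
assuming mw.eventLog.logEvent is the entry point in use and picking an
arbitrary 1-in-100 rate:

// Hypothetical 1-in-100 client-side sampling before logging.
// Fill in the event fields per Schema:MobileWebWikiGrok.
if ( Math.random() < 0.01 ) {
    mw.eventLog.logEvent( 'MobileWebWikiGrok', { /* event fields */ } );
}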
Thanks,
Leila
I want to check what effect MediaViewer had on file namespace edits.
Aggregating the standard MediaWiki dumps over all wikis seems like a pain;
is there a more convenient source for that data? Even better if it can be
filtered by the editcount of the user at the time of the edit.
I looked at the Edit* EventLogging schemas, but those are either fairly
recent or not used. Is there any other source this information could be
retrieved from?
thanks
Gergő
I would like to graph the correlation between file namespace page views and
MediaViewer image views. Back when MediaViewer was launched, I added a
namespace parameter to NavigationTiming to be able to track per-namespace
pageviews, but I messed up and it only got deployed around the time
MediaViewer was enabled on Commons, so we have no data for the early steps
of the deploy process.
Do you know of any other source for per-namespace pageview data that is
still available for the April-June 2014 period? Technically the raw
pagecount files contain the information, but aggregating those would be a
horribly complicated way of getting it. Does the Hadoop pageview data go
back that far?
thanks
Gergő
Hi,
Is there any research about the influence of autopromote variables on
participation, editor retention, and such things?
I am mainly talking about $wgAutoConfirmAge and $wgAutoConfirmCount.
See the current values at
http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
These are different in various languages, and they seem rather random to
me. The idea is supposed to be to prevent vandalism without hurting
wikiness and editor retention, but are the current values based on any
analytics?
I went over all the bugs that are mentioned in the file I linked to above.
All of them say "we had a discussion and we reached a consensus". I cannot
read all these languages, but my wild guess is that people just threw some
numbers around without basing them on analytics, and voted to accept them.
The discussion in the English Wikipedia[1] is, unsurprisingly, the longest
(47 A4 pages); I didn't read it all, but it doesn't seem to be based on any
metrics either.
Anecdotally, I can recall many more times when I, as a Wikipedian, had to
explain to people that they need to do a few more edits to get permission
to move pages than times I had to revert bad page moves by new editors, so
there is a possibility that the autopromote values are not actually very
good.
[1] https://en.wikipedia.org/wiki/Wikipedia:Autoconfirmed_Proposal/Poll
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Hi,
just a quick heads-up that since 2015-01-07 ~1:55, only <30% of the
EventLogging events are getting written to the database.
It seems a deployment went wrong and validation is no longer working
as expected: more than 70% of the messages no longer validate and
hence do not get written to the database.
The raw log files (pre-validation) are still getting written, so data
is not lost, and backfilling is possible.
Best regards,
Christian