Analytics December 2014

analytics@lists.wikimedia.org

37 participants
37 discussions

Digging into the December Metrics meeting pageview numbers

by C. Scott Ananian

So, first off: Ironholds made all the numbers used in this metrics meeting available in the tool at http://pentaho.wmflabs.org/pentaho/Home

I'm not going to repost the username/password here, but find me (or Ironholds?) on IRC if you're interested in exploring the data.

http://pentaho.wmflabs.org/pentaho/Home <username>,<password> -> create new -> new Saiku Analytics -> v 0.3

OK. With that said, here are some thoughts about the numbers, with some copy & paste from IRC:

Q:
RoanKattouw: Ironholds: Re the India language graph (97% of hits from India being to enwiki), we are now idly wondering what places are more diverse in those terms
RoanKattouw: Like, maybe the USA?
RoanKattouw: Is the Spanish-speaking internet more than 3% of the US internet?
RoanKattouw: cscott: Basically my question is, what is the % of enwiki hits in the US. Apparently for India it's 97%

A: zhwiki and/or eswiki are the top non-enwiki sites in the US; they account for about 1% of traffic. See https://en.wikipedia.org/wiki/User:Cscott/2014_December_metrics and cscott-us-proj.saiku on pentaho.

Q:
cscott: also i'm very curious about, say, the rise of iran traffic -- is that to enwiki or fawiki?
cscott: in general, is the global south reading enwiki? or is mobile traffic to the local wikis exploding?

A: Almost all due to enwiki traffic. See https://en.wikipedia.org/wiki/User:Cscott/2014_December_metrics, cscott-ir-proj-2.saiku on pentaho, and the attached graph.

Q:
cscott-free: another question: does the decrease in latin america correspond to a decrease in the eswiki project?

A: The presented slide listed the following among the "Top Decliners":

Country, Month page views (billion), Annual growth rate
Ecuador, 0.04, -30.5%
Venezuela, 0.09, -28.0%
Portugal, 0.05, -23.8%
Mexico, 0.34, -23.2%
Colombia, 0.14, -23.2%
Chile, 0.08, -22.3%
Brazil, 0.32, -21.0%
Peru, 0.06, -17.8%

Countries in the top 25% by total human PVs as of October 2014; annual growth rates based on linear model (May 2013-October 2014)

I'm still working on figuring out the answer to this one. As far as I can tell, eswiki page views are pretty flat, and eswiki page views in Ecuador (for instance) are down a little, but not by 30% annually. So there's something mysterious here.

Possibly related: commons page views in Latin America dropped sharply starting in 2014-06, after mediaviewer was turned on. But that doesn't seem to be quite enough.

--scott

-- (http://cscott.net)
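A note for readers trying to reproduce the "annual growth rate" column: the slide only says the rates are "based on linear model (May 2013-October 2014)" and does not give the formula. The sketch below shows one plausible reading, which is an assumption and not the method confirmed by the slide's authors: fit a least-squares line to the 18 monthly page view totals, then compare the fitted value in the last month with the fitted value twelve months earlier. The function names and the sample series are hypothetical.

```python
def linear_fit(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def annual_growth_rate(monthly_views):
    """Relative change of the fitted line over the final 12 months.

    Assumption: this is only one way to turn a linear fit into an
    annual rate; the slide does not say which definition was used.
    """
    if len(monthly_views) < 13:
        raise ValueError("need at least 13 monthly values")
    slope, intercept = linear_fit(monthly_views)
    last = len(monthly_views) - 1
    fitted_end = intercept + slope * last
    fitted_year_before = intercept + slope * (last - 12)
    return (fitted_end - fitted_year_before) / fitted_year_before


# Hypothetical series: 18 months of page views (billions), falling steadily.
example_views = [0.060 - 0.001 * month for month in range(18)]
print(f"{annual_growth_rate(example_views):.1%}")
```

Running the same calculation separately on per-project series (for example eswiki views from one country) would show whether a country-level decline is mirrored in a given project.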

9 years, 5 months

Re: [Analytics] [WikimediaMobile] Analysis of the hamburger icon

by Jon Robson

You can see this in comparison to other features. Yes, it is indeed more useful than the items in the menu, but as you can see in the graph, the features are nearly identical in terms of clicks tracked.

However, what struck me as odd was that the number of clicks varied drastically depending on the language (although correlated for all languages, they are not correlated per language). Yes, this could mean other things, such as that languages are more important in the given language and thus it is more useful (maybe the language is incomplete), and, tied to what you and Nemo suggested, that it is more prevalent on the screen. Yet I can't help but wonder if it also might hint at something to do with the icon's effectiveness, especially when I look at well-developed wikis such as Chinese and Japanese compared to English.

By the way, we don't have much fine-grained information about what type of pages these features were clicked on.

It might be useful to try a different logo on one of the projects, e.g. Japanese, and see how this is impacted by the change. If there is no drastic change, we could probably conclude that indeed my comparison sucks :)

We can certainly use clicks as a guideline of whether the icon is getting more effective.

I'm ccing analytics in case they have any views on this.

On Dec 5, 2014 2:26 AM, "Amir E. Aharoni" <amir.aharoni(a)mail.huji.ac.il> wrote:

> The similarity in the numbers is indeed striking, but I don't think that
> it says much about the perception of the hamburger icon. I suspect that for
> most readers the language button is more useful than *any* of the actions
> in the hamburger menu - home, random, nearby, watchlist, settings, log in.
> To confirm it, I'd love to see the numbers for these other actions.
>
> Also, as Nemo asks, it would be useful to see pages without language links
> separated, and to also see page length taken into account somehow - on a
> short page it is easier to see the languages button (less or no scrolling),
> and there's more motivation to tap it (the hope to read more in another
> language).
>

9 years, 5 months

Adventures in Clusterland 2014-11-24--2014-11-30

by Christian Aistleitner

Hi,

in the week from 2014-11-24–2014-11-30 Andrew and I [1] worked on the following items around the Analytics Cluster and Analytics related Ops:

* Catch-up and meetings around EventLogging issues.
* EventLogging's database writer not properly shutting down
* Wikipedia Zero graph comparability
* Network switch outage in eqiad

(details below)

Have fun,
Christian


* Catch-up and meetings around EventLogging issues.

There were quite a few catch-up discussions and meetings around the recent EventLogging issues. It seems we're all on the same page now.

* EventLogging's database writer not properly shutting down

When EventLogging's database throughput had to be increased ad hoc, the hot fix was known to come with not too robust exit synchronization. So in case of issues with the events, the database writer would not properly shut down and restart, but could be left hanging. This was known beforehand, and was accepted in order to bring EventLogging up again as soon as possible. The fix for it is not hard, but with the many follow-up meetings, it did not get deployed before the issue first struck [2]. Now with the follow-up meetings done, the fix got reviewed and deployed, and it has been working fine so far. We backfilled the database from plain-file logs for the affected period.

* Wikipedia Zero graph comparability

Wikipedia Zero is moving from the Analytics team's dashboards to on-wiki graphs on the (private) zerowiki. But the numbers on the graphs did not match. So we helped to identify which aspects of the different pageview definitions cause the mismatches in the graphs. It seems that the key differences are now understood.

* Network switch outage in eqiad

During the weekend, a network switch in eqiad went offline [3] and took key machines in the analytics infrastructure offline. We started [4] looking at the affected machines, measuring impact and backfilling. This is not done yet and will take more time.


[1] Jeff will refocus on Ops projects outside the realm of Analytics. Many thanks for your great work on Analytics cluster and Analytics related Ops!

[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141125-EventLo…

[3] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141130-Eqiad-R…
    https://phabricator.wikimedia.org/tag/incident-20141129-network/

[4] https://lists.wikimedia.org/pipermail/analytics/2014-November/002819.html
    https://lists.wikimedia.org/pipermail/analytics/2014-December/002821.html
    https://phabricator.wikimedia.org/T76334

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

9 years, 5 months

Adventures in Clusterland 2014-11-17--2014-11-23

by Christian Aistleitner

Hi,

in the week from 2014-11-17–2014-11-23 Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics related Ops:

* EventLogging hit throughput limit to database
* Unintended EventLogging deploy of faulty code
* Outage on master of EventLogging's database shard (db1020)
* Outage on master of EventLogging's database shard (db1046)
* Debugging Mobile UI dashboard
* Upgrades of first machines from the cluster to trusty
* Discussions with researchers on how they could take advantage of the cluster
* Allow multiple varnishkafkas on caches

(details below)

Have fun,
Christian


* EventLogging throughput limit to database

One of the teams instrumenting EventLogging silently and drastically ;-) increased the volume of events they are producing [1]. The total volume of events that the EventLogging infrastructure had to handle jumped from ~140 msgs/s to ~220 msgs/s. This was more than EventLogging's database writer could bring to the database. Only ~70% of events made it to the database. We isolated the issue and overcame the throughput limitation of EventLogging's database writer; the database writer (not EventLogging as a whole) can now handle way more events. In addition to that, the database got backfilled from the plain-file logs.

* Unintended EventLogging deploy of faulty code

It seems that during efforts to bring EventLogging up-to-date on the beta cluster, faulty code unintentionally got deployed to production [2]. With this faulty code base, EventLogging's database writer crashed several times. Known good code got deployed again, and the database got backfilled from plain-file logs.

* Outage on master of EventLogging's database shard (db1020)

m2-master's mysqld process aborted [3] and hence EventLogging had no database to write to. Ops quickly failed over to the slave db1046, and thereby addressed the issue, and we backfilled the EventLogging database from plain-file logs.

* Outage on master of EventLogging's database shard (db1046)

Shortly after m2-master got failed over to db1046 (see above), db1046 had issues around its threadpool [4]. EventLogging could not connect to the database, and consequently could not write events to it. Ops quickly fixed the issue, and we backfilled the EventLogging database from plain-file logs.

* Debugging Mobile UI dashboard

The mobile UI dashboard was having issues, and since it is based on EventLogging data, people assumed that the dashboard issues were caused by EventLogging's issues. We helped to debug the dashboard and pointed people to the real issue. EventLogging was not the culprit. Regardless, the relevant graph [5] is working again.

* Upgrades of first machines from the cluster to trusty

After the first efforts to upgrade the Analytics cluster to trusty during the previous week, the analytics1003 Cisco box no longer ran reliably over the weekend [6]. There were kernel panics; it is not yet fully clear what is going on there. The kernel panics seem to occur even if the machine is not running services.

analytics1033’s management interface is not working properly. It will be upgraded once this is fixed.

* Discussions with researchers on how they could take advantage of the cluster

With the increasing amount of data available, researchers are running into issues of how to query the data without grabbing too many resources. So discussions were started on how researchers can instrument the cluster, and for example how to use kafkatee instead of udp2log.

* Allow multiple varnishkafkas on caches

Up to now, only a single varnishkafka has been running on the caches. But in order to feed performance data into kafka, a second varnishkafka on the caches would help. Together with Ori, work was done to allow running more varnishkafkas on the caches. Look out for statsv :-)


[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLo…
    (The date in the incident report is from the previous week. Nonetheless, it's correct that we only started to work on it this week, as we only noticed while hunting down https://lists.wikimedia.org/pipermail/analytics/2014-November/002798.html . It is known that EventLogging monitoring has some holes. Closing some of them has been on the agenda for some time, and we also added it to the actionables on the incident reports.)

[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLo…

[3] Sadly no public incident report about the database incident. Only on the non-public ops list:
    https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html

[4] Sadly no public incident report about the database incident. Only on the non-public ops list:
    https://lists.wikimedia.org/mailman/private/ops/2014-November/044167.html

[5] http://mobile-reportcard.wmflabs.org/graphs/ui-daily

[6] https://phabricator.wikimedia.org/T1200

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
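For readers who want to sanity-check the throughput figures in the report above (~140 msgs/s before, ~220 msgs/s after, only ~70% of events stored), here is a minimal back-of-the-envelope sketch. It assumes the loss was steady and that the database writer simply saturated at a fixed rate; neither assumption is stated in the report, and the function names are hypothetical.

```python
def implied_writer_capacity(incoming_rate, stored_fraction):
    """Events per second that reached the database.

    Assumption: the writer saturated at a constant rate, so the stored
    share of the incoming stream approximates its ceiling.
    """
    return incoming_rate * stored_fraction


def missing_events_per_hour(incoming_rate, stored_fraction):
    """Events per hour that had to be backfilled from the plain-file logs."""
    return incoming_rate * (1 - stored_fraction) * 3600


# Approximate figures quoted in the report.
capacity = implied_writer_capacity(220, 0.70)
backlog = missing_events_per_hour(220, 0.70)
print(f"implied writer ceiling: ~{capacity:.0f} msgs/s")
print(f"events to backfill per hour: ~{backlog:.0f}")
```

Under these assumptions the implied ceiling sits only slightly above the earlier ~140 msgs/s load, which would be consistent with the limit going unnoticed until the volume jumped.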

9 years, 5 months

Adventures in Clusterland 2014-11-10--2014-11-16

by Christian Aistleitner

Hi,

apologies for the long pause since the last update.

In the week from 2014-11-10–2014-11-16 Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics related Ops:

* Talk about Kafka at Apache Kafka NYC user group
* High-availability test for EventLogging's database failed
* Upgrades of first machines from the cluster to trusty

(details below)

Have fun,
Christian


* Talk about Kafka at Apache Kafka NYC user group

Andrew gave a talk [1] about WMF's Kafka setup and challenges around it at the Apache Kafka NYC user group. That ended up getting great feedback not only on the talk itself, but also on instrumenting Python within the Hadoop ecosystem. So it helped in more than one way :-)

* High-availability test for EventLogging's database failed

Ops are in the process of moving the database that EventLogging writes to behind a high-availability proxy. A test for that failed [2] (a firewall got in the way) and EventLogging could not write events to the database for ~20 minutes. Ops fixed the firewalling, and we backfilled the database from the plain-file logs.

* Upgrades of first machines from the cluster to trusty

The first few machines got upgraded to trusty [3]. At first things were looking good. There was only a minor issue with grub, and that could be worked around. During that week, things looked mostly smooth for the Trusty upgrade.


[1] Google Glass :-) recorded video of the first 45 minutes of the talk is at:
    https://drive.google.com/folderview?id=0B0B2VcpkcY6wVFR3TFhIVEl5dW8&usp=sha…
    (downloadable for everyone who is signed in to Google :-/ If you know how to reformat that URL into a plain curl-able URL, please let me know)
    (I first thought that the video is missing audio, but audio is there. It's just very quiet.)

[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…

[3] https://phabricator.wikimedia.org/T1200

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian(a)quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

9 years, 5 months

My transition to Hadoop streaming (thanks to gage & ottomata!)

by Aaron Halfaker

Hey folks,

I just finished a blog post about how I'm incorporating hadoop streaming into my workflow.

http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-int…

TL;DR: I have strong opinions about Good Ways(TM) to process large datafiles in interesting ways and hadoop streaming will support them nicely. :)

Props to ottomata for spending a bunch of time helping me get up to speed with our cluster and to gage for making it easier to find hadoop's error messages.

-Aaron
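For readers unfamiliar with Hadoop streaming, the general pattern is that the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, with Hadoop sorting the mapper output by key in between. The sketch below illustrates that pattern only; it is not taken from the blog post, and the input layout (a tab-separated file whose first column is the key to count) is a hypothetical example.

```python
from itertools import groupby


def mapper(lines):
    """Emit "key<TAB>1" for the first column of each non-empty input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key,
    which is what the streaming framework provides to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Local dry run: sorted() stands in for Hadoop's shuffle-and-sort step.
sample = ["enwiki\tpage_a\n", "eswiki\tpage_b\n", "enwiki\tpage_c\n"]
for output_line in reducer(sorted(mapper(sample))):
    print(output_line)
```

In an actual job each function would live in its own script that iterates over `sys.stdin` and prints the yielded lines; being able to test the same logic locally with a plain sort is one of the practical attractions of the approach.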

9 years, 5 months

Fwd: [Ops] Network outage for rack C4 in eqiad

by Ori Livneh

See message below about a network outage currently affecting multiple servers in eqiad. The set of affected servers includes gadolinium and hafnium, so udp2log-based web request logging and EventLogging-based metric reporters (e.g., Navigation Timing stats) are affected.

---------- Forwarded message ----------
From: Brandon Black <bblack(a)wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops(a)lists.wikimedia.org>

We've lost network access (but not console access) to all the machines in eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This is mostly machines in a supporting role; no direct traffic front ends or app servers, etc. Phabricator is down as a result, as are a few monitoring-related bits and pieces.

In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4 was "removed" (log paste below). I haven't found any useful remote way to try to make that virtual chassis member restart yet. I'm not sure if it's worth waking anyone up in the middle of the night or anything at this point. Most likely this is going to involve some physical presence (or remote hands) at eqiad.

-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------

_______________________________________________
Ops mailing list
Ops(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops

9 years, 5 months
