Analytics November 2014

analytics@lists.wikimedia.org

29 participants
31 discussions

Adding “php” key to X-Analytics header
by Christian Aistleitner 13 Jan '15

Hi,

just a quick heads-up that Ops are about to add a “php” key to the X-Analytics header (i.e., for sampled-1000 logs, hive, ...):

https://gerrit.wikimedia.org/r/#/c/156793/

This key will hold the PHP implementation used [1].

Planned deployment is between 2014-09-01 and 2014-09-02.

Have fun,
Christian

[1] https://wikitech.wikimedia.org/wiki/X-Analytics#Keys

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:    christian(a)quelltextlich.at
4293 Gutau, Austria          Phone:    +43 7946 / 20 5 81
                             Fax:      +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------
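For consumers of the logs who want to pick up the new key, a minimal parsing sketch follows. It assumes the X-Analytics value is a ';'-separated list of key=value pairs; the function name parse_x_analytics and the example value "php=hhvm;foo=bar" are hypothetical and only serve as illustration, since the message above does not list the values the key will take.

```python
def parse_x_analytics(header_value):
    """Parse an X-Analytics style value into a dict.

    Assumption: the value is a ';'-separated list of 'key=value' pairs.
    """
    pairs = {}
    for item in header_value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty values and trailing ';'
        key, sep, value = item.partition("=")
        if not sep:
            continue  # skip malformed items that have no '='
        pairs[key.strip()] = value.strip()
    return pairs


# Hypothetical header value, for illustration only.
fields = parse_x_analytics("php=hhvm;foo=bar")
php_implementation = fields.get("php")  # None when the key is absent
```

Using .get() keeps the code working on log lines written before the key was deployed.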

My transition to Hadoop streaming (thanks to gage & ottomata!)
by Aaron Halfaker 01 Dec '14

Hey folks, I just finished a blog post about how I'm incorporating hadoop streaming into my workflow. http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-int… TL;DR: I have strong opinions about Good Ways(TM) to process large datafiles in interesting ways and hadoop streaming will support them nicely. :) Props to ottomata for spending a bunch of time helping me get up to speed with our cluster and to gage for making it easier to find hadoop's error messages. -Aaron
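For readers who have not used Hadoop streaming: a streaming job runs ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout, with the reducer receiving its input sorted by key. The sketch below is not taken from the blog post; the function names, the choice of counting the first tab-separated field, and the "map"/"reduce" command-line convention are all illustrative assumptions.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'key<TAB>1' for the first tab-separated field of each line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key = line.split("\t", 1)[0]
        yield f"{key}\t1"


def reduce_lines(lines):
    """Sum the counts per key. Input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because both steps are plain generators over lines, the same code can be checked locally without a cluster, e.g. list(reduce_lines(sorted(map_lines(open_file)))) for some already opened file.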

Fwd: [Ops] Network outage for rack C4 in eqiad
by Ori Livneh 01 Dec '14

See message below about a network outage currently affecting multiple servers in eqiad. The set of affected servers includes gadolinium and hafnium, so udp2log-based web request logging and EventLogging-based metric reporters (e.g., Navigation Timing stats) are affected.

---------- Forwarded message ----------
From: Brandon Black <bblack(a)wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops(a)lists.wikimedia.org>

We've lost network access (but not console access) to all the machines in eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This is mostly machines in a supporting role; no direct traffic front ends or app servers, etc. Phabricator is down as a result, as are a few monitoring-related bits and pieces.

In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4 was "removed" (log paste below). I haven't found any useful remote way to try to make that virtual chassis member restart yet. I'm not sure if it's worth waking anyone up in the middle of the night or anything at this point. Most likely this is going to involve some physical presence (or remote hands) at eqiad.

-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------

_______________________________________________
Ops mailing list
Ops(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops

RFC for media file request counts, please chime in
by Erik Zachte 26 Nov '14

Since 2008, Wikimedia has collected pageview counts for most pages on nearly all wikis. A longstanding request of stakeholders (editors, researchers, GLAM advocates) has been to publish similar counts for media files: images, sounds, videos.

A major obstacle to doing so was the existing traffic data collection software: Webstatscollector simply couldn't be scaled up further without incurring huge costs. In 2014 WMF engineers rolled out a new Hadoop-based infrastructure, which makes it possible to collect raw request counts for media files. So a few months after releasing extended pageview counts (with mobile/zero added), the time has come to produce similar data dumps for media files.

https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

Please comment onwiki.

Thanks
Erik Zachte

Writing EventLogging events to database failed on 2014-11-25 between 03:09 and ~midnight
by Christian Aistleitner 26 Nov '14

Hi,

I am sorry to announce yet another EventLogging outage :-(

EventLogging's database writer service failed to write events to the database between ~2014-11-25T03:09 and 2014-11-26T00:03.

The recent ad hoc increase of EventLogging's database throughput capacity, made to address the database writing bottleneck, came with the known issue of insufficiently robust exit synchronization of threads within the database writer. This exit synchronization issue got the database writer stuck and caused the outage.

A fix for that known issue has been sitting in gerrit since Sunday, but due to the many meetings and discussions around recent EventLogging issues, the fix has not yet been reviewed and deployed.

The still rather empty Incident Report is at

https://wikitech.wikimedia.org/wiki/Incident_documentation/20141125-EventLo…

I'll fill it with more information tomorrow.

Backfilling from the logs is already running, and should also finish tomorrow.

Best regards,
Christian
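To illustrate the kind of problem described above, here is a generic sketch of one common way to make a batching writer thread exit cleanly: a sentinel object on the queue tells the worker to flush and stop, and the producer always sends it. This is not the EventLogging code or the fix waiting in gerrit; all names (writer_loop, run_writer, write_batch, _STOP) and the batch size are assumptions made for the example.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to flush and exit


def writer_loop(events, write_batch, batch_size=3):
    """Drain `events`, passing groups of up to batch_size items to write_batch."""
    batch = []
    while True:
        item = events.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            write_batch(batch)
            batch = []
    if batch:
        write_batch(batch)  # flush the remainder before exiting


def run_writer(items, write_batch, batch_size=3):
    """Feed items to a writer thread and wait for it to finish."""
    events = queue.Queue()
    worker = threading.Thread(
        target=writer_loop, args=(events, write_batch, batch_size)
    )
    worker.start()
    try:
        for item in items:
            events.put(item)
    finally:
        events.put(_STOP)  # always signal exit, so join() cannot wait forever
        worker.join()
```

Putting the sentinel in a finally block means the worker is told to stop even if the producer fails part-way, which is the situation where a writer otherwise tends to get stuck.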

Round-up of recent EventLogging issues
by Christian Aistleitner 24 Nov '14

Hi,

with the recent events around EventLogging, I think a high-level round-up of what happened is overdue.

There have been four unrelated issues:

* Failed test to have EventLogging access its database through a high-availability proxy
* Event volume growing beyond what EventLogging's database writer could handle
* db1020 outage
* Untested EventLogging code got accidentally deployed

(Please find the details below)

For the last three of the four issues, backfilling is still pending, as the focus up to now was on getting EventLogging under control again. As it seems EventLogging is under control again (keeping fingers crossed for the second item in the above list), backfilling is next.

Sorry for the inconveniences,
Christian

* Failed test to have EventLogging access its database through a high-availability proxy

  Production's firewall got in the way [1]. Data got backfilled in the database from the logs, so no data got lost. The switch to a high-availability proxy happened in the meantime.

* Event volume growing beyond what EventLogging's database writer could handle

  Event volume increased by not quite 60% overnight and the database writer could not handle the increased volume [2]. The database writer got restructured and was deployed yesterday in the UTC evening. Since then the restructured database writer has easily handled the increased volume. But before declaring victory, we have to wait for a few days and see how it handles hours of increased activity. During the time that the database writer could not handle the event volume, logging to disk could keep up with the increased volume, so backfilling should work, but it is still pending.

* db1020 outage

  The database process of the m2 cluster (the one which EventLogging's database writer writes to) died [3]. Ops handled the issue promptly and failed over to a slave database. We have logs in plain files for the affected periods, so backfilling should work, but it is still pending.

* Untested EventLogging code got accidentally deployed

  It seems that, while trying to fix EventLogging for beta, an untested version accidentally got deployed to production. This accidentally deployed version stopped writing to the database from time to time and then started working again [4]. A working version has been deployed again. We have logs in plain files for the affected periods, so backfilling should work, but it is still pending.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLo…
[3] https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLo…

Writing EventLogging events to database failed on 2014-11-22 for between ~19:30 and ~21:00
by Christian Aistleitner 24 Nov '14

Hi,

as users of gerrit & Co. probably noticed, the m2 master database had issues again on 2014-11-22, which left EventLogging unable to write events to the database on 2014-11-22 between about 19:30 and 21:00.

Data for that period has been backfilled from logs again, so the database should again hold good numbers for that period.

Just wanted to let you know what happened, in case you notice drops in graphs / dashboards that got generated between the incident and now.

Sorry for the inconveniences,
Christian

Where to check ongoing month traffic stats?
by Gilles Dubuc 21 Nov '14

The report card seems to only show data for the past month. Is there something I could check to verify if there was an overall drop in reader traffic over the last week? The reason I ask is that Media Viewer actions seem to be dropping across the board since November 13th: http://multimedia-metrics.wmflabs.org/dashboards/mmv This trend is seen on all wikis. I'm trying to find the cause of this problem, because we're launching significant UI changes to all wikis today and we'll need to make sure that our EventLogging figures are accurate.

Writing EventLogging events to database failed on 2014-11-18 for ~50 minutes between 14:14 and 15:02
by Christian Aistleitner 21 Nov '14

Hi,

the m2 master crashed today (investigation still ongoing), which left EventLogging unable to write events to the database on 2014-11-18 between 14:14 and 15:02.

The data for that period is not lost, but is available in backup files, waiting to get injected again into the database.

Just wanted to let you know what happened, in case you notice drops in graphs / dashboards during that period.

Sorry for the inconveniences,
Christian

Analytics Dev Team Commitments 2014-10-30 -- 2014-11-11
by Kevin Leduc 19 Nov '14

Hello,

We kicked off our next sprint this morning, with the help of some release planning executed during the last 2 weeks. The sprint status is here:

http://sb.wmflabs.org/t/analytics-developers/2014-10-30/

The focus of this sprint is working on the backend in preparation to display new data in Vital Signs.

Bug ID  Component     Points  Summary
72740   Dashiki       34      Story: Vital Signs User selects the Daily Pageviews metrics
72741   EventLogging  0       List tables/schemas with data retention needs
72642   EventLogging  0       Story: Identify and direct the purging of Event logging raw logs older than 90 days in stat1002
67450   EventLogging  34      database consumer could batch inserts (sometimes)
72746   Wikimetrics   5       Story: WikimetricsUser tags a cohort using a pre-defined tag
72635   Wikimetrics   13      report table performance, cleanup, and number of items

That’s 86 points in 4 stories. The bugs with 0 points are tasks for the team to track and follow up on, and the work mostly falls on other teams.

Regards,
Kevin Leduc
