Hi,
just a quick heads-up that Ops are about to add a “php” key to the
X-Analytics header (which shows up in e.g. the sampled-1000 logs, Hive, ...):
https://gerrit.wikimedia.org/r/#/c/156793/
This key will hold the PHP implementation in use [1].
Planned deployment is between 2014-09-01 and 2014-09-02.
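For anyone who wants to pick the new key out of logged header values, here is a minimal sketch. It assumes the header value is a semicolon-separated list of key=value pairs; the function name and the example value are made up for illustration, so check [1] for the real keys and values.

```python
def parse_x_analytics(header_value):
    """Parse an X-Analytics header value into a dict.

    Assumes a 'key=value;key=value' layout (an assumption, see [1]).
    Items without '=' or without a key are skipped.
    """
    pairs = {}
    for item in header_value.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep and key:
            pairs[key] = value
    return pairs


# Hypothetical header value, not taken from real logs.
example = "php=example-impl;other=1"
print(parse_x_analytics(example).get("php"))
```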
Have fun,
Christian
[1] https://wikitech.wikimedia.org/wiki/X-Analytics#Keys
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hey folks,
I just finished a blog post about how I'm incorporating hadoop streaming
into my workflow.
http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-int…
TL;DR: I have strong opinions about Good Ways(TM) to process large
datafiles in interesting ways and hadoop streaming will support them
nicely. :)
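To give a rough idea of what a streaming job looks like (this is a generic sketch, not the workflow from the blog post): Hadoop streaming feeds input lines to a mapper on stdin, sorts the mapper's tab-separated key/value output by key, and feeds that to a reducer on stdin. The script layout, the "map"/"reduce" arguments and the count-by-first-field task are assumptions chosen for illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit (key, 1) per input line, keyed on the first tab-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield fields[0], 1


def parse_pairs(lines):
    """Turn 'key<TAB>count' lines back into (key, int) pairs."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reducer(pairs):
    """Sum counts per key. Pairs must arrive sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def main(argv):
    # Hypothetical convention: the role is passed as the first argument.
    role = argv[1] if len(argv) > 1 else None
    if role == "map":
        for key, value in mapper(sys.stdin):
            print(f"{key}\t{value}")
    elif role == "reduce":
        for key, total in reducer(parse_pairs(sys.stdin)):
            print(f"{key}\t{total}")


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as job.py, the same script can be tried locally without a cluster: cat input.tsv | python job.py map | sort | python job.py reduce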
Props to ottomata for spending a bunch of time helping me get up to speed
with our cluster and to gage for making it easier to find hadoop's error
messages.
-Aaron
See message below about a network outage currently affecting multiple
servers in eqiad. The set of affected servers includes gadolinium
and hafnium, so udp2log-based web request logging and EventLogging-based
metric reporters (e.g., Navigation Timing stats) are affected.
---------- Forwarded message ----------
From: Brandon Black <bblack(a)wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops(a)lists.wikimedia.org>
We've lost network access (but not console access) to all the machines in
eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This
is mostly machines in a supporting role; no direct traffic front ends or
app servers, etc. Phabricator is down as a result, as are a few
monitoring-related bits and pieces.
In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4
was "removed" (log paste below). I haven't found any useful remote way to
try to make that virtual chassis member restart yet. I'm not sure if it's
worth waking anyone up in the middle of the night or anything at this
point. Most likely this is going to involve some physical presence (or
remote hands) at eqiad.
-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing
connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode
M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license
service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap
generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5,
jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @
4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE:
Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]:
CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC:
ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete:
member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL
resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No
such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0)
started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768
came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768
came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768
came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768
went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768
came up
Nov 30 03:50:17 asw-c-eqiad fpc5
MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE)
failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5
MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid
Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE)
failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539:
l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------
Since 2008, Wikimedia has collected pageview counts for most pages on nearly
all wikis. A longstanding request of stakeholders (editors, researchers, GLAM
advocates) has been to publish similar counts for media files: images,
sounds, videos. A major obstacle was the existing traffic data collection
software: Webstatscollector simply couldn't be scaled up further without
incurring huge costs. In 2014, WMF engineers rolled out a new Hadoop-based
infrastructure, which makes it possible to collect raw request counts for
media files. So a few months after releasing extended pageview counts (with
mobile/zero added), the time has come to produce similar data dumps for
media files.
https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
Please comment onwiki.
Thanks
Erik Zachte
Hi,
I am sorry to announce yet another EventLogging outage :-(
EventLogging's database writer service failed to write events to the
database between ~2014-11-25T03:09 and 2014-11-26T00:03.
The recent ad hoc increase of EventLogging's database throughput
capacity, made to address the database writing bottleneck, came
with a known issue: the exit synchronization of threads within the
EventLogging database writer is not very robust. This exit
synchronization issue got the database writer stuck and caused the outage.
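For readers unfamiliar with this class of problem, here is a minimal sketch of one common way to synchronize the exit of a writer thread: a sentinel on the queue plus a join. This is not the EventLogging code and not the fix that is waiting in gerrit; all names (writer_loop, run, write_batch, batch_size) are made up for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer thread to exit


def writer_loop(events, write_batch, batch_size=100):
    """Drain `events` and flush batches until the sentinel arrives."""
    batch = []
    while True:
        item = events.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            write_batch(batch)
            batch = []
    if batch:
        write_batch(batch)  # flush the remainder before exiting


def run(incoming, write_batch, batch_size=100):
    """Feed `incoming` events to a writer thread and shut it down cleanly."""
    events = queue.Queue()
    worker = threading.Thread(
        target=writer_loop, args=(events, write_batch, batch_size)
    )
    worker.start()
    try:
        for event in incoming:
            events.put(event)
    finally:
        events.put(_STOP)  # always signal shutdown, even if the producer fails
        worker.join()      # wait for the final flush


written = []
run(range(5), lambda batch: written.append(list(batch)), batch_size=2)
print(written)
```

The point of the sentinel is that the writer never has to guess whether more events are coming: it exits only after it has seen everything queued before the shutdown signal.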
A fix for that known issue has been sitting in Gerrit since Sunday, but due
to the many meetings and discussions around recent EventLogging
issues, the fix has not yet been reviewed and deployed.
The still rather empty Incident Report is at
https://wikitech.wikimedia.org/wiki/Incident_documentation/20141125-EventLo…
I'll fill it with more information tomorrow.
Backfilling from the logs is already running, and should also finish
tomorrow.
Best regards,
Christian
Hi,
with the recent events around EventLogging, I think a high-level
round-up of what happened is overdue.
There have been four unrelated issues:
* Failed test to have EventLogging access its database through a high-availability proxy
* Event volume growing beyond what EventLogging's database writer could handle
* db1020 outage
* Untested EventLogging code got accidentally deployed
(Please find the details below)
For the last three of the four issues, backfilling is still pending,
as focus up to now was on getting EventLogging under control again. As
it seems EventLogging is under control again (keeping fingers crossed
for the second item in the above list), backfilling is next.
Sorry for the inconveniences,
Christian
* Failed test to have EventLogging access its database through a high-availability proxy
Production's firewall got in the way [1].
Data got backfilled in the database from the logs. So no data got lost.
The switch to a high-availability proxy happened in the meantime.
* Event volume growing beyond what EventLogging's database writer could handle
Event volume increased by not quite 60% overnight and the database
writer could not handle the increased volume [2].
The database writer was restructured and deployed yesterday evening (UTC).
Since then, the restructured database writer has easily handled the
increased volume. But before declaring victory, we have to wait a few days
and see how it handles hours of increased activity.
During the time that the database writer could not handle the event
volume, logging to disk could keep up with the increased volume, so
backfilling should work, but it is still pending.
* db1020 outage
The database process of the m2 cluster (the one which EventLogging's
database writer writes to) died [3]. Ops handled the issue promptly
and failed over to a slave database.
We have logs in plain files for the affected periods. So backfilling
should work, but it is still pending.
* Untested EventLogging code got accidentally deployed
It seems that while trying to fix EventLogging for beta, an untested
version accidentally got deployed to production. This version
intermittently stopped writing to the database and then started
working again [4].
A working version has been deployed again.
We have logs in plain files for the affected periods. So backfilling
should work, but is still pending.
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLo…
[3] https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLo…
Hi,
as users of gerrit & Co. probably noticed, the m2 master database had
issues again on 2014-11-22, and caused EventLogging to not be able to
write events to the database on 2014-11-22 between
about 19:30 and 21:00.
Data for that period has been backfilled from logs again, so the
database should again hold good numbers for that period.
Just wanted to let you know what happened, in case you notice drops in
graphs / dashboards that got generated between the incident and now.
Sorry for the inconveniences,
Christian
The report card seems to only show data for the past month. Is there
something I could check to verify if there was an overall drop in reader
traffic over the last week?
The reason I ask is that Media Viewer actions seem to be dropping across
the board since November 13th:
http://multimedia-metrics.wmflabs.org/dashboards/mmv
This trend is seen on all wikis. I'm trying to find the cause of this
problem, because we're launching significant UI changes to all wikis today
and we'll need to make sure that our EventLogging figures are accurate.
Hi,
the m2 master crashed today (investigation still ongoing), and caused
EventLogging to not be able to write events to the database on
2014-11-18 between 14:14 and 15:02.
The data for that period is not lost, but is available in backup
files, waiting to get injected again into the database.
Just wanted to let you know what happened, in case you notice drops in
graphs / dashboards during that period.
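If you want to mark the affected data points in your own graphs, a small sketch like the following could help. The window is the one given above; treating the times as UTC and the (timestamp, value) layout of the data points are assumptions.

```python
from datetime import datetime

# Outage window from this message; UTC is an assumption.
OUTAGE_START = datetime(2014, 11, 18, 14, 14)
OUTAGE_END = datetime(2014, 11, 18, 15, 2)


def in_outage(timestamp):
    """Return True if the timestamp falls inside the outage window."""
    return OUTAGE_START <= timestamp <= OUTAGE_END


def flag_points(points):
    """Attach an 'affected' flag to each (timestamp, value) data point."""
    return [(ts, value, in_outage(ts)) for ts, value in points]
```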
Sorry for the inconveniences,
Christian
Hello,
We kicked off our next sprint this morning, with the help of some release
planning executed during the last 2 weeks. The sprint status is here:
http://sb.wmflabs.org/t/analytics-developers/2014-10-30/
The focus of this sprint is working on the backend in preparation to
display new data in Vital Signs.
Bug ID | Component    | Points | Summary
72740  | Dashiki      | 34     | Story: Vital Signs User selects the Daily Pageviews metrics
72741  | EventLogging | 0      | List tables/schemas with data retention needs
72642  | EventLogging | 0      | Story: Identify and direct the purging of Event logging raw logs older than 90 days in stat1002
67450  | EventLogging | 34     | database consumer could batch inserts (sometimes)
72746  | Wikimetrics  | 5      | Story: WikimetricsUser tags a cohort using a pre-defined tag
72635  | Wikimetrics  | 13     | report table performance, cleanup, and number of items
That’s 86 points in 4 stories.
The bugs with 0 points are tasks for the team to track and follow up on,
and the work mostly falls on other teams.
Regards,
Kevin Leduc