See message below about a network outage currently affecting multiple servers in eqiad. The set of affected servers includes gadolinium and hafnium, so udp2log-based web request logging and EventLogging-based metric reporters (e.g., Navigation Timing stats) are affected.
---------- Forwarded message ----------
From: Brandon Black <bblack@wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops@lists.wikimedia.org>
We've lost network access (but not console access) to all the machines in eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This is mostly machines in a supporting role; no direct traffic front ends or app servers, etc. Phabricator is down as a result, as are a few monitoring-related bits and pieces.
In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4 was "removed" (log paste below). I haven't found any useful remote way to try to make that virtual chassis member restart yet. I'm not sure if it's worth waking anyone up in the middle of the night or anything at this point. Most likely this is going to involve some physical presence (or remote hands) at eqiad.
-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------
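For anyone who wants to pull the key event out of a longer capture, here is a minimal Python sketch that scans syslog-style lines like the ones pasted above for the CHASSISD_FRU_OFFLINE_NOTICE message. The sample lines are copied from the paste; the function name `fpc_offline_events` and the regular expression are illustrative assumptions, not part of any switch tooling.

```python
import re

# Sample lines copied from the log paste above.
LOG = """\
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
"""

# Hypothetical pattern: syslog timestamp, host name, then the offline notice.
OFFLINE = re.compile(
    r"^(?P<when>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) .*"
    r"CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC (?P<fpc>\d+) offline: (?P<reason>\w+)"
)


def fpc_offline_events(text):
    """Return (timestamp, host, fpc_number, reason) for each offline notice."""
    events = []
    for line in text.splitlines():
        match = OFFLINE.match(line)
        if match:
            events.append(
                (match["when"], match["host"], int(match["fpc"]), match["reason"])
            )
    return events


events = fpc_offline_events(LOG)
for when, host, fpc, reason in events:
    print(f"{when} {host}: FPC {fpc} went offline ({reason})")
```

On the sample lines this reports a single event, the removal of FPC 4, which matches the virtual chassis member for C4 described above.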
_______________________________________________
Ops mailing list
Ops@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops
Hi,
On Sun, Nov 30, 2014 at 12:34:27AM -0800, Ori Livneh wrote:
See message below about a network outage currently affecting multiple servers in eqiad.
The network is up again, and the affected machines are again reachable.
TL;DR:

+------------------------------+-----------+--------------------------------+
| Dataset                      | Affected? | Will be backfilled?            |
+------------------------------+-----------+--------------------------------+
| Analytics slave databases    | no        | ---                            |
| Analytics cluster            | no        | ---                            |
| Pagecounts-all-sites         | no        | ---                            |
| Pagecounts-raw               | yes       | yes, not completed yet         |
| TSVs                         | yes       | no                             |
| EventLogging database        | yes       | yes, done                      |
| EventLogging graphite graphs | yes       | no                             |
| geowiki                      | yes       | yes, done                      |
| Wikipedia Zero graphs        | yes       | Excluded 2014-11-30 from plots |
+------------------------------+-----------+--------------------------------+
I'll track updates on
https://phabricator.wikimedia.org/T76334
Best regards, Christian
* Pagecounts-raw
pagecounts-20141130-040000.gz
pagecounts-20141130-050000.gz
pagecounts-20141130-060000.gz
pagecounts-20141130-070000.gz
pagecounts-20141130-080000.gz
pagecounts-20141130-090000.gz
pagecounts-20141130-100000.gz
pagecounts-20141130-110000.gz
projectcounts-20141130-040000
projectcounts-20141130-050000
projectcounts-20141130-060000
projectcounts-20141130-070000
projectcounts-20141130-080000
projectcounts-20141130-090000
projectcounts-20141130-100000
projectcounts-20141130-110000
are bad.
We'll backfill them from pagecounts-all-sites.
If you're still using pagecounts-raw, please consider switching to pagecounts-all-sites:
https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites
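The list of bad files above can be derived from the outage window. The Python sketch below is a hypothetical helper (`affected_hourly_files` is not an existing tool). It assumes that the timestamp in a file name marks the end of the hour that file covers, and that the window is 2014-11-30T03:50 to 2014-11-30T10:13, as reported for the TSVs. Under those assumptions it yields the same eight hourly names per file type as listed.

```python
from datetime import datetime, timedelta


def affected_hourly_files(gap_start, gap_end, prefix="pagecounts", suffix=".gz"):
    """List hourly file names whose hour overlaps [gap_start, gap_end].

    Assumption: the timestamp in a file name labels the END of the hour
    that the file covers.
    """
    hour = timedelta(hours=1)
    # Start of the first hour touched by the gap.
    bucket_start = gap_start.replace(minute=0, second=0, microsecond=0)
    names = []
    while bucket_start <= gap_end:
        label = bucket_start + hour  # end-of-hour label (assumed convention)
        names.append(f"{prefix}-{label:%Y%m%d-%H%M%S}{suffix}")
        bucket_start += hour
    return names


gap_start = datetime(2014, 11, 30, 3, 50)
gap_end = datetime(2014, 11, 30, 10, 13)

pagecounts = affected_hourly_files(gap_start, gap_end)
projectcounts = affected_hourly_files(
    gap_start, gap_end, prefix="projectcounts", suffix=""
)

for name in pagecounts + projectcounts:
    print(name)
```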
* TSVs:
All udp2log streams are affected. Calling out only the most prominent ones:
** sampled-1000 TSVs
** mobile-sampled-100 TSVs
** zero TSVs
** edits TSVs
They are all missing data between 2014-11-30T03:50 and 2014-11-30T10:13.
Properly backfilling them from the cluster would be possible, but this would need serious data massaging. If no one says the data for 2014-11-30 is badly needed, I would not backfill the TSVs.
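To gauge how much of a daily TSV is affected, a small Python sketch like the following may help. It only does arithmetic on the gap boundaries given above; the names `in_gap` and `missing_fraction_of_day` are made up for illustration, and it assumes that the TSV timestamps and the gap boundaries are in the same time zone (presumably UTC).

```python
from datetime import datetime, timedelta

# Gap boundaries as stated above (time zone assumed to match the TSV data).
GAP_START = datetime(2014, 11, 30, 3, 50)
GAP_END = datetime(2014, 11, 30, 10, 13)


def in_gap(timestamp):
    """True if a record timestamp falls inside the missing window."""
    return GAP_START <= timestamp <= GAP_END


def missing_fraction_of_day(day_start):
    """Fraction of the 24 hours starting at day_start that the gap covers."""
    day_end = day_start + timedelta(days=1)
    overlap_start = max(GAP_START, day_start)
    overlap_end = min(GAP_END, day_end)
    overlap = max(overlap_end - overlap_start, timedelta(0))
    return overlap / timedelta(days=1)


gap_minutes = (GAP_END - GAP_START) // timedelta(minutes=1)
fraction = missing_fraction_of_day(datetime(2014, 11, 30))
print(f"gap: {gap_minutes} minutes, {fraction:.1%} of 2014-11-30")
```

Under these assumptions the gap is 383 minutes, or roughly 27% of the day, so daily totals for 2014-11-30 from these TSVs would undercount by about that share if traffic were evenly spread over the day.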
* EventLogging:
** Database is up and running and backfilled. No artifacts are expected.
** EventLogging stats on graphite
(Note that this item is only about the graphs in graphite. The data in the database (see above) is ok!)
The overall counts are ok and should not show artifacts. The per-schema counts are basically blank for 2014-11-30. Backfilling them would be really time-consuming, and the historic parts of those graphs do not seem to be used anyway. So I suggest not backfilling here.
* geowiki:
Data for the affected period has been backfilled.
* Wikipedia Zero graphs:
2014-11-30 has been added to the list of dates that will not show up in the Wikipedia Zero plots.
Thanks, Christian! :)
What do we use geowiki for, out of interest?
On 1 December 2014 at 06:59, Christian Aistleitner <christian@quelltextlich.at> wrote:
[...]
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
Thanks Christian. I do not believe that we need to backfill the TSVs that are filled from the udp2log stream.
Oliver -- GLEE uses the geo-edit data.
-Toby
On Mon, Dec 1, 2014 at 4:57 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Thanks, Christian! :)
What do we use geowiki for, out of interest?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation