So, first off: Ironholds made all the numbers used in this metrics
meeting available in the tool at
http://pentaho.wmflabs.org/pentaho/Home
I'm not going to repost the username/password here, but find me (or
Ironholds?) on IRC if you're interested in exploring the data.
http://pentaho.wmflabs.org/pentaho/Home <username>,<password> ->
create new -> new Saiku Analytics -> v 0.3
Ok. With that said, here are some thoughts about the numbers, with
some copy & paste from IRC:
Q:
RoanKattouw: Ironholds: Re the India language graph (97% of hits from
India being to enwiki), we are now idly wondering what places are more
diverse in those terms
RoanKattouw: Like, maybe the USA?
RoanKattouw: Is the Spanish-speaking internet more than 3% of the US internet?
RoanKattouw: cscott: Basically my question is, what is the % of enwiki
hits in the US. Apparently for India it's 97%
A:
zhwiki and/or eswiki are the top non-enwiki sites in the US; they
account for about 1% of traffic.
See https://en.wikipedia.org/wiki/User:Cscott/2014_December_metrics
and cscott-us-proj.saiku on pentaho.
Q:
cscott: also i'm very curious about, say, the rise of iran traffic --
is that to enwiki or fawiki?
cscott: in general, is the global south reading enwiki? or is mobile
traffic to the local wikis exploding?
A: Almost all due to enwiki traffic.
See https://en.wikipedia.org/wiki/User:Cscott/2014_December_metrics,
cscott-ir-proj-2.saiku on pentaho, and attached graph.
Q:
cscott-free: another question: does the decrease in latin america
correspond to a decrease in the eswiki project?
A:
The presented slide listed the following among the "Top Decliners":
Country, Month page views (billion), Annual growth rate
Ecuador, 0.04, -30.5%
Venezuela, 0.09, -28.0%
Portugal, 0.05, -23.8%
Mexico, 0.34, -23.2%
Colombia, 0.14, -23.2%
Chile, 0.08, -22.3%
Brazil, 0.32, -21.0%
Peru, 0.06, -17.8%
Countries in the top 25% by total human PVs as of October 2014; annual
growth rates based on linear model (May 2013-October 2014)
I'm still working on figuring out the answer to this one. As far as I
can tell, eswiki page views are pretty flat, and eswiki page views in
Ecuador (for instance) are down a little, but not by 30% annually. So
there's something mysterious here.
Possibly related: commons page views in latin america dropped sharply
starting in 2014-06, after mediaviewer was turned on. But that
doesn't seem to be quite enough.
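
For anyone who wants to cross-check figures like these: the slide does not
spell out how the "annual growth rate based on linear model" is computed, so
the following is only a sketch under one plausible reading, namely a
least-squares line fitted to the monthly page views, with twelve months of
slope expressed relative to the fitted value at the start of the period. The
function name and the sample numbers are made up for illustration.

```python
def annual_growth_rate(monthly_views):
    """Fit a least-squares line to monthly values and return the yearly
    change (12 * slope) relative to the fitted value of the first month.

    This is an assumed definition, not necessarily the one used on the slide.
    """
    n = len(monthly_views)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2
    mean_y = sum(monthly_views) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(monthly_views))
    slope = sxy / sxx
    start = mean_y - slope * mean_x  # fitted value at month 0
    if start == 0:
        raise ValueError("fitted start value is zero; rate is undefined")
    return 12 * slope / start


# Hypothetical series: 18 months (like May 2013-October 2014), losing
# 2 units per month from a starting level of 100.
example = [100 - 2 * month for month in range(18)]
print(round(annual_growth_rate(example), 4))
```

Comparing such a per-project rate (eswiki only) with the per-country rate
would show whether the two declines are of the same order.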
--scott
--
(http://cscott.net)
You can see this in comparison to other features.
Yes, it is indeed more useful than items in the menu, but as you can see in
the graph, the features are nearly identical in terms of clicks tracked.
However, what struck me as odd was that the number of clicks varied
drastically depending on the language (although correlated for all
languages, they are not correlated per language).
Yes, this could mean other things, such as languages are more important in
the given language and thus it is more useful (maybe the language is
incomplete) and, tied to what you and Nemo suggested, it is more prevalent on
the screen. Yet I can't help but wonder if it also might hint at something
to do with the icon's effectiveness, especially when I look at well-developed
wikis such as Chinese and Japanese when compared to English.
By the way, we don't have that much fine-grained information with regard
to what type of pages they clicked these features on.
It might be useful to try a different logo on one of the projects e.g.
Japanese and see how this gets impacted by the change. If there is no
drastic change we could probably conclude that indeed my comparison sucks
:) We can certainly use clicks as a guideline of whether the icon is
getting more effective.
I'm ccing analytics in case they have any views on this.
On Dec 5, 2014 2:26 AM, "Amir E. Aharoni" <amir.aharoni(a)mail.huji.ac.il>
wrote:
> The similarity in the numbers is indeed striking, but I don't think that
> it says much about the perception of the hamburger icon. I suspect that for
> most readers the language button is more useful than *any* of the actions
> in the hamburger menu - home, random, nearby, watchlist, settings, log in.
> To confirm it, I'd love to see the the numbers for these other actions.
>
> Also, as Nemo asks, it would be useful to see pages without language links
> separated, and to also see page length taken into account somehow - on a
> short page it is easier to see the languages button (less or no scrolling),
> and there's more motivation to tap it (the hope to read more in another
> language).
>
Hi,
in the week from 2014-11-24–2014-11-30 Andrew and I [1] worked on the
following items around the Analytics Cluster and Analytics related
Ops:
* Catch-up and meetings around EventLogging issues.
* EventLogging's database writer not properly shutting down
* Wikipedia Zero graph comparability
* Network switch outage in eqiad
(details below)
Have fun,
Christian
* Catch-up and meetings around EventLogging issues.
There were quite a few catch-up discussions and meetings around the
recent EventLogging issues. It seems we're all on the same page now.
* EventLogging's database writer not properly shutting down
When we had to increase EventLogging's database throughput ad hoc, the
hot fix was known to come with not too robust exit synchronization. So
in case of issues with the events, the database writer would not
properly shut down and restart, but could be left hanging. This had
been known beforehand, and was accepted in order to bring EventLogging up
again as soon as possible.
The fix for it is not hard, but with the many follow-up meetings, it
did not get deployed before the issue first struck [2]. Now with the
follow-up meetings done, the fix got reviewed and deployed, and it has been
working fine so far.
We backfilled the database from plain-file logs for the affected period.
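
The exit-synchronization problem described above is a common one for a
consumer thread that writes queued events to a database. As a generic
illustration only (this is not EventLogging's actual code; `run_writer`,
`start_writer`, `stop_writer`, `insert_batch` and the sentinel are
hypothetical names), a writer loop can be made to terminate both on an
explicit stop request and on a write error, so that a supervisor can restart
it instead of finding it hanging:

```python
import queue
import threading

_STOP = object()  # sentinel placed on the queue to request shutdown


def run_writer(events, insert_batch, errors):
    """Consume events until _STOP arrives or insert_batch raises."""
    while True:
        item = events.get()
        if item is _STOP:
            return
        try:
            insert_batch([item])
        except Exception as exc:
            # Record the failure and exit, rather than staying alive
            # in a state where nothing gets written any more.
            errors.append(exc)
            return


def start_writer(insert_batch):
    """Start the writer thread; returns (queue, error list, thread)."""
    events = queue.Queue()
    errors = []
    thread = threading.Thread(
        target=run_writer, args=(events, insert_batch, errors)
    )
    thread.start()
    return events, errors, thread


def stop_writer(events, thread, timeout=5.0):
    """Request shutdown; returns True if the thread actually exited."""
    events.put(_STOP)
    thread.join(timeout)
    return not thread.is_alive()
```

A supervisor would check `thread.is_alive()` and the `errors` list, and start
a fresh writer when the old one has exited.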
* Wikipedia Zero graph comparability
Wikipedia Zero is moving from the Analytics team's dashboards to
on-wiki graphs on the (private) zerowiki. But the numbers on the
graphs did not match. So we helped to identify which aspects of the
different pageview definitions cause the mismatches in the graphs. It
seems that the key differences are now understood.
* Network switch outage in eqiad
During the weekend, a network switch in eqiad went offline [3] and
took key machines in the analytics infrastructure offline. We started
[4] looking at the affected machines, measuring impact and
backfilling.
This is not done yet and will take more time.
[1] Jeff will refocus on Ops projects outside the realm of
Analytics. Many thanks for your great work on Analytics cluster and
Analytics related Ops!
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141125-EventLo…
[3] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141130-Eqiad-R…
    https://phabricator.wikimedia.org/tag/incident-20141129-network/
[4] https://lists.wikimedia.org/pipermail/analytics/2014-November/002819.html
    https://lists.wikimedia.org/pipermail/analytics/2014-December/002821.html
    https://phabricator.wikimedia.org/T76334
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
in the week from 2014-11-17–2014-11-23 Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* EventLogging hit throughput limit to database
* Unintended EventLogging deploy of faulty code
* Outage on master of EventLogging's database shard (db1020)
* Outage on master of EventLogging's database shard (db1046)
* Debugging Mobile UI dashboard
* Upgrades of first machines from the cluster to trusty
* Discussions with researchers on how they could take advantage of the cluster
* Allow multiple varnishkafkas on caches
(details below)
Have fun,
Christian
* EventLogging hit throughput limit to database
One of the teams instrumenting EventLogging silently and
drastically ;-) increased the volume of events they are producing [1].
The total volume of events that the EventLogging infrastructure had to
handle jumped from ~140 msgs/s to ~220 msgs/s. This was more than
EventLogging's database writer could bring to the database.
Only ~70% of events made it to the database.
We isolated the issue, overcame the throughput limitation of
EventLogging's database writer and the database writer
(not EventLogging as a whole) can now handle way more events.
In addition to that, the database got backfilled from the plain-file
logs.
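
Backfilling from the plain-file logs comes up repeatedly in these reports.
The reports do not describe the procedure, so the following is just a sketch
of the general idea, under the assumptions that the log holds one JSON event
per line, that each event carries a unique `uuid` field, and that
re-inserting an already stored event must be a no-op. The table layout and
the function name are invented for the example, and SQLite stands in for the
real database.

```python
import json
import sqlite3


def backfill(conn, log_lines):
    """Insert events from log_lines that are not stored yet.

    Returns the number of rows actually inserted. Events whose uuid is
    already present are skipped, so the function can be re-run safely.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (uuid TEXT PRIMARY KEY, payload TEXT)"
    )
    inserted = 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the log
        event = json.loads(line)
        cursor = conn.execute(
            "INSERT OR IGNORE INTO events (uuid, payload) VALUES (?, ?)",
            (event["uuid"], line),
        )
        inserted += cursor.rowcount  # 0 when the uuid already existed
    conn.commit()
    return inserted
```

Keying on a unique event id is what makes it safe to replay a whole log file
for a partially affected period.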
* Unintended EventLogging deploy of faulty code
It seems that during efforts to bring EventLogging up to date on the
beta cluster, faulty code unintentionally got deployed to
production [2]. With this faulty code base, EventLogging's database
writer crashed several times.
Known good code got deployed again, and the database got back-filled
from plain-file logs.
* Outage on master of EventLogging's database shard (db1020)
m2-master's mysqld process aborted [3] and hence EventLogging had no
database to write to. Ops quickly failed over to the slave db1046,
and thereby addressed the issue, and we backfilled the EventLogging
database from plain-file logs.
* Outage on master of EventLogging's database shard (db1046)
Shortly after m2-master got failed over to db1046 (see above), db1046
had issues around its threadpool [4]. EventLogging could not connect to
the database, and consequently could not write events to it. Ops
quickly fixed the issue, and we backfilled the EventLogging database
from plain-file logs.
* Debugging Mobile UI dashboard
The mobile UI dashboard was having issues, and since it is based on
EventLogging data, people assumed that the dashboard issues were caused
by EventLogging's issues. We helped to debug the dashboard and
pointed people to the real issue. EventLogging was not the culprit.
Regardless, the relevant graph [5] is working again.
* Upgrades of first machines from the cluster to trusty
After the first efforts to upgrade the Analytics cluster to trusty
during the previous week, the analytics1003 Cisco box no longer ran
reliably over the weekend [6]. There were kernel panics, and it is not yet
fully clear what is going on there. The kernel panics seem to occur
even if the machine is not running services.
analytics1033’s management interface is not working properly. It will
be upgraded once this is fixed.
* Discussions with researchers on how they could take advantage of the cluster
With the increasing amount of data available, researchers are running
into issues of how to query the data without grabbing too many
resources. So discussions were started on how researchers can
instrument the cluster, and for example how to use kafkatee instead of
udp2log.
* Allow multiple varnishkafkas on caches
Up to now, only a single varnishkafka has been running on the
caches. But in order to feed performance data into kafka, a second
varnishkafka on the caches would help. Together with Ori, work was
done to allow running more varnishkafkas on the caches.
Look out for statsv :-)
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLo…
(The date in the incident report is from the previous
week. Nonetheless, it's correct that we only started to work on it
this week, as we only noticed it while hunting down
https://lists.wikimedia.org/pipermail/analytics/2014-November/002798.html .
It is known that EventLogging monitoring has some holes. Closing
some of them has been on the agenda for some time, and we also added it to
the actionables on the incident reports.)
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLo…
[3] Sadly no public incident report about the database incident. Only
on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] Sadly no public incident report about the database incident. Only
on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/044167.html
[5] http://mobile-reportcard.wmflabs.org/graphs/ui-daily
[6] https://phabricator.wikimedia.org/T1200
---------------------------------------------------------------
Hi,
apologies for the long pause since the last update.
In the week from 2014-11-10–2014-11-16 Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Talk about Kafka at Apache Kafka NYC user group
* High-availability test for EventLogging's database failed
* Upgrades of first machines from the cluster to trusty
(details below)
Have fun,
Christian
* Talk about Kafka at Apache Kafka NYC user group
Andrew gave a talk [1] about WMF's Kafka setup and challenges around
it at the Apache Kafka NYC user group. That ended up bringing in great
feedback not only on the talk itself, but also on instrumenting Python
within the Hadoop ecosystem.
So it helped in more than one way :-)
* High-availability test for EventLogging's database failed
Ops are in the process of moving the database that EventLogging writes to
behind a high-availability proxy. A test for that failed [2]
(a firewall got in the way) and EventLogging could not
write events to the database for ~20 minutes.
Ops fixed the firewalling, and we backfilled the database from the
plain-file logs.
* Upgrades of first machines from the cluster to trusty
The first few machines got upgraded to trusty [3].
At first things were looking good, with only a minor issue with grub
that could be worked around.
During that week, things looked mostly smooth for the Trusty upgrade.
[1] A Google Glass :-) recording of the first 45 minutes of the
talk is at:
https://drive.google.com/folderview?id=0B0B2VcpkcY6wVFR3TFhIVEl5dW8&usp=sha…
(downloadable for everyone who is signed in to Google :-/ If you know how
to reformat that URL into a plain curl-able URL, please let me know)
(I first thought that the video was missing audio, but audio is
there. It's just very quiet.)
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
[3] https://phabricator.wikimedia.org/T1200
---------------------------------------------------------------
Hey folks,
I just finished a blog post about how I'm incorporating hadoop streaming
into my workflow.
http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-int…
TL;DR: I have strong opinions about Good Ways(TM) to process large
datafiles in interesting ways and hadoop streaming will support them
nicely. :)
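
For readers who have not used it: Hadoop streaming runs arbitrary
executables as mapper and reducer, feeding them lines on stdin and reading
tab-separated key/value lines from stdout, with the reducer input sorted by
key. The sketch below is not taken from the blog post; it is a generic,
hypothetical example (counting input lines per project, assuming
tab-separated input whose first field is the project name) meant only to
show the shape of such scripts.

```python
import itertools


def map_lines(lines):
    """Mapper: emit 'project<TAB>1' for every non-empty input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield f"{fields[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key.

    Relies on the input being sorted by key, which is what the reducer
    receives in a streaming job.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local dry run that imitates the map -> sort -> reduce pipeline.
sample = ["enwiki\tFoo\n", "eswiki\tBar\n", "enwiki\tBaz\n"]
for out in reduce_lines(sorted(map_lines(sample))):
    print(out)
```

In a real job each function would sit in its own small script that iterates
over `sys.stdin` and prints the results; keeping the logic in plain
generators like this makes it easy to test locally before going near the
cluster.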
Props to ottomata for spending a bunch of time helping me get up to speed
with our cluster and to gage for making it easier to find hadoop's error
messages.
-Aaron
See message below about a network outage currently affecting multiple
servers in eqiad. The set of affected servers includes gadolinium
and hafnium, so udp2log-based web request logging and EventLogging-based
metric reporters (e.g., Navigation Timing stats) are affected.
---------- Forwarded message ----------
From: Brandon Black <bblack(a)wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops(a)lists.wikimedia.org>
We've lost network access (but not console access) to all the machines in
eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This
is mostly machines in a supporting role; no direct traffic front ends or
app servers, etc. Phabricator is down as a result, as are a few
monitoring-related bits and pieces.
In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4
was "removed" (log paste below). I haven't found any useful remote way to
try to make that virtual chassis member restart yet. I'm not sure if it's
worth waking anyone up in the middle of the night or anything at this
point. Most likely this is going to involve some physical presence (or
remote hands) at eqiad.
-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------