Hi,
in the week from 2014-12-08 to 2014-12-14, Andrew and I worked on the
following items around the Analytics Cluster and Analytics related
Ops:
* Stat1001 behind misc-web
* Compression analysis for storing xmldumps in cluster
* EventLogging replication lag
(details below)
Have fun,
Christian
* Stat1001 behind misc-web
stat1001 (which handles stats.wikimedia.org, and
datasets.wikimedia.org) got moved behind misc-web. This makes stat1001
use the WMF standard SSL setup, and removes certificate issues (like
T74805 [1]).
So URLs like
https://datasets.wikimedia.org/public-datasets/
(note the s in https) should finally work without warnings/errors.
* Compression analysis for making xmldumps available in cluster
More research around making xmldumps available in the cluster has been
done. The numbers can be found on
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/xmldumps#Results
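As a rough illustration of how such a comparison can be run, here is a
minimal sketch. The codecs (gzip, bz2, lzma) and the synthetic sample are
assumptions for illustration only; they are not necessarily what was
measured for the page above, and the function name is hypothetical.

```python
import bz2
import gzip
import lzma


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size, per codec."""
    if not data:
        raise ValueError("need non-empty input")
    # Assumed set of codecs; swap in whatever the cluster actually supports.
    codecs = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(fn(data)) / len(data) for name, fn in codecs.items()}


if __name__ == "__main__":
    # Synthetic stand-in for a chunk of an XML dump; not real dump data.
    sample = b"<page><title>Example</title><ns>0</ns></page>\n" * 10000
    for name, ratio in sorted(compression_ratios(sample).items()):
        print(f"{name}: {ratio:.4f}")
```

A smaller ratio means better compression. Real dumps are far less
repetitive than this sample, so only numbers from real data are meaningful.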
* EventLogging replication lag
EventLogging replication got stuck, but only for some tables. This was
caused by a combination of EventLogging being liberal in what characters
are allowed in table names and the replication being very defensive.
Sean made the blocked replication behave again (thanks!), and
replication caught up. Restrictions on table naming got set up and are
still getting tuned a bit.
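For illustration, a restriction on table names could look roughly like the
sketch below. The exact rule is an assumption (ASCII letters, digits and
underscores, starting with a letter, at most 64 characters, which is
MySQL's identifier length limit); it is not the restriction that was
actually deployed, and the names used are made up.

```python
import re

# Assumed rule, not the deployed one: a letter followed by up to 63
# letters, digits or underscores (64 characters in total).
SAFE_TABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,63}")


def is_safe_table_name(name: str) -> bool:
    """Return True if the whole name matches the conservative pattern."""
    return SAFE_TABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for candidate in ["ExampleSchema_123", "Example-Schema_123", ""]:
        print(repr(candidate), is_safe_table_name(candidate))
```

Checking names before a table gets created keeps the liberal side
(EventLogging) within what the defensive side (replication) accepts.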
[1] https://phabricator.wikimedia.org/T74805
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
in the week from 2014-12-01 to 2014-12-07, Andrew and I worked on the
following items around the Analytics Cluster and Analytics related
Ops:
* Change in SSL setup causing pagecounts-raw to be off ... temporary
* Preparing for vlan move of stats machines
* Ganglia -> Graphite -> Grafana
* Wikipedia Zero graph comparability
(details below)
Have fun,
Christian
* Change in SSL setup causing pagecounts-raw to be off ... temporary
Ops changed the SSL setup from dedicated SSL terminators to
cache-local SSL terminators for eqiad and esams. This change came a
bit as a surprise to us, and (as expected) made webstatscollector's C
implementation (pagecounts-raw) overcount HTTPS traffic.
We adjusted webstatscollector's C implementation accordingly.
A few weeks back, that would have been the end of the story and we'd just
have been left with a few days of broken data. But we now have the data in
the cluster, and have a Hive implementation too. So we could effectively
backfill pagecounts-raw for the affected days.
To my knowledge, this is the first time we could cover/mitigate an
issue with webstatscollector on the udp2log pipeline through the cluster.
And pagecounts-raw has good data again for the affected period :-)
* Preparing for vlan move of stats machines
To develop infrastructure and research pipelines, devs and researchers
need some more basic development tools (e.g., Maven, Virtualenv)
on stat100[123] that Ops would prefer us not to use in the machines'
current vlan. Hence, preparations started to move stat100[123] into the
separate analytics vlan. This will address the concerns of Ops, while
still allowing the needed tools to be installed.
* Ganglia -> Graphite -> Grafana
Ops is increasingly moving from ganglia to graphite to do checks on
numbers. So work has started on looking into graphite a bit more and
on how to instrument it to perform checks. The cluster got
re-configured so that the key metrics get fed into graphite. For
dashboarding, it seems grafana might give a kibana-like interface. And
http://grafana.wikimedia.org/#/dashboard/db/kafka
got set up to provide a high-level, realtime view of kafka.
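For readers unfamiliar with how metrics get into graphite, here is a
minimal sketch using graphite's plaintext protocol (one line of
"path value timestamp" per metric, sent to the carbon listener, which
defaults to port 2003). The host name and metric path are hypothetical,
and this is not how the cluster is actually wired up.

```python
import socket
import time


def graphite_line(path: str, value: float, timestamp: int) -> bytes:
    """Format one metric as 'path value timestamp' plus a newline."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty without whitespace")
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(host: str, port: int, path: str, value: float) -> None:
    """Send a single metric to a carbon plaintext listener over TCP."""
    line = graphite_line(path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line)


if __name__ == "__main__":
    # Only formats a line; nothing is sent. The metric name is made up.
    print(graphite_line("kafka.example.messages_in", 42, 1418000000))
    # To actually send (hypothetical host):
    # send_metric("graphite.example.org", 2003, "kafka.example.messages_in", 42)
```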
* Wikipedia Zero graph comparability
Following up from the previous week, the Wikipedia Zero team had further
concerns about the differences between their new on-wiki graphs and
the Analytics team's dashboards. We identified and explained the
differences for them.
I am kicking off this thread after a good conversation with Nuria and Kaldari on pain points and opportunities we have around data QA for EventLogging.
Kaldari, Leila and I have gone through several rounds of data QA before and after the deployment of new features on Mobile and we haven’t yet found a good solution to catch data quality issues early enough in the deployment cycle. Data quality issues with EventLogging typically fall under one of these 5 scenarios:
1) events are logged and schema-compliant but don’t capture data correctly (for example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI element is not instrumented)
4) events are missing due to client issues (a specific UI element is not correctly rendered on a given browser/platform and as a result the event is not fired)
5) events are missing due to EventLogging outages
In the early days, Ori and I floated the idea of unit tests for instrumentation to capture constraint violations that are not easily detected via manual testing or the existing client-side validation, but this never happened. When it comes to feature deployments, beta labs is a great starting point for running manual data QA in an environment that is as close as possible to prod. However, there are types of data quality issues that we only discover when collecting data at scale and in the wild (on browsers/platforms that we don’t necessarily test for internally).
Having a full-fledged set of unit tests for data would be terrific, but in the short term I’d like to find a better way to at least identify events that fail validation as early as possible.
- the SQL log database has real-time data but only for events that pass client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only rsync’ed from vanadium once a day
Is there a way to inspect invalid events in near real time without having access to vanadium? For example, could we create either a dedicated database to write invalid events only or a logfile for validation errors rsync’ed to stat1003 more frequently than once a day?
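To make the idea concrete, here is a minimal sketch of pulling out the
events that fail validation from a stream of JSON log lines. Everything in
it is an assumption for illustration: the schema, the field names, and the
one-JSON-object-per-line layout are made up, and the checks are a
simplified stand-in for real schema validation.

```python
import json

# Hypothetical schema: field name -> (expected Python type, required?)
EXAMPLE_SCHEMA = {
    "action": (str, True),
    "userId": (int, True),
    "platform": (str, False),
}


def validation_errors(event: dict, schema: dict) -> list:
    """Return a list of human-readable problems; empty if the event is valid."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in event:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    for field in event:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def invalid_events(lines, schema):
    """Yield (line_number, errors) for every log line that fails validation."""
    for number, line in enumerate(lines, start=1):
        try:
            event = json.loads(line)
        except ValueError:
            yield number, ["not valid JSON"]
            continue
        if not isinstance(event, dict):
            yield number, ["not a JSON object"]
            continue
        errors = validation_errors(event, schema)
        if errors:
            yield number, errors


if __name__ == "__main__":
    sample = ['{"action": "click", "userId": 1}', '{"action": "click"}']
    for number, errors in invalid_events(sample, EXAMPLE_SCHEMA):
        print(number, errors)
```

Run against a logfile that is synced more often than once a day, something
along these lines would surface scenario 2 quickly. Scenarios 1, 3 and 4
would still need tests that know what the instrumentation should produce.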
Thoughts?
Dario
Hey all,
As people probably(?) know, the WMF has replaced Bugzilla with Phabricator
(https://phabricator.wikimedia.org/). This is also taking over from a host
of other services, including RT. Analytics Engineering has already switched
over, as have a lot of teams, but R&D has not - instead, we use Trello (
https://trello.com/b/k5N0ivoM/research-and-data). I think that if we're
going to switch over, we should probably do it reasonably soon (the longer
we wait, the more things we have to port). This thread is to have the
switch-or-not conversation in. I'll start ;p.
I'd like to strongly advocate that we switch to phabricator, for several
reasons. Even were Phabricator less-good than Trello, there's an inherent
advantage in consolidating systems. It means fewer logins to maintain, and
a less-distributed workset. By extension, it means a reduced barrier for
interacting with other teams, or volunteers, and vice versa.
But actually, Phab isn't worse than Trello: it's better. For one thing,
it's better at letting us work with other teams.
We're dependent on Analytics Engineering (on Phabricator), and work with
the VE team (on Phabricator), Fundraising (on Mingle), Mobile (also on
Trello)... the list goes on and on. The Trello model, in which everything
is split out into different boards you may or may not have access to,
combined with the distribution of teams across platforms, makes it a
constant pain to bring people into conversations and work on problems that
are both our problem + AnEng's problem, or our problem + customer's
problem. People need to cross the streams or juggle multiple logins.
With Phabricator, it's a lot easier to see what everyone is doing, keep
abreast of the general gestalt in movement/WMF work, and chip in on tasks
that don't officially belong to your team. And because a lot of teams use
it, the responsiveness from customers when we ask questions is a lot better.
Phab also seems to, at least for me, naturally fit my work process better.
I think of a research project ("find out how long mobile sessions are") as
actually being a series of individual tasks - "find out what a session is",
"work out how to measure it", "measure it". Trello doesn't really have
support for that kind of hierarchical, dependent, chunked work. It has
checklists but they don't allow for any actual data segmentation or detail.
Alternately you can write multiple cards and link them together, but this
is entirely ad-hoc; there's no structure to it, it's not obvious without
reading each card what the relationship is, and you have to do the heavy
lifting yourself.
Phabricator is designed for precisely this model, because that's how
engineering work tends to break down. It's built-in, fully supported, and
extracting the tree is easy.
So those are the reasons I have, off the top of my head. Other reasons?
Counter-arguments? Post em here.
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
I have updated our team's workshop entry at the WMDS. The Analytics
Engineering team wants to lead an EventLogging workshop. If you are
interested in attending, please add your name to the list in this section:
https://www.mediawiki.org/wiki/MediaWiki_Developer_Summit_2015#Setting_up_E…
The more people add their name to the list, the more likely this will
happen!
So, we've had conversations about detecting SSL terminators, for two
reasons:
1. It would allow us to know when, particularly, we should trust
x_forwarded_for fields for geolocation;
2. More importantly, it would allow us to reliably exclude traffic from
internal IP ranges without excluding SSL traffic.
Aaron talked to Ops about this problem (notes at
http://etherpad.wikimedia.org/p/ssl_terminators) - in conversation with
Ori, though, I found out that this approach won't actually work, because
caches are not always SSL terminators.
So: what's the right approach? How do we find these things easily and
automagically?
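To frame the two goals above, here is a minimal sketch of what the
filtering could look like if a trustworthy list of terminator addresses
existed. That list is exactly the open question, so the addresses and
ranges below are placeholders, and the function names are hypothetical.
The sketch also assumes a terminator appends the address it saw to the end
of x_forwarded_for rather than overwriting the field.

```python
import ipaddress

# Placeholder values for illustration only; not real infrastructure.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
SSL_TERMINATORS = {ipaddress.ip_address("10.0.0.10")}


def effective_client_ip(remote_ip: str, x_forwarded_for: str):
    """Return the address to use for geolocation and filtering."""
    ip = ipaddress.ip_address(remote_ip)
    if ip in SSL_TERMINATORS and x_forwarded_for not in ("", "-"):
        # Trust x_forwarded_for only when a known terminator sent the
        # request, and take the last entry (assumed appended by it).
        last = x_forwarded_for.split(",")[-1].strip()
        try:
            return ipaddress.ip_address(last)
        except ValueError:
            return ip  # unparsable header: fall back to the remote address
    return ip


def is_internal_request(remote_ip: str, x_forwarded_for: str) -> bool:
    """True if the effective client address is in an internal range."""
    ip = effective_client_ip(remote_ip, x_forwarded_for)
    return any(ip in network for network in INTERNAL_NETWORKS)


if __name__ == "__main__":
    # SSL traffic via a terminator from an external client: kept.
    print(is_internal_request("10.0.0.10", "203.0.113.7"))
    # Plain internal traffic: excluded.
    print(is_internal_request("10.1.2.3", "-"))
```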
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hi,
On http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm there
are statistics for the numbers of people who use different browsers.
Are there statistics that show separate numbers for people who read
articles and people who edit them?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Hey folks,
I was talking to ottomata today about developing a schema for processing
revisions in Hadoop. We came across a deep problem with field names that
I'd like to discuss because I want people to be aware of the problem.
To explain this, I'll use an example. Let's say you want to get the
namespace of this page:
https://en.wikipedia.org/wiki/Biology
In JavaScript, this is represented as the variable *wgNamespaceNumber*.
In the database, this is represented as *page.page_namespace*.
In the XML database dump, this is represented as the value at *<page><ns>* or
*<namespaces><namespace.key>*, depending on where you are.
Right now, ottomata and I are considering the more descriptive name
*page_namespace_id* since the value of all of these variables/fields is an
identifier -- not a name. I think that this is a *good* name if we
consider it in a vacuum, but if we choose it, we'll add yet another name
for wiki devs & analysts to be aware of.
Given the context of this decision, my instinct is to choose the least
surprising name. Since I mostly work with the database, that would mean
I'd choose *page_namespace*.
This is just one example of such nonsense. The decisions we make in
formats that we produce now can have immeasurable effects on the sanity of
others. I hope that the decisions we make today will minimize such pain,
but it's hard to know for sure.
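To make the cost of multiple names concrete, here is a minimal sketch of
the alias table that consumers end up maintaining when every format picks
its own name. The field names come from the example above; the helper
functions are hypothetical, and picking *page_namespace* as the canonical
name just follows my instinct above rather than any decision.

```python
# Every name seen for the same value, mapped to one canonical name.
NAMESPACE_ALIASES = {
    "wgNamespaceNumber": "page_namespace",    # JavaScript variable
    "page.page_namespace": "page_namespace",  # database column
    "ns": "page_namespace",                   # XML dump, <page><ns>
    "page_namespace_id": "page_namespace",    # the more descriptive candidate
}


def canonical_field(name: str, aliases: dict = NAMESPACE_ALIASES) -> str:
    """Map a source-specific field name to its canonical name."""
    return aliases.get(name, name)


def normalize_record(record: dict, aliases: dict = NAMESPACE_ALIASES) -> dict:
    """Rename aliased fields; fail if two aliases disagree on the value."""
    normalized = {}
    for key, value in record.items():
        canonical = canonical_field(key, aliases)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for {canonical}")
        normalized[canonical] = value
    return normalized


if __name__ == "__main__":
    print(normalize_record({"wgNamespaceNumber": 0, "title": "Biology"}))
```

Each new name we introduce is one more row in a table like this for
everyone downstream.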
-Aaron
Hi everyone,
From some initial tests it appears to me that EventLogging is not
logging events from Linux/Firefox when Adblock is enabled. I'm on Ubuntu
14.04, Firefox 34.0, and Adblock Plus 2.6.6. When I disable Adblock, I see
event.gif?{...} in Console, when I enable it, I don't. Just to make sure,
I've checked the EL tables and my events don't get registered there.
I'd be happy to sit with someone to troubleshoot before 4pm (PST), after
5:30pm (PST) or tomorrow.
Leila