People from Gerrit's “Analytics” group currently hold
* Push (including Force Push)
* Push Merge Commit
* Forge Author Identity
* Forge Committer Identity
permissions on “analytics/*” projects in Gerrit. But those permissions
have gotten, and continue to get, in the way one way or another.
Do we need those permissions for our repos?
If no one objects, I'll start removing them on 2014-04-28.
---- quelltextlich e.U. ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Gruendbergstrasze 65a        Email: christian(a)quelltextlich.at
4040 Linz, Austria           Phone: +43 732 / 26 95 63
                             Fax: +43 732 / 26 95 63
As you've probably heard, last week we deployed ulsfo in production,
reducing latency for Oceania, East/Southeast Asia, and the US/Canada
Pacific/west-coast states. My estimate of the user base affected by
this is 360 million users (as in, Internet users, not Wikipedia users).
I was wondering if you have an easy way to measure and plot the impact
in page load time, perhaps using Navigation Timing data?
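For what it's worth, a rough sketch of how this could be pulled from the log
DB with SQLAlchemy is below. The table and column names (the schema revision
suffix, event_responseStart, a per-country column) are assumptions for
illustration, not necessarily the actual NavigationTiming schema:

    # Sketch only: table/column names are assumptions, not the real schema.
    import sqlalchemy

    db = sqlalchemy.create_engine(
        'mysql://research:PASSWORD@analytics-store.eqiad.wmnet/log')

    query = """
        SELECT LEFT(timestamp, 8) AS day,
               AVG(event_responseStart) AS mean_response_start
        FROM NavigationTiming_XXXXX   -- hypothetical schema revision
        WHERE event_originCountry IN ('AU', 'NZ', 'JP', 'KR')  -- hypothetical column
        GROUP BY day
        ORDER BY day
    """
    for day, mean_rt in db.execute(query):
        print(day, mean_rt)

Plotting mean (or better, median) response start per day around the rollout
dates should make any latency drop visible.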
The operations team has spent a considerable amount of time and money to
deploy ulsfo and I believe it'd be useful for us and the organization at
large to be able to quantify this effort.
The exact dates of the rollout by country/region codes can be found in
operations/dns' git history:
(the commits should be self-explanatory, but I'd be happy to clarify if
anything is unclear).
Thanks for the interesting information!
"List of total drama characters" is shown as having approximately 25k revisions. But "List of total drama characters" is a redirect with a single edit, and the end page of the redirect "Total drama" has only 3,270 revisions according go xtools. What happened here?
I’m very excited to share some updates from ops on analytics-store.eqiad.wmnet, aka “the one box to rule them all”.
This box (which you access with the “research” SQL credentials) gives you:
1) read access to replicas of all production DBs consolidated on a single machine
2) read access to all EventLogging data via the log DB
3) read/write access to a shared staging DB that can be used as scratch space for temporary tables (similar to the staging DB on s1-analytics). If you create tables on staging, please prefix them with your shell user id (e.g. dartar_foo).
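For example, a minimal sketch of the staging workflow with SQLAlchemy
(credentials and column definitions are placeholders):

    import sqlalchemy

    db = sqlalchemy.create_engine(
        'mysql://research:PASSWORD@analytics-store.eqiad.wmnet/staging')

    # Scratch table prefixed with the owner's shell user id, per the
    # convention above.
    db.execute("""
        CREATE TABLE IF NOT EXISTS dartar_foo (
            wiki VARBINARY(32),
            n    BIGINT
        )
    """)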
This is some of the best news I've gotten from ops since I joined the WMF, and it will make my work way easier – thanks Sean and anybody else who helped make this happen.
Ops is also working on a solution to consolidate all credentials for analytics databases in a single place, via the creation of a “researcher” user group. I'll send a note to the list when this is completed.
(analytics-store.eqiad.wmnet is a CNAME for dbstore1002.eqiad.wmnet.)
I'd like to hear from stakeholders about purging old data from the
eventlogging database. Yes, no, why [not], etc.
I understand from Ori that there is a 90 day retention policy, and that
purging has been discussed previously but not addressed for various
reasons. Certainly there are many rows with timestamps older than 90 days
still in the db, apparently largely untouched by queries.
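For concreteness, enforcing a 90-day window on one table could look roughly
like the sketch below, run with a write-capable account. The table name is
hypothetical; EventLogging tables keep MediaWiki-format timestamps, and
deleting in batches avoids one huge transaction:

    import datetime
    import sqlalchemy

    db = sqlalchemy.create_engine(
        'mysql://SOME_RW_USER:PASSWORD@db1047.eqiad.wmnet/log')

    # Cutoff in MediaWiki timestamp format (YYYYMMDDHHMMSS).
    cutoff = (datetime.datetime.utcnow()
              - datetime.timedelta(days=90)).strftime('%Y%m%d%H%M%S')

    # Delete in bounded batches so replication and other clients keep up.
    while True:
        result = db.execute(
            "DELETE FROM SomeSchema_1234567 WHERE timestamp < %s LIMIT 10000",
            (cutoff,))
        if result.rowcount < 10000:
            break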
Perhaps we're in a better position now to do this properly, given that the
data now lives in multiple places: log files, database, Hadoop...
Can we please purge stuff? :-)
DBA @ WMF
At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming
connections. At some point during the subsequent hour, MariaDB had either
crashed or been manually restarted. Sean noticed that the database was
choking on some queries from the researchers and notified the wmfresearch
list.
While the database server was down or rejecting connections,
the EventLogging writer that writes to db1047 was repeatedly failing to
connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect
to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up
to a certain threshold of failures. Because the writer repeatedly failed to
connect, it hit the threshold, and was not re-spawned.
This triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs
on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs:
This alert was not responded to. I finally got pinged by Tillman, who
noticed the blog visitor stats report was blank, and by Gilles, who noticed
image loading performance data was missing.
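One way to make the writer more tolerant of this failure mode would be a
bounded retry with backoff inside the writer itself, before Upstart's respawn
limit ever comes into play. A sketch, not the current writer code:

    import time
    import sqlalchemy
    from sqlalchemy.exc import OperationalError

    def connect_with_backoff(url, attempts=8, base_delay=1):
        """Return an engine, sleeping progressively longer between
        failed connection attempts instead of dying immediately."""
        for attempt in range(attempts):
            try:
                engine = sqlalchemy.create_engine(url)
                engine.connect().close()  # force a real connection attempt
                return engine
            except OperationalError:
                time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError('giving up after %d attempts' % attempts)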
We have to fix this. The level of maintenance that EventLogging gets is not
proportional to its usage across the organization. Analytics, I really need
you to step up your involvement.
It was not long ago that EventLogging was running reliably for months at a
time. What has changed is not system load, but the owner seat becoming
vacant, leading to a gradual deterioration of the quality of monitoring and
maintenance.
Sean proposed moving the EventLogging database to m2, so that it runs on
separate hardware from the research databases. I think he's right. I filed <
https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the move.
There is some rot in the Ganglia and Graphite monitoring code for
EventLogging. I don't think it would take much to fix. Could the Analytics
team take this on?
The Puppet code is well-documented. <
https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating,
but it is mostly current.
Finally, I think EventLogging Icinga alerts should have a higher profile,
and possibly page someone. Issues can usually be debugged using the
eventloggingctl tool on Vanadium and by inspecting the log files.
Before someone emails me about this... :-)
s1-analytics-slave eventlogging replication is starting to lag again
(enwiki replication is ok).
I noticed that new eventlogging tables are using InnoDB instead of TokuDB
on that slave. The issue is being fixed and we should be back up to speed
within the day.
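For anyone who wants to see which tables are affected, information_schema
makes them easy to spot (a sketch, using the research credentials):

    import sqlalchemy

    db = sqlalchemy.create_engine(
        'mysql://research:PASSWORD@s1-analytics-slave/information_schema')

    # List eventlogging tables on this slave that are not on TokuDB.
    query = """
        SELECT table_name, engine
        FROM tables
        WHERE table_schema = 'log'
          AND engine <> 'TokuDB'
    """
    for name, storage_engine in db.execute(query):
        print(name, storage_engine)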