People in Gerrit's “Analytics” group currently hold
* Push (including Force Push)
* Push Merge Commit
* Forge Author Identity
* Forge Committer Identity
permissions on “analytics/*” projects in Gerrit. But those permissions
have been, and still are, getting in the way one way or another.
Do we need those permissions for our repos?
If no one objects, I'll start removing them on 2014-04-28.
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
As you've probably heard, last week we deployed ulsfo in production,
reducing latency for Oceania, East/Southeast Asia & US/Canada
Pacific/West Coast states. My estimation of the user base affected by
this is 360 million users (as in, Internet users, not Wikipedia users).
I was wondering if you have an easy way to measure and plot the impact
in page load time, perhaps using Navigation Timing data?
The operations team has spent a considerable amount of time and money to
deploy ulsfo and I believe it'd be useful for us and the organization at
large to be able to quantify this effort.
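One rough way to approach such a measurement is to compare median page load times before and after a rollout date. The sketch below assumes the Navigation Timing samples have already been extracted as (day, load time in milliseconds) pairs for a single region; the function name and the input shape are hypothetical, not part of any existing tool.

```python
from datetime import date
from statistics import median


def median_change(samples, rollout):
    """Compare median load time before and after a rollout date.

    samples: iterable of (day, load_time_ms) pairs (hypothetical input shape).
    rollout: the date on which the region was switched over.
    Returns (median_before, median_after, relative_change).
    """
    samples = list(samples)
    before = [ms for day, ms in samples if day < rollout]
    after = [ms for day, ms in samples if day >= rollout]
    if not before or not after:
        raise ValueError("need samples on both sides of the rollout date")
    m_before = median(before)
    m_after = median(after)
    # Negative relative change means pages load faster after the rollout.
    return m_before, m_after, (m_after - m_before) / m_before
```

Medians are used rather than means because page load times tend to have a long tail of slow outliers. A real analysis would also need to control for other changes deployed in the same period.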
The exact dates of the rollout by country/region codes can be found in
operations/dns' git history:
(the commits should be self-explanatory, but I'd be happy to clarify if
At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming
connections. At some point during the subsequent hour, MariaDB had either
crashed or been manually restarted. Sean noticed that the database was
choking on some queries from the researchers and notified the wmfresearch
During the time that the database server was out or rejecting connections,
the EventLogging writer that writes to db1047 was repeatedly failing to
connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect
to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up
to a certain threshold of failures. Because the writer repeatedly failed to
connect, it hit the threshold, and was not re-spawned.
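To illustrate why a burst of connection failures leaves the job stopped, here is a minimal simulation of a respawn threshold. It is a sketch of the general policy (at most `limit` respawns within a sliding window of `interval` seconds), not Upstart's actual implementation, and the class name and parameter values are hypothetical.

```python
from collections import deque


class RespawnLimiter:
    """Allow at most `limit` respawns within any `interval`-second window."""

    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self._times = deque()  # timestamps of recent respawns

    def should_respawn(self, now):
        # Forget respawns that have fallen out of the window.
        while self._times and now - self._times[0] >= self.interval:
            self._times.popleft()
        if len(self._times) >= self.limit:
            return False  # threshold hit: leave the job stopped
        self._times.append(now)
        return True


def simulate(failure_times, limit, interval):
    """Return (respawn_count, time_the_job_was_left_stopped_or_None)."""
    limiter = RespawnLimiter(limit, interval)
    respawns = 0
    for t in failure_times:
        if not limiter.should_respawn(t):
            return respawns, t
        respawns += 1
    return respawns, None
```

With failures spaced far apart the job keeps coming back, but a writer that fails immediately on every start exhausts the allowance within seconds and stays down until someone restarts it by hand.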
This triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs
on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs:
This alert was not responded to. I finally got pinged by Tilman, who
noticed the blog visitor stats report was blank, and by Gilles, who noticed
image loading performance data was missing.
We have to fix this. The level of maintenance that EventLogging gets is not
proportional to its usage across the organization. Analytics, I really need
you to step up your involvement.
It was not long ago that EventLogging was running reliably for months at a
time. What has changed is not system load, but the owner seat becoming
vacant, leading to a gradual deterioration of the quality of monitoring and
Sean proposed moving the EventLogging database to m2, so that it runs on
separate hardware from the research databases. I think he's right. I filed <
https://rt.wikimedia.org/Ticket/Display.html?id=7081> to request the
There is some code rot around the Ganglia and Graphite monitoring code for
EventLogging. I don't think it would take much to fix. Could the Analytics
team take this on?
The Puppet code is well-documented. <
https://wikitech.wikimedia.org/wiki/EventLogging> could use some updating,
but it is mostly current.
Finally, I think EventLogging Icinga alerts should have a higher profile,
and possibly page someone. Issues can usually be debugged using the
eventloggingctl tool on Vanadium and by inspecting the log files on
The speed bumps from the eventlogging migration are almost ironed out:
1. db1048 has had the eventlogging uuid fields made formally UNIQUE KEY. I
gather Ori will now run some validation against logs to check for remaining
2. db1046 which died mid-migration has been restored and is catching up.
This doesn't really affect Analytics except that it's to be part of
db1047's replication chain for eventlogging.
3. db1047 is finishing up reloading log data and removing the CONNECT
federated tables involved in bug 64445.
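The kind of check mentioned in item 1 might be sketched as follows. What exactly is to be validated is cut off in the text above, so this assumes two plausible checks: UUIDs that occur more than once in the logs, and UUIDs present in the logs but absent from the database. Both function names and the input shape (plain iterables of UUID strings) are hypothetical.

```python
from collections import Counter


def find_duplicate_uuids(uuids):
    """Return {uuid: count} for every UUID seen more than once."""
    counts = Counter(uuids)
    return {u: n for u, n in counts.items() if n > 1}


def missing_from_db(log_uuids, db_uuids):
    """Return the set of UUIDs that appear in the logs but not in the database."""
    return set(log_uuids) - set(db_uuids)
```

Both helpers hold everything in memory, which is fine for a spot check on a day of logs but would need batching for a full history.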
As something of a consolation prize, "analytics-store.eqiad.wmnet" is now
open for SELECT queries from the 'research' user. This box:
- Is a CNAME for dbstore1002.eqiad.wmnet.
- Replicates all wikis in one place.
- Can be hammered. Please feel free.
- Can have scratch space for temporary writes (but doesn't yet).
- Can replicate eventlogging too (but doesn't yet).
If anyone has some suitable read-only reports to try out, I would
appreciate it if you could run them and report back.
DBA @ WMF
We are changing EventLogging to write events to m2 instead of db1047. The
migration will take up to 12 hours (but probably less). Also, we may end
up with gaps in the data written to the database throughout this period.
We will reply to this thread once the migration is complete.
Samuel Klein, 30/04/2014 05:35:
> Asking 1/1000 users of tool X a single open-ended question ("please
> give us feedback on X" or "how is X working for you"?) can be a handy
> way to encourage brief input from a cross-section of users,
We do have such a feature in MediaWiki, though: mediawiki.feedback.js.
It's just a JavaScript popup which saves the comment to a page on the wiki.
> many of
> which would not otherwise comment at all. And for some tools (such as
> UploadWizard) there is no obvious place to leave comments, and opening
> Bugzilla is a new-tab + multi-step process away.
UploadWizard (like VisualEditor) uses the above. Maybe it needs an
option to be offered more prominently under some conditions?
This is probably the most viable option here, almost no technical effort
and more value in output.
Tilman Bayer, 29/04/2014 21:58:
> might be worth revisiting
> LimeSurvey, which appears to have undergone a complete rewrite since
> that installation was removed from WMF servers for security concerns
> around 2011.
+1. It will need to be done anyway, at some point, e.g. if a general
editor survey is tried again.
> support than other solutions [...]
> lack of integrated language support in
> Surveymonkey, or just because the focus was on per-project results
Agreed on all the rest, except for this point specifically. It seems
SurveyMonkey is really out of the question. However, how many languages does
LimeSurvey says 50; it is translated on a public instance of GlotPress.
GlotPress is from Automattic and is used to produce some WordPress locales,
hence some translatewiki.net people have experience with it. However, I wasn't
able to gather much information about it; I only know that it's yet
another web tool for the .po format. Maybe Stu can put us in contact with
someone with more insight (especially on how much it's used and how
much of a priority it is for Automattic)?
Some of the graphs on the report card are not rendering due to
what seems like some sort of EventLogging outage
When I query the database I get errors such as "MySQL said: Table
'MobileWebEditing_7675117' is marked as crashed and should be
Let's keep an eye on this...