Hi,
in the week from 2014-09-01 to 2014-09-07, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Investigating ways to allow queries across MediaWiki and Hadoop databases
* Deployment of webstatscollector's ulsfo https fix
* Re-run reports due to slave lag
* X-Analytics tag for used PHP engine
* Digging deeper into analytics1021 issues
(details below)
Have fun,
Christian
* Investigating ways to allow queries across MediaWiki and Hadoop databases
Currently, data in Hadoop is fully separated from our wikis' databases, which makes it hard to query across the two kinds of databases and hence makes researchers' lives harder. Of the available solutions to overcome this issue, Sqoop seems like a suitable approach. Sqoop can import data from MediaWiki databases into HDFS, where it can then be queried from within Hadoop. We looked at how Sqoop imports work and started discussions with researchers on which imports would be useful and which would not.
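To make the import idea more concrete, here is a minimal sketch of how a single-table Sqoop import could be assembled. It only builds the command line and does not run anything. The JDBC URL, user name, password file, and target directory are hypothetical placeholders, not our actual setup; only the `page` table name is a standard MediaWiki table.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username, password_file, mappers=1):
    """Return the argument list for a `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--table", table,                  # table to copy
        "--target-dir", target_dir,        # HDFS directory that receives the data
        "--username", username,
        "--password-file", password_file,  # file holding the password, not the password itself
        "--num-mappers", str(mappers),     # number of parallel import tasks
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders for illustration only.
    cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://db.example.org/examplewiki",
        table="page",
        target_dir="/tmp/sqoop-example/examplewiki/page",
        username="research",
        password_file="/user/example/.sqoop-password",
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Which tables are worth importing, and how often, is exactly what the discussions with researchers are meant to settle.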
* Deployment of webstatscollector's ulsfo https fix
The fix that stops webstatscollector from counting ulsfo https requests twice got deployed.
* Re-run reports due to slave lag
The announced schema changes caused more slave lag than some reports could cope with, so we had to re-run a few reports by hand to make up for it.
* X-Analytics tag for used PHP engine
Ops added a “php” tag to the X-Analytics header. This tag makes it possible to identify which PHP implementation was used to serve a request.
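As a rough illustration of how the new tag could be read on the consuming side, the sketch below parses an X-Analytics value, assuming the header carries semicolon-separated `key=value` pairs. The function names and the sample values (`hhvm`, `other=1`) are made up for the example and are not taken from production logs.

```python
def parse_x_analytics(header_value):
    """Parse a 'k1=v1;k2=v2' header value into a dict, skipping malformed pairs."""
    tags = {}
    for pair in header_value.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key:  # keep only pairs that have both a key and an '='
            tags[key] = value
    return tags


def php_engine(header_value):
    """Return the value of the 'php' tag, or None if the tag is absent."""
    return parse_x_analytics(header_value).get("php")


if __name__ == "__main__":
    # Hypothetical header value for illustration only.
    print(php_engine("php=hhvm;other=1"))
```

Requests without the tag yield `None`, so they can be counted separately instead of being misattributed to one engine.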
* Digging deeper into analytics1021 issues
Despite the recent buffer increases, analytics1021 still fails from time to time to act as a proper partition leader. Since the failure cannot be reproduced manually, debugging is tricky ... and time-consuming. We added some more monitoring and waited for the issue to re-appear. It seems that from time to time, bursts of disk writes free up lots of memory on analytics1021. During these write-out phases, the processes on analytics1021 get starved. If the starvation takes too long, analytics1021 gets (correctly) kicked out of the partition leader role. We now need to find the source of those write bursts, to see whether they are the real issue or just a symptom of a different one.
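One way to spot such write-out phases after the fact is sketched below. It is not the monitoring we actually added. It assumes the bursts show up as sharp drops of the `Dirty` value in periodic `/proc/meminfo` snapshots; the function names, the sample numbers, and the threshold are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # kB for most entries
    return values


def find_write_bursts(samples, min_drop_kb):
    """Return (timestamp, drop_kb) for each sample where Dirty fell sharply.

    `samples` is a time-ordered list of (timestamp, dirty_kb) pairs. A burst
    is reported when Dirty dropped by at least `min_drop_kb` compared with
    the previous sample.
    """
    bursts = []
    for (_, previous), (timestamp, current) in zip(samples, samples[1:]):
        drop = previous - current
        if drop >= min_drop_kb:
            bursts.append((timestamp, drop))
    return bursts


if __name__ == "__main__":
    # Hypothetical snapshot and samples for illustration only.
    snapshot = "MemTotal:  1000000 kB\nDirty:       42000 kB\nWriteback:       0 kB\n"
    print(parse_meminfo(snapshot)["Dirty"])
    print(find_write_bursts([(0, 1000), (1, 1200), (2, 100), (3, 90)], min_drop_kb=500))
```

Lining up the reported timestamps with the times analytics1021 lost the partition leader role would show whether the two really coincide.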