Hi,
in the week of 2014-09-15 to 2014-09-21, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Using kafkatee to generate TSVs
* Bringing Webstatscollector to Hive
* TSV generation through Hive
* Logstash demo
* Reorganizing Wikimetrics mounts
* Stream to Universities
* Analytics1021 issues not an artifact of kafka consumers
* X-Analytics php tag missing/wrong for some requests (Bug 70463)
(details below)
Have fun, Christian
* Using kafkatee to generate TSVs
To meet the overall plan of ceasing to rely on udp2log for Analytics tasks, we wanted to use kafkatee as a drop-in replacement for udp2log. While initial tests were positive, kafkatee did not run smoothly when we tried to use it in production: for example, it dropped some partitions and did not update offset files. Both are blockers for its use.
We're in contact with the kafkatee developer and are producing the logs he needs to debug it, but the issues have not yet been resolved.
* Bringing Webstatscollector to Hive
We produced a first running Hive/Oozie implementation of webstatscollector. The code still needs polishing, but it's working. Once in production, this code will be the first real-world use of the cluster.
* TSV generation through Hive
Since kafkatee showed some severe issues for us (see above), we discussed a plan B for moving off udp2log. After initial checks, it seems that generating the TSVs through Hive could work out. It would come with some nice benefits (like being able to regenerate files, or better control over when which data flows into them), but also some real downsides (like new filters requiring implementation instead of configuration, and no longer being able to use the existing tooling around udp2log, such as udp-filters for geolocation).
So we're still aiming to use kafkatee. But if that does not work out, there are no immediate blockers for a Hive-based move away from udp2log.
* Logstash demo
To raise visibility of Logstash and its usefulness around Hive and Hadoop, there was a demo session that showed the basic workflows.
* Reorganizing Wikimetrics mounts
Wikimetrics ran out of database disk space on the labs instances, so more space was allocated and the contents of the instances have been reshuffled a bit to make better use of the available disk space.
* Stream to Universities
For some years, parts of the udp2log multicast have been streamed to universities for research purposes. Those streams caused pain on many levels, and this week the last of those legacy streams could finally be turned off.
* Analytics1021 issues not an artifact of kafka consumers
Around analytics1021, progress has been slow, as the issue only occurs sporadically. But kafka consumers have been ruled out as the culprit for dropping messages, since the missing lines turned out to be already missing in kafka itself.
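For readers who want to see what such a check can look like, here is a minimal Python sketch. It assumes (this is not stated above) that every log line carries the producing host's name and a per-host sequence number that increases by one per line; the function name, the record layout, and the host name "cp-example" are made up for illustration. Running the same check on data read straight from kafka and on a consumer's output shows on which side the lines went missing.

```python
from collections import defaultdict


def missing_sequences(records):
    """Return {hostname: sorted missing sequence numbers}.

    `records` is an iterable of (hostname, sequence_number) pairs.
    Assumption: sequence numbers increase by one per line and per host,
    so any hole between the smallest and largest number seen is a lost line.
    """
    seen = defaultdict(set)
    for host, seq in records:
        seen[host].add(seq)

    gaps = {}
    for host, seqs in seen.items():
        expected = set(range(min(seqs), max(seqs) + 1))
        missing = sorted(expected - seqs)
        if missing:
            gaps[host] = missing
    return gaps


# Hypothetical data: the same hole shows up in kafka and in the consumer
# output, so the consumer did not drop anything.
in_kafka = [("cp-example", 1), ("cp-example", 2), ("cp-example", 4)]
consumed = [("cp-example", 1), ("cp-example", 2), ("cp-example", 4)]
consumer_is_culprit = missing_sequences(consumed) != missing_sequences(in_kafka)
```

If the consumer output had holes that the kafka data lacks, `consumer_is_culprit` would be True; identical gaps on both sides point upstream of the consumers.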
* X-Analytics php tag missing/wrong for some requests (Bug 70463)
The php={zend,hhvm} tagging happened twice for bits. Ops fixed the double tagging, but now some requests don't see a tag at all. While this is expected for some cases, Ops assume that some HHVM requests come with php=zend tags. They are working on it.
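To make the different cases concrete, here is a small Python sketch that classifies the php tag of a single request. It assumes the X-Analytics value is a semicolon-separated list of key=value pairs; the function names and the status labels are made up for illustration and are not part of any existing tooling.

```python
def parse_x_analytics(header):
    """Split an X-Analytics value into a list of (key, value) pairs.

    Assumption: the value looks like "key1=value1;key2=value2".
    A list (not a dict) is returned so that duplicated keys stay visible.
    Segments without "=" or without a key are skipped.
    """
    pairs = []
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key:
            pairs.append((key, value))
    return pairs


def php_tag_status(header):
    """Return 'zend', 'hhvm', 'missing', 'duplicate', or 'invalid'."""
    values = [value for key, value in parse_x_analytics(header) if key == "php"]
    if not values:
        return "missing"
    if len(values) > 1:
        return "duplicate"
    return values[0] if values[0] in ("zend", "hhvm") else "invalid"
```

Note that this only detects missing, doubled, or malformed tags. Whether a php=zend tag sits on a request that was actually served by HHVM cannot be seen from the header alone and needs information from the backend side.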