Hi,
In the week from 2014-09-22 to 2014-09-28, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Accessing HDFS through plain file system
* Bringing Webstatscollector to Hive
* Presentation of Sqoop
* Using kafkatee to generate TSVs
* TSV generation through Hive
* Packetloss alerts (Bug 71116)
(details below)
Have fun, Christian
* Accessing HDFS through plain file system
As a by-product of preparing to get cluster-generated datasets to the webservers, HDFS got mounted (read-only) on stat1002 into the plain file system at /mnt/hdfs.
So you can now, for example, access the HDFS data files directly from
/mnt/hdfs/wmf/data
on stat1002.
Also, you no longer need to set up ssh tunnels and the like to get to your logs. You can now just look at them under
/mnt/hdfs/var/log/hadoop-yarn/apps/
as plain files, and grep, tail, ... them.
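For instance, something like the following works on stat1002 (the exact log layout under apps/ may differ; the snippet falls back to a notice on hosts without the mount):

```shell
#!/bin/sh
# Browse HDFS through the fuse mount on stat1002.
HDFS_MOUNT=/mnt/hdfs

if [ -d "$HDFS_MOUNT/wmf/data" ]; then
    # Datasets are ordinary directories and files under the mount
    ls "$HDFS_MOUNT/wmf/data"
    # Yarn application logs are plain files too
    ls "$HDFS_MOUNT/var/log/hadoop-yarn/apps/"
else
    echo "HDFS fuse mount not present on this host"
fi
```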
* Bringing Webstatscollector to Hive
The webstatscollector reimplementation in Hive got merged and has been producing data since 2014-09-23. This implementation
** is no longer subject to the continuous packet loss on udp2log [1],
** can rerun jobs if needed, and
** contains pagecounts for all sites.
While researchers could already use the data on stat1002, legal sign-off for publishing it to the public was still missing. (We got it in the meantime, so publishing is imminent. But that will be reported in the next weekly email.)
* Presentation of Sqoop
More research was done on how to get MediaWiki databases into Hadoop, and Sqoop is the tool of choice at this point in time. Sqoop's current capabilities, and how one can use it to import data into Hadoop, were demoed and discussed with researchers.
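To give a flavor of what such an import looks like: the host, credentials, and paths below are illustrative stand-ins, not our actual setup (the snippet only echoes on hosts without Sqoop).

```shell
#!/bin/sh
# Hypothetical Sqoop import of one MediaWiki table into HDFS.
TABLE=revision
TARGET_DIR=/wmf/data/raw/mediawiki/enwiki/revision

if command -v sqoop >/dev/null 2>&1; then
    # Pull the table from MySQL into HDFS with 4 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbproxy.example.org/enwiki \
        --username research \
        --password-file /user/research/.sqoop-password \
        --table "$TABLE" \
        --target-dir "$TARGET_DIR" \
        --num-mappers 4
else
    echo "sqoop is not installed on this host; would import $TABLE to $TARGET_DIR"
fi
```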
* Using kafkatee to generate TSVs
Discussions around kafkatee are still going on. But there is no solution yet.
* TSV generation through Hive
Since the kafkatee issues are not yet resolved, we followed up on the previous week's initial screening by doing a more thorough check on the feasibility of generating the TSVs through Hive instead of kafkatee.
We can not only cover the immediately needed TSVs for our researchers, but also the glam and fundraising TSVs. So we could do without kafkatee entirely.
Looking more closely at the implementation blockers, only geocoding remains, and we can overcome it with a little Java coding.
We gave it a shot with the sampled-1000, mobile-sampled-100, and zero TSVs, vetted the Hive-produced data, and it worked smoothly.
So the way forward is solid and paved, in case the kafkatee issues cannot be resolved soonish.
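As a rough sketch of the approach (table and column names are hypothetical stand-ins for the actual webrequest schema, and the real sampling keys on the request sequence number rather than rand()):

```shell
#!/bin/sh
# Produce a sampled TSV from Hive instead of kafkatee.
OUT=sampled-1000.tsv
QUERY="SELECT ip, dt, uri_host, uri_path, http_status
FROM webrequest
WHERE year = 2014 AND month = 9 AND day = 28
AND rand() < 0.001"

if command -v hive >/dev/null 2>&1; then
    # hive -e prints result columns tab-separated, i.e. already a TSV
    hive -e "$QUERY" > "$OUT"
else
    echo "hive not available on this host; would write $OUT"
fi
```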
* Packetloss alerts (Bug 71116)
On 2014-09-20 there were alerts around packet loss on udp2log. But they turned out to be an artifact of a ULSFO outage [2].
[1] This is the first dataset that's coming out of the cluster, and the cluster still has minor hiccups from time to time. But data quality and reliability are already several orders of magnitude better than udp2log has ever been. So it's already the preferred source for datasets.
[2] https://lists.wikimedia.org/mailman/private/ops/2014-September/040429.html