Hi,
In the week from 2014-09-22 to 2014-09-28, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Accessing HDFS through plain file system
* Bringing Webstatscollector to Hive
* Presentation of Sqoop
* Using kafkatee to generate TSVs
* TSV generation through Hive
* Packetloss alerts (Bug 71116)
(details below)
Have fun, Christian
* Accessing HDFS through plain file system
As a by-product of preparing to get cluster-generated datasets to the webservers, HDFS got mounted (read-only) on stat1002 into the plain file system at /mnt/hdfs.
So you can now, for example, access the HDFS data files directly from
/mnt/hdfs/wmf/data
on stat1002.
Also, you no longer need to set up ssh tunnels and the like to get to your logs. You can now just look at them under
/mnt/hdfs/var/log/hadoop-yarn/apps/
as plain files, and grep, tail, ... them.
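For instance, something like the following works on stat1002 (the exact log layout under apps/ may differ; the snippet falls back to a notice on hosts without the mount):

```shell
#!/bin/sh
# Browse HDFS through the fuse mount on stat1002.
HDFS_MOUNT=/mnt/hdfs

if [ -d "$HDFS_MOUNT/wmf/data" ]; then
    # Datasets are ordinary directories and files under the mount
    ls "$HDFS_MOUNT/wmf/data"
    # Yarn application logs are plain files too
    ls "$HDFS_MOUNT/var/log/hadoop-yarn/apps/"
else
    echo "HDFS fuse mount not present on this host"
fi
```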
* Bringing Webstatscollector to Hive
The webstatscollector reimplementation in Hive got merged and has been producing data since 2014-09-23. This implementation
** is no longer subject to the continuous packet loss on udp2log [1],
** can rerun jobs if needed, and
** contains pagecounts for all sites.
While researchers could already use the data on stat1002, legal sign-off for publishing it to the public was still missing. (We got it in the meantime, so publishing is imminent. But that will be reported in the next weekly email.)
* Presentation of Sqoop
More research was done on how to get MediaWiki databases into Hadoop, and Sqoop is the tool of choice at this point in time. Sqoop's current capabilities, and how one can use it to import data into Hadoop, were demoed and discussed with researchers.
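To give a flavor of what such an import looks like: the host, credentials, and paths below are illustrative stand-ins, not our actual setup (the snippet only echoes on hosts without Sqoop).

```shell
#!/bin/sh
# Hypothetical Sqoop import of one MediaWiki table into HDFS.
TABLE=revision
TARGET_DIR=/wmf/data/raw/mediawiki/enwiki/revision

if command -v sqoop >/dev/null 2>&1; then
    # Pull the table from MySQL into HDFS with 4 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbproxy.example.org/enwiki \
        --username research \
        --password-file /user/research/.sqoop-password \
        --table "$TABLE" \
        --target-dir "$TARGET_DIR" \
        --num-mappers 4
else
    echo "sqoop is not installed on this host; would import $TABLE to $TARGET_DIR"
fi
```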
* Using kafkatee to generate TSVs
Discussions around kafkatee are still going on. But there is no solution yet.
* TSV generation through Hive
Since the kafkatee issues are not yet resolved, we followed up on the previous week's initial screening by doing a more thorough check on the feasibility of generating the TSVs through Hive instead of kafkatee.
We can not only cover the immediately needed TSVs for our researchers, but also the glam and fundraising TSVs. So we could do without kafkatee entirely.
Looking more closely at the implementation blockers, only geocoding remains, and we can overcome it with a little Java coding.
We gave it a shot with the sampled-1000, mobile-sampled-100, and zero TSVs, vetted the Hive-produced data, and it worked smoothly.
So the way forward is solid and paved, in case the kafkatee issues cannot be resolved soonish.
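As a rough sketch of the approach (table and column names are hypothetical stand-ins for the actual webrequest schema, and the real sampling keys on the request sequence number rather than rand()):

```shell
#!/bin/sh
# Produce a sampled TSV from Hive instead of kafkatee.
OUT=sampled-1000.tsv
QUERY="SELECT ip, dt, uri_host, uri_path, http_status
FROM webrequest
WHERE year = 2014 AND month = 9 AND day = 28
AND rand() < 0.001"

if command -v hive >/dev/null 2>&1; then
    # hive -e prints result columns tab-separated, i.e. already a TSV
    hive -e "$QUERY" > "$OUT"
else
    echo "hive not available on this host; would write $OUT"
fi
```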
* Packetloss alerts (Bug 71116)
On 2014-09-20 there were alerts around packet loss on udp2log. But they turned out to be an artifact of a ULSFO outage [2].
[1] This is the first dataset that's coming out of the cluster, and the cluster still has minor hiccups from time to time. But data quality and reliability are already several orders of magnitude better than udp2log has ever been. So it's already the preferred source for datasets.
[2] https://lists.wikimedia.org/mailman/private/ops/2014-September/040429.html