State of Analytics cluster right now - Analytics

25 Dec 2012

Hi folks,

David and Andrew gave me an in depth overview of the current state of
the analytics cluster, so I figured it might be good for me to clean
up the notes from that and send them here, both to check my
understanding of these things, and in hopes that this may be useful.

The official docs for this are here:
https://www.mediawiki.org/wiki/Analytics/Kraken

One prominent item on that page is the System Architecture Overview,
which is out of date:
https://upload.wikimedia.org/wikipedia/mediawiki/3/38/Kraken_flow_diagram.p…

In particular, here are the things that are different:
"Pixel service" - no longer will have bundlers.
"ReqLog service" - no longer will have bundlers.  Currently udp2log.
More on this below.
"Bundle queue [Cassandra]" - Currently a combination of udp2log and Kafka.
"Datawarehouse [Cassandra]" - this is Hadoop now

The long term plan is to use Kafka here.  Short term, udp2log is in
use (piped from udp2log into Kafka in a couple cases)

Here's what the current data pipeline looks like for pageviews going
to Hadoop and out to useful information to analysts, using the
terminology from the Kraken flow diagram (even though it's probably
not appropriate anymore in many cases).  These are each separate hosts
or clusters unless otherwise noted:
1. ReqLog service: Squid/Varnish udp logging emitters
2. Bundle queue: udp2log
3. Bundle queue: Kafka producers (on udp2log host)
4. Bundle queue: Kafka broker
5. Canonicalization topology: Kafka consumer
6. Data Warehouse (ETL phase): Log storage on Hadoop cluster
7. MapReduce Jobs:  Pig scripts (running on Hadoop cluster)
8. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)

Routing of Kafka producers to Kafka brokers is done via Zookeeper.

The glorious future looks something like:
1. ReqLog service: Squid/Varnish Kafka producers
2. Bundle queue: Kafka broker
3. Canonicalization topology: Multiple Kafka consumers/Storm Spouts
(on Storm cluster)
4. Canonicalization topology: Storm bolts for
anonymization/geotagging, etc (on Storm cluster)
5. Data Warehouse (ETL phase): Log storage on Hadoop cluster
6. MapReduce Jobs:  Pig scripts (running on Hadoop cluster)
7. Data Warehouse (Processing): Processed data, usually as CSVs
(stored on Hadoop cluster)

More detail on the pipeline:

ReqLog service:
Currently udp2log - 1:100 sampling for most things, plus finer
granularity for some things. We eventually need to decide if we're
going to put Kafka providers directly on the Varnish hosts in
production, or if we're going to continue to emit the logs via udp to
a udp2log collector.

Pixel service:
We didn't get into talking about this part of the architecture.
Rather than restating my jumbled recollection of what's going on here
(randomly throwing around terms like "ClickTracking extension",
"EventLogging extension", "varnishncsa"), I'll ask that someone
else
venture a correct assessment of current/future state of this.

Bundle queue:
Kafka producers produce from udp2log for now (one per stream).  We
currently have 2 Kafka brokers, managed via Zookeeper.  Raw logs are
persisted for one week sliding window, straight to disk.  Not
sanitized, so this host will be locked down for the forseable future.
If we weren't sampling, we would typically get about 16.8 terabytes in
one week's sliding window of uncompressed data. Snappy compression
(not implemented yet) should reduce size by roughly 2/3.

Apache Zookeeper (not in the diagram) - Zookeeper provides
application-level routing of data, keeping track of unavailable nodes,
and signaling to senders to use different hosts when necessary.

We have 3 Zookeeper nodes.  Thread about zookeeper kafka failover
http://lists.wikimedia.org/pipermail/analytics/2012-August/000095.html

Canonicalization Topology:
We currently have a single Kafka consumer - hourly cron job, pulling
from brokers, and pushing the raw logs straight into Hadoop (in
approximately 1 gig chunks).  We'd like to do much more processing at
this stage.  Most importantly, we'd like to anonymize the data at this
stage so that we don't have to be stingy about who we give access to
the Hadoop cluster.

The plan is to use Storm <http://storm-project.net/> as an ETL layer.
There is some custom dev work that needs to happen at this stage,
since log anonymization is not something that we're aware of any
off-the-shelf solutions that fit into this architecture (in
particular, with the ability to do the anonymization in real-time).
We plan to do a formal privacy and security review before relying on
whatever solution that we develop.

Glossary for Storm:
"Worker nodes": run Bolts.
"Bolt": a small unit of processing
"Topology": arrangement of bolts, flow of data.  Bolts operates on
data.  Bolts can be writte in any language.
"Tuple":  Unit of data in storm.
"Spout": source of tuples.
"Nimbus":  job manager/scheduler, doesn't actually do any processing.
If Nimbus goes down, topology still works.

We plan to have 8 hosts running as Storm workers.  This is probably
more than we need; it may be as few as 2-3.  If we're right about
this, it's easy enough to put that hardware to other uses.

Data Warehouse:
We're using Hadoop/HDFS to store data.  The logs currently come in
hourly blocks of 1:100 sampled data, which translates into lots of
1gig files stored hierarchically by date/time.

HDFS is block filesystem with 256 mb blocks.  Blocks are replicated 3
times over, each stored on different physical hardware.  A central
NameNode coordinates what gets stored where.

MapReduce Jobs:
Our Batch Processing System is Hadoop/MapReduce under the hood.  "Hue"
is used to provide web management user interface.  "Oozie" is
technology which manages XML-based workflow definition, and does
things like schedule regular updates to data.  "Pig" and "Hive" are
two different layers for building MapReduce jobs.

Pig is a simple domain-specific language for defining batch processing
jobs.  For those that have access to the analytics cluster, here's the
blog stats pig script:
http://hue.analytics.wikimedia.org/filebrowser/view/user/diederik/blog/blog…

Hive is a different technology, which turns SQL queries into Map/Reduce jobs.

Pig is currently the preferred tool, but we'll support Hive if it catches on.

A few other notes:
* Infrastructure and Hardware
** https://www.mediawiki.org/wiki/Analytics/Kraken/Infrastructure
** Current allocation plans refelct a priori estimates -- we'll be
revising them as we get a feel for the workloads (ex: 8 machines on
ETL is a ridiculous overestimate; we'll probably only need 2-3.)
* Alteratives & Comparisons
** We benchmarked and tested Cassandra as a datastore, but ultimately
went with HDFS
** Request Logging
*** Recommendation:
https://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging
*** Alternatives:
https://www.mediawiki.org/wiki/Analytics/Kraken/Logging_Solutions_Overview

I hope this is helpful to everyone, and please let me know if there's
anything I missed or got wrong here.

Thanks
Rob