Analytics September 2012

analytics@lists.wikimedia.org

20 participants
18 discussions

Datastax or Cloudera for Analytics Batch Processing
by Andrew Otto 03 Oct '12

03 Oct '12

As discussed in the TechOps meeting yesterday, the Analytics team is evaluating two different Hadoop distributions for the batch processing layer of the Kraken cluster: Cloudera Hadoop 4 (CDH4) and Datastax Enterprise (DSE). I'm going to try to describe from a very high level why we think DSE is purrrrrty cool, and why CDH4 sounds like a relative headache. I'll then ask the questions we hope to answer soon. (Quick disclaimer: This is a totally biased and incomplete summary of DSE vs CDH4. I am highlighting weaknesses of CDH4 in order to illustrate why we are considering DSE. If we are allowed to use DSE, then we will have more work to do weighing pros and cons of both solutions.) CDH4 is just a well packaged distribution of Hadoop and some of the most useful Hadoop tools. The core of Hadoop is MapReduce and the Hadoop Distributed Filesystem (HDFS). HDFS allows huge files to be written across any number of machines. MapReduce jobs can then be written to do parallel processing of these files. Hadoop has a single NameNode that is responsible for managing file metadata in HDFS. This is a notorious single point of failure. CDH4 attempts to solve this problem with the introduction of a Standby NameNode[1]. This requires that both the Active and Standby NameNodes share a filesystem (via NFS or something similar). The Standby nodes uses this to synchronize edits to its own file metadata namespace. If the Active NameNode goes down, the Standby can take over and *should* have the exact same metadata stored as the Active. Administrators have to take special care to ensure that only one of the two NameNodes is active at once, or metadata corruption could result. Read about Fencing[2] for more info. Not only is the NameNode a SPOF, but it doesn't scale well when there are many files or many metadata operations. More recent Hadoop releases solve this providing 'HDFS Federation'[3]. Basically, this is just multiple NameNodes sharing the independent spaces of the same data nodes. This is all more configuration to get Hadoop to scale, something it is supposed to be good at out of the box. All of the above also comes with lots of configuration to maintain and tweak to make work. DSE is supposed to solve these problems and be much easier to work with. DSE is Hadoop without HDFS. Instead, Datastax has written an HDFS emulation layer on top of Cassandra (CFS). The huge benefit here is that from the clients viewpoint, everything works just like Hadoop[4]. MapReduce jobs and all of the fancy Hadoop tools still work. But there's no NameNode to deal with. In Cassandra all nodes are peers. This allows for a much more homogenous cluster and less configuration[5]. Cassandra automatically rebalances its data when a node goes down. Ah but DSE has WMF related problems of its own! Most importantly, DSE is not 100% open source*. The core of DSE is open source components available under the Apache license (Hadoop, Cassandra, etc.). As far as I can tell, anything that is not an Apache project is proprietary. Datastax has a comparison[7] of its Community vs. Enterprise versions. This includes packaging and wrapper CLI tools, but most importantly the Cassandra File System (CFS) emulation layer. This piece is what makes DSE so attractive. Also note that DSE is not free. It is for development purposes, but it has a license that we'd have to buy if we want to run it in production. However, Diederik knows one of the founders of Datastax who might give it to us at a discount, or perhaps for free. Okokok. So I'm starting this thread in order to answer my big question: Are we allowed to use DSE even if it is not 100% open source? If the answer is an easy 'no', then our choice is easy: we will use CDH4. Is it possible to get an answer to this question before we go down the road of dedicating time to evaluating and learning DSE? tl;dr: Datastax is cooler than Cloudera, but Datastax is not 100% open source. Can we still use it? - otto + the Analytics Team *much like Cloudera for Hadoop, DataStax publishes tested, stable distributions of Cassandra and ecosystem tools bundled together into one package. In addition to selling support, their Enterprise edition contains proprietary code that, among other things, provides HDFS emulation on top Cassandra. This code originated as Brisk[7], and was forked when the developers founded DataStax. [1] https://ccp.cloudera.com/display/CDH4DOC/Introduction+to+Hadoop+High+Availa… [2] https://ccp.cloudera.com/display/FREE4DOC/Configuring+HDFS+High+Availability [3] http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federati… [4] http://www.datastax.com/faq#dse-2 [5] http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/ [6] http://wiki.apache.org/cassandra/HadoopSupport [7] https://github.com/riptano/brisk , https://github.com/riptano/brisk-hadoop-common

7 10

Nexus Artifact Repository for Maven
by David Schoonover 01 Oct '12

01 Oct '12

Okay! I've gotten Sonatype's Nexus Artifact Repository for Maven set up and working on Kripke: http://nexus.wmflabs.org/nexus/index.html#welcome If you want to fiddle, PM me for the admin password. Next week I'll write a tutorial and some notes about getting your dev env set up to work with Java (or other JVM languages). I'm planning on doing that for Eclipse, as that's the most popular OSS Java IDE. Cheers! -- David Schoonover dsc(a)wikimedia.org

3 2

Re: [Analytics] [GLAM] GLAM & Analytics dashboard
by Sumana Harihareswara 30 Sep '12

30 Sep '12

(cross-posting to analytics list) On 09/29/2012 12:13 PM, Lori Phillips wrote: [snip] > *Action step*: I think we should create a single page, be it on meta, > outreach, or mediawiki, where we compile all of the useful links, needs, > and suggestions for the Analytics Team. They can then take from that what > they can. But centralizing the information is key. This could include the > above Google Doc I mention, links to our current guides and partnerships > lists on Commons, info on the Europeana Toolset Project, and additional > suggestions we may have. I'd be appreciative to hear if someone, who is > more technically inclined than myself (and that doesn't take much) is > willing to take this on. It looks like people started at https://www.mediawiki.org/wiki/Analytics/Pageviews/GLAM , right? "The purpose of this page is to gather all the different documents, discussions and ideas related to GLAM Analytics and to define a list with the most important metrics. Please feel free to add any missing data, this is document is almost certain to be incomplete." > Hope that helps, and thanks for your patience as I attempt to faciliate > this topic which is far out of my comfort zone. > Lori Thanks for working on this, Lori! -- Sumana Harihareswara Engineering Community Manager Wikimedia Foundation

2 1

Fwd: [Wikitech-l] MediaWiki community metrics
by Sumana Harihareswara 28 Sep '12

28 Sep '12

Cross-posting. -------- Original Message -------- Subject: [Wikitech-l] MediaWiki community metrics Date: Fri, 28 Sep 2012 13:10:12 -0700 From: Quim Gil <quimgil(a)gmail.com> Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org> Hi there, Asking for a task to volunteer, Sumana encouraged me to look at the topic of community metrics. She pointed to https://wikitech.wikimedia.org/view/Pentaho as a starting point. After a first look at Pentaho and what some colleagues at the MeeGo project did with it [1], I searched (a bit) for any wiki pages or discussions about community metrics here, but couldn't find any. http://www.mediawiki.org/wiki/User:Qgil/MediaWiki_Community_Metrics Edits welcome! I'm looking for feedback, help, and a first prototype of an automatically refreshed report hopefully sooner than later. Something simple to build upon. Even if it's too tempting to define the first prototype thinking first on tools or data available, you are encouraged to start by proposing what questions do you want actually answered. What community trends do you want to know? See http://www.mediawiki.org/wiki/User:Qgil/MediaWiki_Community_Metrics#Trends_… - in few days we should have agreed on the first and most important trends we want to visualize. [1] http://wiki.meego.com/Metrics/Dashboard -- Quim _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

1 0

Fw: [E3-team] Intel: analytics platforms
by Ori Livneh 25 Sep '12

25 Sep '12

FYI. If you'd like an invite, shoot my an e-mail! Forwarded message: > From: Ori Livneh <ori(a)wikimedia.org> > Reply To: E3 team discussion list <e3-team(a)lists.wikimedia.org> > To: Internal E3 team discussion list <e3-team(a)lists.wikimedia.org> > Date: Tuesday, September 25, 2012 2:25:29 PM > Subject: [E3-team] Intel: analytics platforms > > Hey guys, > > I thought we should take a look at how the major analytics platforms do what they do: how they implement experiments, how they collect data, and how they present it for analysis. We can't use them on Wikipedia, so I created a simple web "app" that guides the user along a mock conversion funnel: > > http://cube.256.io/bi/ > > That page is logging events to Google Analytics, MixPanel, KISSMetrics and Optimizely. > > You should have received invitations for Google Analytics already. I'll follow up with the login details for the other services. > > Some of these services are smart enough to filter you out of the data stream if you log in to their service, so if you can, go through a part (or all) of the funnel on random computers, so we have some data. > > If this provokes ideas, or if you discover interesting things, please report back. > > > Ori > > -- > Ori Livneh > ori(a)wikimedia.org (mailto:ori@wikimedia.org) > > > > _______________________________________________ > E3-team mailing list > E3-team(a)lists.wikimedia.org (mailto:E3-team@lists.wikimedia.org) > https://lists.wikimedia.org/mailman/listinfo/e3-team >

1 0

Good article on Spanner
by Ori Livneh 25 Sep '12

25 Sep '12

"Spanner" is Google's new distributed data system. https://cloudant.com/blog/cloudant-labs-on-google-spanner/ -- Ori Livneh ori(a)wikimedia.org

1 0

Productionize WebSockets edit feed
by Ori Livneh 23 Sep '12

23 Sep '12

Hello analytics, I've set up an article edit / insert feed at http://kubo.wmflabs.org/editstream.html. It's receiving updates using a node.js WebSockets server running on stat1. I'd like to productionize it and wanted to solicit your input on how to do it right. I think it'd be useful to provide this stream as a service to the community. Each edit event is ~300 bytes of gzipped-compressed JSON data. With ~140,000 edits a day, the bandwidth per client is 0.5kbps. No filtering or buffering happens on the server, so I think it'll scale quite well. Should I simply submit a puppet patch to configure this service to run on stat1001? It'd be good to map the service onto a URL on bits, so that it's easily accessible from JS code running on Wikipedia. Thoughts? Let me know! Thanks, Ori -- Ori Livneh ori(a)wikimedia.org

6 11

Analytics Roadmap & Project Followups
by David Schoonover 21 Sep '12

21 Sep '12

Hello all, I've finished the followups from the planning meeting we had yesterday. - Our Sept and Oct roadmap notes are now integrated into the Platform roadmap page: https://www.mediawiki.org/wiki/Roadmap#Analytics - Analytics now has a Roadmap page of its own: https://www.mediawiki.org/wiki/Analytics/Roadmap - I created Project Pages (with status boxen) for all the top-level projects we identified that lacked them. (If you're CC'd directly on this mail, you're a project owner and need to review your project's page, update the status, etc. You know, be a good engineer :)). The full list: - Kraken: https://www.mediawiki.org/wiki/Analytics/Kraken - Limn (Dan): https://www.mediawiki.org/wiki/Analytics/Limn - Reportcard (Dan): https://www.mediawiki.org/wiki/Analytics/Reportcard - Infrastructure (Otto): https://www.mediawiki.org/wiki/Analytics/Infrastructure - Data Releases (Diederik): https://www.mediawiki.org/wiki/Analytics/Data_Releases - Legacy Logging (???): https://www.mediawiki.org/wiki/Analytics/Legacy_Logging -- I don't know who owns this, exactly, so please claim! - Wikistats: https://www.mediawiki.org/wiki/Analytics/Wikistats -- These pages are rather out of date (2011??), and this should be fixed. Who owns this? Erik? Stefan? Diederik? I also noticed/fixed a few things: - The Wikistats pages were off in random places, so I moved them under Analytics. - I added a team roster to https://www.mediawiki.org/wiki/Analytics - I added Dan in various places to project pages. - I updated our entry and projects on the Platform Engineering homepage: https://www.mediawiki.org/wiki/Wikimedia_Platform_Engineering#Analytics Cheers! -- David Schoonover dsc(a)wikimedia.org

3 2

Analytics Planning Meeting Notes
by David Schoonover 21 Sep '12

21 Sep '12

In my crusade to shame everyone into taking notes at their meetings, here are the notes from the Analytics Planning Meeting: https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/2012_Sept… -- David Schoonover dsc(a)wikimedia.org (mailto:dsc@wikimedia.org)

1 0

irc logging
by evan rosen 17 Sep '12

17 Sep '12

fyi, I know very little about IRC in general. Is this already happening? Is there a reason we don't want to? Should it be public? I was hoping to install irssi <http://irssi.org/> (andrew?) on stat1 and have it make logs that we could all check when we've joined the channel too late to get the hangout link, or need to look for link that someone posted a week ago. I'm imagining just grep it over an ssh connection for now, though if we wanted to share it publicly we could create a labs instance and run one of those web interfaces that is built around irssi logs (irclog<http://irclog.readthedocs.org/en/latest/index.html>, webssi <http://wouter.coekaerts.be/webssi/>--suggestions welcome as none of these look super canonical) thanks, evan

5 4

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics September 2012