Howdy Andre,
The ClickTracking extension has a bit of a complicated history -- afaik there are three or four forks of it, and it's sorta maintained by several different teams. I don't personally know that much about it, so I've cc'd the Analytics list and some of the blokes I believe know more about it.
Cheers!
--
David Schoonover
dsc(a)wikimedia.org
On Friday, 12 October 2012 at 10:10 AM, Trevor Parscal wrote:
> Nimish doesn't work for WMF anymore, and I don't know where his @wikimedia.org (http://wikimedia.org) email messages end up.
>
> This is a dependency for ClickTracking (and nothing else afaik), and should probably be merged together with it (in both software and bugs).
>
> Generally, it's to do with stats, so David Schoonover (cc'd) is a better person to ask about this.
>
> - Trevor
>
> On Fri, Oct 12, 2012 at 6:37 AM, Andre Klapper <aklapper(a)wikimedia.org (mailto:aklapper@wikimedia.org)> wrote:
> > Hi,
> >
> > contacting you as you are listed as maintainers on
> > https://www.mediawiki.org/wiki/Extension:UserDailyContribs
> >
> > According to
> > https://www.mediawiki.org/wiki/Category:Extensions_used_on_Wikimedia
> > this extension is deployed on Wikimedia, but I cannot find a good place
> > to report bugs.
> >
> > Would it be useful if I created a dedicated component for this extension
> > in Bugzilla under the "MediaWiki extensions" product, and set you as the
> > default assignee for bug reports filed under it?
> >
> > Currently many reports get filed in the "[other]" component of the
> > "MediaWiki extensions" product in Bugzilla where they are hard to find
> > for maintainers.
> > A dedicated component would make it easier to report issues for this
> > specific extension and for maintainers to become aware of them.
> >
> > Thanks,
> > andre
> > --
> > Andre Klapper | Wikimedia Bugwrangler
> > http://blogs.gnome.org/aklapper/
> >
> >
>
As far as I know nobody from WMF is attending this year, but they have a pretty meaty program: http://nyc2012.pydata.org/schedule/
(there is also a session by Didier Deshommes on Wikipedia Indexing And Analysis)
D
You may remember that one of the reasons not to consider a potential partnership/collaboration with MetaMarkets was that part of their analytics stack was proprietary. Today they announced that they are open sourcing Druid, the distributed data store that powers their dashboards:
http://metamarkets.com/category/technology/druid/
Dario
Hi, these days I'm polishing the presence of the Wikimedia Foundation at
http://ohloh.net in order to help promote all the cool open source
projects we are developing.
The Analytics team has a nice section in Gerrit including many projects
(see below). How would you prefer to have this reflected in Ohloh?
a) All repos under a generic Wikimedia Analytics project.
b) A few notable projects on their own (e.g. reportcard,
gerrit-stats...) and the rest under a common umbrella.
c) Each repo has its own project.
Option A is easier to implement but offers less detail. Option C
might add too much fragmentation if there are many tiny, less
relevant projects. I can go with whichever option you prefer.
Also let me know if there are repos not worth listing in Ohloh, e.g.
internal stuff, playgrounds, data-only repos interesting only to WMF
dudes...
Thank you!
--
Quim
Hiya all,
Last week, the A Team sat down together and spent a lot of time scrunching up our faces. The result was a roadmap for Kraken now that the pieces have taken shape and the research phase is basically done. All the components and tasks were examined; you can find the summary here:
https://www.mediawiki.org/wiki/Analytics/Kraken
We focused on the Pixel Service, the component for public data import, and filled in quite a bit of detail, as well as outlined a prototype that we plan to have running in the next two weeks. You can read more about that at:
https://www.mediawiki.org/wiki/Analytics/Kraken/Pixel_Service
Finally, we sketched out some near-term milestones for November, and updated the team roadmap page, and ported those changes to the platform roadmap:
https://www.mediawiki.org/wiki/Analytics/Roadmap
https://www.mediawiki.org/wiki/Roadmap#Analytics
If you have any questions or comments, definitely let us know!
--
David Schoonover
dsc(a)wikimedia.org
As discussed in the TechOps meeting yesterday, the Analytics team is evaluating two different Hadoop distributions for the batch processing layer of the Kraken cluster: Cloudera Hadoop 4 (CDH4) and Datastax Enterprise (DSE). I'm going to try to describe from a very high level why we think DSE is purrrrrty cool, and why CDH4 sounds like a relative headache. I'll then ask the questions we hope to answer soon.
(Quick disclaimer: This is a totally biased and incomplete summary of DSE vs CDH4. I am highlighting weaknesses of CDH4 in order to illustrate why we are considering DSE. If we are allowed to use DSE, then we will have more work to do weighing pros and cons of both solutions.)
CDH4 is just a well-packaged distribution of Hadoop and some of the most useful Hadoop tools. The core of Hadoop is MapReduce and the Hadoop Distributed Filesystem (HDFS). HDFS allows huge files to be written across any number of machines. MapReduce jobs can then be written to process these files in parallel.
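The MapReduce model described above can be sketched with the classic word-count job. The mapper and reducer below are written in the Hadoop Streaming style, with a small driver that simulates the shuffle locally; on a real cluster the same two functions would run on many machines, each over a different block of an HDFS file (this is an illustrative sketch, not code from the Kraken cluster):

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key (Hadoop's shuffle/sort
    # guarantees this ordering); sum the counts for each distinct word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

def run_local(lines):
    # Simulate Hadoop's shuffle by sorting the map output before reducing.
    return dict(reducer(sorted(mapper(lines))))
```

Running `run_local(["a b a", "b a"])` yields `{"a": 3, "b": 2}`; on a cluster the same logic is distributed across DataNodes.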
Hadoop has a single NameNode that is responsible for managing file metadata in HDFS. This is a notorious single point of failure.
CDH4 attempts to solve this problem with the introduction of a Standby NameNode[1]. This requires that both the Active and Standby NameNodes share a filesystem (via NFS or something similar). The Standby node uses this shared storage to synchronize edits to its own copy of the file metadata namespace. If the Active NameNode goes down, the Standby can take over and *should* have exactly the same metadata as the Active. Administrators have to take special care to ensure that only one of the two NameNodes is active at once, or metadata corruption could result. Read about Fencing[2] for more info.
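Concretely, that HA setup is driven by a handful of hdfs-site.xml properties. The nameservice id (`kraken`), host names, and NFS path below are invented for illustration; only the property names come from the Hadoop HA docs:

```xml
<!-- hdfs-site.xml (sketch; nameservice id, hosts and paths are invented) -->
<property><name>dfs.nameservices</name><value>kraken</value></property>
<property><name>dfs.ha.namenodes.kraken</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.kraken.nn1</name><value>namenode1:8020</value></property>
<property><name>dfs.namenode.rpc-address.kraken.nn2</name><value>namenode2:8020</value></property>
<!-- Shared edits directory: the NFS mount both NameNodes can see. -->
<property><name>dfs.namenode.shared.edits.dir</name><value>file:///mnt/nfs/ha-edits</value></property>
<!-- Fencing: make sure the old Active really is dead before failover. -->
<property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
```

This is exactly the extra moving machinery (shared storage, fencing, failover config) that DSE claims to do away with.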
Not only is the NameNode a SPOF, it also doesn't scale well when there are many files or many metadata operations. More recent Hadoop releases address this by providing 'HDFS Federation'[3]: multiple independent NameNodes, each managing its own namespace, sharing the same pool of data nodes. This is all yet more configuration to get Hadoop to scale, something it is supposed to be good at out of the box.
All of the above also comes with lots of configuration to maintain and tweak to make it work. DSE is supposed to solve these problems and be much easier to work with.
DSE is Hadoop without HDFS. Instead, Datastax has written an HDFS emulation layer on top of Cassandra (CFS). The huge benefit here is that from the client's viewpoint, everything works just like Hadoop[4]. MapReduce jobs and all of the fancy Hadoop tools still work. But there's no NameNode to deal with. In Cassandra all nodes are peers. This allows for a much more homogeneous cluster and less configuration[5]. Cassandra automatically rebalances its data when a node goes down.
Ah but DSE has WMF related problems of its own! Most importantly, DSE is not 100% open source*. The core of DSE is open source components available under the Apache license (Hadoop, Cassandra, etc.). As far as I can tell, anything that is not an Apache project is proprietary. Datastax has a comparison[7] of its Community vs. Enterprise versions. This includes packaging and wrapper CLI tools, but most importantly the Cassandra File System (CFS) emulation layer. This piece is what makes DSE so attractive.
Also note that DSE is not free. It is free for development purposes, but it has a license that we'd have to buy if we want to run it in production. However, Diederik knows one of the founders of Datastax, who might give it to us at a discount, or perhaps for free.
Okokok. So I'm starting this thread in order to answer my big question: Are we allowed to use DSE even if it is not 100% open source? If the answer is an easy 'no', then our choice is easy: we will use CDH4. Is it possible to get an answer to this question before we go down the road of dedicating time to evaluating and learning DSE?
tl;dr: Datastax is cooler than Cloudera, but Datastax is not 100% open source. Can we still use it?
- otto + the Analytics Team
*Much like Cloudera does for Hadoop, DataStax publishes tested, stable distributions of Cassandra and ecosystem tools bundled together into one package. In addition to selling support, their Enterprise edition contains proprietary code that, among other things, provides HDFS emulation on top of Cassandra. This code originated as Brisk[7], and was forked when the developers founded DataStax.
[1] https://ccp.cloudera.com/display/CDH4DOC/Introduction+to+Hadoop+High+Availa…
[2] https://ccp.cloudera.com/display/FREE4DOC/Configuring+HDFS+High+Availability
[3] http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federati…
[4] http://www.datastax.com/faq#dse-2
[5] http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
[6] http://wiki.apache.org/cassandra/HadoopSupport
[7] https://github.com/riptano/brisk , https://github.com/riptano/brisk-hadoop-common
Thanks, Sumana, for forwarding this. Maybe this is a better list to
discuss
http://www.mediawiki.org/wiki/User:Qgil/MediaWiki_Community_Metrics
further anyway.
On 09/28/2012 01:10 PM, Quim Gil wrote:
> Even if it's tempting to define the first prototype by thinking first
> of the tools or data available, you are encouraged to start by proposing
> what questions you actually want answered. What community trends do
> you want to know about?
A first proposal to draw out your pros and cons:
=== Developers ===
* [[Developers]] with [[Gerrit]] access.
** Reviewers.
** Core developers with merge permissions.
** Active in the past week / month / year.
** WMF employees, other MediaWiki professionals, hobbyists.
** Countries where they work from.
* New accounts.
** How many requests (approved, declined?) per week / month / year.
** Primary motivation: new or existing project - which projects.
** WMF employees, other MediaWiki professionals, hobbyists.
** Countries where they work from.
=== Software projects ===
* [https://gerrit.wikimedia.org/r/#/admin/projects/ Projects in Gerrit]
** Types of project: MediaWiki core, extensions, mobile, infrastructure...
** Active in the past week / month / year.
** Officially supported.
** Considered stable, beta, experimental.
* Data per project:
** Commits (merged, rejected, waiting) and reviews.
** Committers and reviewers.
** WMF employees, other MediaWiki professionals, hobbyists.
** Countries where they work from.
Do you agree on the priority of these data points?
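Several of the per-project data points above (merged/rejected commit counts, committers) can in principle be pulled straight from Gerrit. As a sketch, here is what a query against Gerrit's REST changes endpoint would look like in Python. Caveats: the REST API shipped in later Gerrit releases and the instance running today may predate it, and the `fetch` parameter stands in for whatever HTTP client is used; treat both as assumptions:

```python
import json

GERRIT = "https://gerrit.wikimedia.org/r"
XSSI_PREFIX = ")]}'"

def parse_gerrit_json(body):
    # Gerrit prefixes its JSON responses with ")]}'" to defeat
    # cross-site script inclusion; strip it before decoding.
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)

def count_changes(fetch, project, status):
    # fetch(url) -> response body as a string (e.g. urllib in practice).
    # Returns the number of changes in the given project with the
    # given status ("merged", "abandoned", "open", ...).
    url = "%s/changes/?q=project:%s+status:%s" % (GERRIT, project, status)
    return len(parse_gerrit_json(fetch(url)))
```

With a real HTTP client plugged in as `fetch`, `count_changes(fetch, "mediawiki/core", "merged")` would give one of the "commits (merged)" numbers proposed above.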
I wonder how much of this can already be extracted with
http://gerrit-stats.wmflabs.org/ . /me must look deeper.
The first roadblock found has been
Bug 40662 - Can't create a new gerrit-stats report
https://bugzilla.wikimedia.org/40662
> See
> http://www.mediawiki.org/wiki/User:Qgil/MediaWiki_Community_Metrics#Trends_…
> - in a few days we should have agreed on the first and most important
> trends we want to visualize.
PS: http://www.ohloh.net/p/mediawiki/contributors seems to be stuck in svn?
--
Quim
Okay! I've gotten Sonatype's Nexus Artifact Repository for Maven set up and working on Kripke: http://nexus.wmflabs.org/nexus/index.html#welcome
If you want to fiddle, PM me for the admin password. Next week I'll write a tutorial and some notes about getting your dev env set up to work with Java (or other JVM languages). I'm planning on doing that for Eclipse, as that's the most popular OSS Java IDE.
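Until the tutorial lands, pointing a local Maven at the repository amounts to a mirror entry in ~/.m2/settings.xml. The mirror id and the repository path below are guesses; only the host comes from the link above, so check the Nexus UI for the real group URL:

```xml
<settings>
  <mirrors>
    <mirror>
      <!-- The id and the /nexus/content/groups/public path are guesses;
           verify the group URL in the Nexus web UI before relying on it. -->
      <id>wmf-nexus</id>
      <mirrorOf>*</mirrorOf>
      <url>http://nexus.wmflabs.org/nexus/content/groups/public</url>
    </mirror>
  </mirrors>
</settings>
```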
Cheers!
--
David Schoonover
dsc(a)wikimedia.org