Analytics November 2012

analytics@lists.wikimedia.org

20 participants
11 discussions

Using EventLogging for basic blog analytics
by Erik Moeller 30 Nov '12

I'm wondering if it would be feasible to use the EventLogging system for basic analytics on individual blog posts. Just having the number of viewers per post would already be a big win for communications.

Is this feasible, and if so, what would need to happen for it? Is it just a matter of inserting the eventlogging gif with the right parameters into each post, and giving the communications folks a simple web view of the data?

Thanks,
Erik

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate

3 3

Event Data Stream
by Andrew Otto 30 Nov '12

(Moving this thread to Analytics list.)

I just finished a discussion with David and Diederik about the log format for this thing. Here's what we got right now:

* Request path
* Query params
* HTTP host (aka request hostname)
* Timestamp
* Client IP (aka remote address/host)
* X-Forwarded-For
* Referer
* Accept-Language
* Cookie
* X-WAP-Profile
* User-Agent
* Server-Hostname
* Sequence Number

The corresponding varnishncsa log format string is:

    '%U %q %{Host}i %t %h %{X-Forwarded-For}i %{Referer}i %{Accept-Language}i %{Cookie}i %{X-WAP-Profile}i %{User-agent}i %l %n'

(Note the literal tabs in that string. varnishncsa doesn't translate "\t", afaict.)

I've tested this on my log1 labs instance via curl + varnish + varnishncsa. This curl command:

    curl --cookie 'uid=deadbeef; pageload_id=3;' -H "x-wap-profile: http://nds1.nds.nokia.com/uaprof/N6230ir200.xml" -H "X-Forwarded-For: 192.168.0.123" -H "Referer: http://www.google.com" -H "Accept-Language: en-US" "http://localhost:6081/event/e3?lol=dongs&foo(bar/baz)=*&this=<that/>"

Results in this log line:

    /event/e3 ?lol=dongs&foo(bar/baz)=*&this=<that/> localhost:6081 2012-11-05T21:24:15 127.0.0.1 192.168.0.123 http://www.google.com en-US uid=deadbeef;%20pageload_id=3; http://nds1.nds.nokia.com/uaprof/N6230ir200.xml curl/7.19.7%20(x86_64-pc-linux-gnu)%20libcurl/7.19.7%20OpenSSL/0.9.8k%20zlib/1.2.3.3%20libidn/1.15 i-00000239.pmtpa.wmflabs 8

Note that fields (like User-Agent) are URL encoded, whereas the query params are not.

Ori and others, thoughts thus far? If we are fine with this, Asher can move forward with making this stream available.

Also, I think we are still waiting on this RT ticket, right?
https://rt.wikimedia.org/Ticket/Display.html?id=3760

-Ao

On Oct 31, 2012, at 4:58 PM, Andrew Otto <otto(a)wikimedia.org> wrote:

> Hi guys!
>
> I wanted to write an email to summarize some of the chats I just had with a few of you. We were all talking about how to set up a single /event data stream from varnish that we could all share. Here's what we got:
>
> Asher will set up varnish to match for "^/event/.*". Any request that matches this will return a 204 response. A varnishncsa instance will then log this event to a shared stream.
>
> The URL will be expected to contain a product_code, as in /event/<product_code>. Consumers of this stream can filter out their relevant events by matching against their product code. The URL and query params will be the first fields in each generated event, to allow for easy filtering. The rest of the log line will contain useful request data (client IPs, hostnames, seq numbers, etc.). We're still working out the exact log format, but it will contain all of the data that E3 needs, plus more that other consumers will find useful. Here's a preliminary list of fields:
>
> * URL (not including requested hostname. e.g. /event/<product_code>/ )
> * Query params
> * Timestamp
> * Client IP (aka remote host)
> * X-Forwarded-For
> * Referer
> * Server Hostname
> * Sequence number
> * Request service time in ms
> * Accept-Language (?)
> * Cookie (?)
> * User-Agent
>
> Obviously, this format still needs some work. We'll talk more about this tomorrow, so if you've got thoughts let us know.
>
> Thanks to all for chatting with me and working this out today! Asher, I will get you a varnishncsa format string soon.
>
> -AO
>
> P.S. Apologies if this email is rambley, I did not proofread it. Ok byeeeeee I gotta go move a piano!
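For readers who want to consume a stream like this, here is a minimal Python sketch of how a consumer might split one such line and pick out its product code. It assumes the fields are tab-separated in the order listed in the message; the key names in `FIELDS`, the `parse_event_line` helper, and the shortened User-Agent in the sample are illustrative and not part of the thread.

```python
from urllib.parse import unquote

# Field order follows the list in the message; these key names are illustrative.
FIELDS = [
    "path", "query", "host", "timestamp", "client_ip", "x_forwarded_for",
    "referer", "accept_language", "cookie", "x_wap_profile", "user_agent",
    "server_hostname", "sequence",
]


def parse_event_line(line):
    """Split one tab-separated /event log line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    event = dict(zip(FIELDS, values))
    # Per the message, header fields arrive URL-encoded while the query
    # params do not. Only two header fields are decoded in this sketch.
    for name in ("cookie", "user_agent"):
        event[name] = unquote(event[name])
    # "/event/<product_code>" splits into ["", "event", "<product_code>"].
    parts = event["path"].split("/")
    event["product_code"] = parts[2] if len(parts) > 2 else ""
    # Query params are left undecoded, only split into key/value pairs.
    event["params"] = dict(
        pair.split("=", 1)
        for pair in event["query"].lstrip("?").split("&")
        if "=" in pair
    )
    event["sequence"] = int(event["sequence"])
    return event


# Sample built from the log line in the message (User-Agent shortened).
SAMPLE = "\t".join([
    "/event/e3",
    "?lol=dongs&foo(bar/baz)=*&this=<that/>",
    "localhost:6081",
    "2012-11-05T21:24:15",
    "127.0.0.1",
    "192.168.0.123",
    "http://www.google.com",
    "en-US",
    "uid=deadbeef;%20pageload_id=3;",
    "http://nds1.nds.nokia.com/uaprof/N6230ir200.xml",
    "curl/7.19.7%20(x86_64-pc-linux-gnu)%20libcurl/7.19.7",
    "i-00000239.pmtpa.wmflabs",
    "8",
])

event = parse_event_line(SAMPLE)
print(event["product_code"], event["params"], event["user_agent"])
```

A consumer interested only in its own events could then filter on `event["product_code"]`, which is the reason the message puts the path and query first.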

3 3

Re: [Analytics] [Engineering] Lightning talks for metrics meeting next week - 12/6
by Andrew Otto 30 Nov '12

Hey Erik!

I've added a line to the Metrics Meeting Agenda. If you've still got a slot for us, Evan Rosen and I will give a quick demo of Kraken. I'll give a quick overview of how it can be accessed and what it can currently be used for, and Evan will show off some things he is already doing with it.

Thanks! Laters!
-Andrew Otto

On Nov 27, 2012, at 10:43 PM, Erik Moeller <erik(a)wikimedia.org> wrote:

> Hi folks,
>
> if anyone would like to give a 3 minute (with 2 minute Q&A) or 5
> minute (without Q&A) update on a project you're working on in
> engineering, please add it to the schedule here:
>
> https://meta.wikimedia.org/wiki/Metics_and_activities_meetings/2012-12-06
>
> [Reminder - these meetings are publicly streamed and documented forever.]
>
> Presentation guidelines:
> https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings#Guidelines_…
>
> We've got priority updates due for Mobile and Visual Editor, so those
> will definitely be on the agenda, but other updates are welcome as
> well.
>
> Thanks much,
> Erik
>
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
>
> _______________________________________________
> Engineering mailing list
> Engineering(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/engineering

3 3

RFC: Building a frontend performance analysis platform
by Asher Feldman 30 Nov '12

We've recently begun trialing a few frontend performance monitoring services - Keynote, Gomez, trying to get the most out of Watchmouse. They have their individual pros and cons, and when they report sporadic issues, it can be difficult to correlate them to actual user experiences (how many users were affected, where, and to what extent?). The dearth of data around end-user page load times (and things like domComplete) is a major blind spot.

Now that /event messages are flowing from bits to both kraken and vanadium, I think an initial in-house system to analyze page load times as measured by actual users could be rapidly prototyped, and trump the above trials. This may already be an eventual deliverable for kraken, but given the drive behind the current trials, why wait?

The client side would be simple js - for n% of page views from a supported browser (ie >= 9, chrome >= 6, ff >= 6, android >= 4.0) fire off an event request containing everything relevant from the window.performance.timing object (https://developer.mozilla.org/en-US/docs/Navigation_timing).

On the backend, perhaps some frequently periodic processing around geoip lookups and ISP (or other network path) determination before going into a data store from which we pull structured data for pretty numbers and pictures.

The end result should be able to help identify everything from js/dom performance issues after a release, to who we should peer with and where we should provision our next edge cache center.

My main questions right now:

- Would vanadium or kraken be better suited for building this sooner rather than later (within a few weeks)?
- Would anyone like to help? (David, your guidance around coding the frontend visualization would be highly valued even if you don't have a day or two to personally throw at it.)

Asher
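The proposal leaves the details open, so the following is only a rough Python sketch of two pieces it describes: sampling n% of page views, and turning a reported window.performance.timing snapshot into durations on the backend. The attribute names (`navigationStart`, `responseStart`, `domComplete`, `loadEventEnd`) come from the Navigation Timing interface; the function names, the metric names, and the idea of doing the subtraction server-side are assumptions for illustration. The client-side collection itself would be JavaScript, as the message says.

```python
import random


def should_sample(percent, rng=random.random):
    """Return True for roughly `percent` percent of calls."""
    return rng() * 100 < percent


def derive_metrics(timing):
    """Convert absolute Navigation Timing timestamps (ms) into durations.

    `timing` is a dict holding attributes copied from
    window.performance.timing. Attributes that are 0 (event not yet
    reached) or missing are skipped.
    """
    start = timing["navigationStart"]
    wanted = (
        ("time_to_first_byte_ms", "responseStart"),
        ("dom_complete_ms", "domComplete"),
        ("page_load_ms", "loadEventEnd"),
    )
    metrics = {}
    for metric_name, attribute in wanted:
        value = timing.get(attribute, 0)
        if value and value >= start:
            metrics[metric_name] = value - start
    return metrics


# Hypothetical snapshot taken before the load event finished.
snapshot = {
    "navigationStart": 1000,
    "responseStart": 1120,
    "domComplete": 1900,
    "loadEventEnd": 0,
}
print(derive_metrics(snapshot))
```

Skipping zero-valued attributes matters because a beacon fired before the load event completes would otherwise produce large negative durations.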

6 11

Roadmap Progress Updated
by David Schoonover 14 Nov '12

Hi err'body.

In the fullness of time, and with considerable help from my friends, the roadmap page has been updated with the steps taken in our forced-march on Progress. As I am tremendously fallible, you are strongly encouraged to review and correct the contents therein via some sort of mysterious collaborative editing process:

https://www.mediawiki.org/wiki/Analytics/Roadmap

Cheers,

--
David Schoonover
dsc(a)wikimedia.org

1 0

Web Access Log Format Changes
by Andrew Otto 13 Nov '12

Hi guys!

So, we've had a Todo on our list for a while now to make a couple of tweaks to the web access log format coming from squid, varnish and nginx.

1. Append Accept-Language and X-Carrier headers.

This brings the field count from 14 up to 16. udp-filter has already been modified to handle this. I've already got a change in for this: https://gerrit.wikimedia.org/r/#/c/12188/

2. Change field separator from space to tab.

User-Agent and Content-Type headers (and possibly others) sometimes contain spaces. Some sources (e.g. varnish) properly URL encode the fields before they are sent out, but others don't. Using tab as the field separator in web access logs will avoid many of these issues.

We have wanted to do this for a while, but haven't because we were worried about breaking Erik Zachte's wikistats scripts. Stefan Petrea is now working with Diederik on wikistats (and other things), and has dealt with this issue. So! We are ready! We'd like to make this change before we start real consumption of the web access logs into the Kraken cluster, which hopefully will be relatively soon.

Would these changes cause Fundraising any foreseeable problems? Can we go ahead and work with ops to push this through?

Thanks!
-Andrew Otto
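The separator problem in point 2 is easy to reproduce. The short Python sketch below uses a made-up four-field record (not the real 14- or 16-field access log layout) whose Content-Type and User-Agent values contain unencoded spaces, and compares splitting on space with splitting on tab.

```python
# Hypothetical record: hostname, client IP, Content-Type, User-Agent.
# The last two contain spaces and are not URL-encoded.
record = [
    "cache-host.example",
    "192.0.2.1",
    "text/html; charset=UTF-8",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

space_line = " ".join(record)
tab_line = "\t".join(record)

# Splitting on space breaks the two header values into extra fields.
space_fields = space_line.split(" ")
# Splitting on tab recovers the original four fields.
tab_fields = tab_line.split("\t")

print(len(space_fields), len(tab_fields))
```

With space as the separator a parser sees eight fields instead of four and can no longer tell which column is which; with tab the record round-trips, provided no field itself contains a tab.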

6 9

Kraken Hardware Assignments
by Andrew Otto 13 Nov '12

Woooweeeee!

Now that we've got all of our servers up and running, let's take a minute to assign them all their official roles. Summary of what we've got:

analytics1001 - analytics1010: Cisco UCS C250 M1
    192G RAM
    8 x 300G = 2.4T
    24 core X5650 @ 2.67 GHz

analytics1011 - analytics1022: Dell PowerEdge R720
    48G RAM
    12 * 2T = 24T
    12 core E5-2620 @ 2.00GHz

analytics1023 - analytics1027: Dell PowerEdge R310
    8G RAM
    2 * 1G = 2G
    4 core X3430 @ 2.40GHz

an11 - an22 are easy. They should be Hadoop Worker (HDFS) nodes, since they have so much storage space!

an23 - an27 are relative weaklings and should not be used for compute or data needs. I've currently got Zookeepers running on an23, an24 and an25 (we need 3 for a quorum), and I think we should keep it that way.

The remaining assignments require more discussion. Our NameNode is currently on an01, but as Diederik pointed out last week, this is a bit of a waste of a node, since it is so beefy. I'd like to suggest that we use an26 and an27 for NameNode and backup NameNode.

My rudimentary Snappy compression test reduces web access log files to about 33% of their original size. According to the unsampled file we saved back in August, uncompressed web request logs generate about 100 GB / hour. Rounded (way) up, that's 20 TB / week. If we snappy compress and do more rounding up, that's 7 TB / week*.

an11, an12: Kafka Brokers
I had wanted to use all of the R720s as hadoop workers, but we'd like to be able to store a week's worth of Kafka log buffer. There isn't enough storage space on the other machines to do this, so I think we should use two of these as Kafka brokers. If we RAID 1 the buffer drives (which we probably should), that makes the Kafka buffer 10 TB (2 nodes * 10 2 TB drives / 2 (for RAID)), which should be enough to cover us for a while.

an01 - an10: Storm/ETL
These are beefy (tons of RAM, 24 core), so these will be good for hefty realtime stuff. We could also take a few of these and use them as Hadoop workers, but since they don't really have that much space to add to the HDFS pool, I'm not sure if it is worth it.

an23 - an25: ZooKeepers
As I said above, let's keep the ZKs here.

an26, an27: Hadoop Masters
Move the NameNodes (primary and secondary/failover) here.

an13 - an22: Hadoop Workers
We need to use the first 2 drives in RAID 1 for the OS, so really we only have 10 drives for HDFS space. Still, that gives us 200 TB. With an HDFS replication factor of 3, that's 67 TB HDFS.

Thoughts? Since we'll def want to use an13-an22 as workers, I'll start spawning those up and adding them to the cluster today.

Yeehaw!
-Ao

*For the sake of simplicity, I'm not counting other input sources (event log, sqoop, etc.), and instead hoping that rounding up as much as I did will cover these needs.
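The storage figures in this message can be re-derived with a few lines of arithmetic. The Python sketch below uses only the numbers stated above (100 GB/hour, 33% compression, 10 data drives of 2 TB per R720, replication factor 3); the function and key names are illustrative, and decimal units (1 TB = 1000 GB) are an assumption. Note that evaluating the Kafka formula as written gives 10 TB per broker and 20 TB across both brokers, so the 10 TB figure in the message appears to correspond to the per-broker number.

```python
HOURS_PER_WEEK = 24 * 7


def capacity_estimates():
    """Recompute the capacity figures quoted in the message (all in TB)."""
    raw_tb_per_week = 100 * HOURS_PER_WEEK / 1000      # 100 GB/hour of logs
    compressed_tb_per_week = raw_tb_per_week * 0.33    # Snappy, about 33%

    data_drives_per_node = 10   # 12 drives minus 2 for the OS in RAID 1
    drive_tb = 2

    # Kafka brokers (an11, an12): buffer drives mirrored with RAID 1.
    kafka_tb_per_broker = data_drives_per_node * drive_tb / 2
    kafka_tb_total = 2 * kafka_tb_per_broker

    # Hadoop workers (an13 - an22): 10 nodes, HDFS replication factor 3.
    hdfs_raw_tb = 10 * data_drives_per_node * drive_tb
    hdfs_usable_tb = hdfs_raw_tb / 3

    return {
        "raw_tb_per_week": raw_tb_per_week,
        "compressed_tb_per_week": compressed_tb_per_week,
        "kafka_tb_per_broker": kafka_tb_per_broker,
        "kafka_tb_total": kafka_tb_total,
        "hdfs_raw_tb": hdfs_raw_tb,
        "hdfs_usable_tb": hdfs_usable_tb,
    }


estimates = capacity_estimates()
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

The unrounded weekly volume comes out at 16.8 TB raw and about 5.5 TB compressed, which the message rounds up to 20 TB and 7 TB to leave headroom for the other input sources mentioned in the footnote.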

3 7

Re: [Analytics] Bug reports against UserDailyContribs
by David Schoonover 06 Nov '12

Howdy Andre,

The ClickTracking extension has a bit of a complicated history -- afaik there are three or four forks of it, and it's sorta maintained by several different teams. I don't personally know that much about it, so I've cc'd the Analytics list and some of the blokes I believe know more about it.

Cheers!

--
David Schoonover
dsc(a)wikimedia.org

On Friday, 12 October 2012 at 10:10 a, Trevor Parscal wrote:

> Nimish doesn't work for WMF anymore, and I don't know where his @wikimedia.org (http://wikimedia.org) email messages end up.
>
> This is a dependency for ClickTracking (and nothing else afaik), and should probably be merged together with it (in both software and bugs).
>
> Generally, it's to do with stats, so David Schoonover (cc'd) is a better person to ask about this.
>
> - Trevor
>
> On Fri, Oct 12, 2012 at 6:37 AM, Andre Klapper <aklapper(a)wikimedia.org (mailto:aklapper@wikimedia.org)> wrote:
> > Hi,
> >
> > contacting you as you are listed as maintainers on
> > https://www.mediawiki.org/wiki/Extension:UserDailyContribs
> >
> > According to
> > https://www.mediawiki.org/wiki/Category:Extensions_used_on_Wikimedia
> > this extension is deployed on Wikimedia, but I cannot find a good place
> > where to report bugs.
> >
> > Would it be useful if I created a dedicated component for this extension
> > in Bugzilla under the "MediaWiki extensions" product, and set you as the
> > default assignee for bug reports filed under it?
> >
> > Currently many reports get filed in the "[other]" component of the
> > "MediaWiki extensions" product in Bugzilla where they are hard to find
> > for maintainers.
> > A dedicated component would make it easier to report and get aware of
> > issues for this specific extension.
> >
> > Thanks,
> > andre
> > --
> > Andre Klapper | Wikimedia Bugwrangler
> > http://blogs.gnome.org/aklapper/

5 5

New patchset adds edit history graph to InfoAction
by Sumana Harihareswara 02 Nov '12

https://gerrit.wikimedia.org/r/31603

This patchset, which has not yet been merged into MediaWiki, adds an edit history graph to InfoAction, per https://bugzilla.wikimedia.org/show_bug.cgi?id=41329 .

"Added the jQuery extension jqPlot to make graphs. Then added a section for analytics to InfoAction, in which is a graph of monthly edits."

I thought people interested in analytics would want to know about this. :-) Thanks for the patchset, Tyler Romeo.

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

3 2

userstats Python library - user-centric metrics on Wikipedia users
by Sumana Harihareswara 02 Nov '12

https://github.com/embr/userstats

"We're pleased to release version 0.1.0 of the userstats Python library and command-line tool for computing user-centric metrics on Wikipedia users. The goal of the software is to make it easy for project owners to track the contributions and status of users involved in their project. It is also intended to be easily extensible so that custom metrics can be added using only a few lines of Python code."

From the "Global Learning and Grantmaking" section of the September WMF report:
https://blog.wikimedia.org/2012/10/31/wikimedia-foundation-report-september…

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

1 0
