I'm wondering if it would be feasible to use the EventLogging system
for basic analytics on individual blog posts. Just having viewer counts
per post would already be a big win for communications.
Is this feasible, and if so, what would need to happen for it? Is it
just a matter of inserting the eventlogging gif with the right
parameters into each post, and giving the communications folks a
simple web view of the data?
Thanks,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
(Moving this thread to Analytics list.)
I just finished a discussion with David and Diederik about the log format for this thing. Here's what we got right now:
* Request path
* Query params
* HTTP host (aka request hostname)
* Timestamp
* Client IP (aka remote address/host)
* X-Forwarded-For
* Referer
* Accept-Language
* Cookie
* X-WAP-Profile
* User-Agent
* Server-Hostname
* Sequence Number
The corresponding varnishncsa log format string is:
'%U %q %{Host}i %t %h %{X-Forwarded-For}i %{Referer}i %{Accept-Language}i %{Cookie}i %{X-WAP-Profile}i %{User-agent}i %l %n'
(Note the literal tabs in that string. varnishncsa doesn't translate "\t", afaict.)
I've tested this on my log1 labs instance via curl + varnish + varnishncsa.
This curl command:
curl --cookie 'uid=deadbeef; pageload_id=3;' -H "x-wap-profile: http://nds1.nds.nokia.com/uaprof/N6230ir200.xml" -H "X-Forwarded-For: 192.168.0.123" -H "Referer: http://www.google.com" -H "Accept-Language: en-US" "http://localhost:6081/event/e3?lol=dongs&foo(bar/baz)=*&this=<that/>"
Results in this log line:
/event/e3 ?lol=dongs&foo(bar/baz)=*&this=<that/> localhost:6081 2012-11-05T21:24:15 127.0.0.1 192.168.0.123 http://www.google.com en-US uid=deadbeef;%20pageload_id=3; http://nds1.nds.nokia.com/uaprof/N6230ir200.xml curl/7.19.7%20(x86_64-pc-linux-gnu)%20libcurl/7.19.7%20OpenSSL/0.9.8k%20zlib/1.2.3.3%20libidn/1.15 i-00000239.pmtpa.wmflabs 8
Note that fields (like User-Agent) are URL encoded, whereas the query params are not.
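To make the field layout concrete, here's a minimal Python sketch that splits such a tab-separated line and reverses the URL encoding. The field names are my own labels for illustration, not an agreed schema, and the sample line is abbreviated from the one above:

```python
from urllib.parse import unquote

# Hypothetical field names matching the varnishncsa format string above.
FIELDS = [
    "path", "query", "host", "timestamp", "client_ip", "x_forwarded_for",
    "referer", "accept_language", "cookie", "x_wap_profile", "user_agent",
    "server_hostname", "sequence",
]

def parse_event_line(line):
    """Split a tab-separated log line and URL-decode each field."""
    values = line.rstrip("\n").split("\t")
    return {name: unquote(value) for name, value in zip(FIELDS, values)}

sample = "\t".join([
    "/event/e3", "?lol=dongs", "localhost:6081", "2012-11-05T21:24:15",
    "127.0.0.1", "192.168.0.123", "http://www.google.com", "en-US",
    "uid=deadbeef;%20pageload_id=3;", "-",
    "curl/7.19.7%20(x86_64-pc-linux-gnu)", "i-00000239.pmtpa.wmflabs", "8",
])
event = parse_event_line(sample)
print(event["cookie"])    # uid=deadbeef; pageload_id=3;
print(event["sequence"])  # 8
```

The URL-encoded spaces (`%20`) in Cookie and User-Agent come back out cleanly, which is exactly why the encoding matters for downstream consumers.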
Ori and others, thoughts thus far? If we are fine with this, Asher can move forward with making this stream available.
Also, I think we are still waiting on this RT ticket, right?
https://rt.wikimedia.org/Ticket/Display.html?id=3760
-Ao
On Oct 31, 2012, at 4:58 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
> Hi guys!
>
> I wanted to write an email to summarize some of the chats I just had with a few of you. We were all talking about how to set up a single /event data stream from varnish that we could all share. Here's what we got:
>
> Asher will set up varnish to match for "^/event/.*". Any request that matches this will return a 204 response. A varnishncsa instance will then log this event to a shared stream.
>
> The URL will be expected to contain a product_code, as in /event/<product_code>. Consumers of this stream can filter out their relevant events by matching against their product code. The URL and query params will be the first fields in each generated event, to allow for easy filtering. The rest of the log line will contain useful request data (client IPs, hostnames, seq numbers, etc.). We're still working out the exact log format, but it will contain all of the data that E3 needs, plus more that other consumers will find useful. Here's a preliminary list of fields:
>
> * URL (not including requested hostname. e.g. /event/<product_code>/ )
> * Query params
> * Timestamp
> * Client IP (aka remote host)
> * X-Forwarded-For
> * Referer
> * Server Hostname
> * Sequence number
> * Request service time in ms
> * Accept-Language (?)
> * Cookie (?)
> * User-Agent
>
> Obviously, this format still needs some work. We'll talk more about this tomorrow, so if you've got thoughts let us know.
>
> Thanks to all for chatting with me and working this out today! Asher, I will get you a varnishncsa format string soon.
>
> -AO
>
>
>
>
> P.S. Apologies if this email is rambley, I did not proofread it. Ok byeeeeee I gotta go move a piano!
>
>
>
Hey Erik!
I've added a line to the Metrics Meeting Agenda. If you've still got a slot for us, Evan Rosen and I will give a quick demo of Kraken. I'll give a quick overview of how it can be accessed and what it can currently be used for, and Evan will show off some things he is already doing with it.
Thanks! Laters!
-Andrew Otto
On Nov 27, 2012, at 10:43 PM, Erik Moeller <erik(a)wikimedia.org> wrote:
> Hi folks,
>
> if anyone would like to give a 3 minute (with 2 minute Q&A) or 5
> minute (without Q&A) update on a project you're working on in
> engineering, please add it to the schedule here:
>
> https://meta.wikimedia.org/wiki/Metics_and_activities_meetings/2012-12-06
>
> [Reminder - these meetings are publicly streamed and documented forever.]
>
> Presentation guidelines:
> https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings#Guidelines_…
>
> We've got priority updates due for Mobile and Visual Editor, so those
> will definitely be on the agenda, but other updates are welcome as
> well.
>
> Thanks much,
> Erik
>
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
>
> _______________________________________________
> Engineering mailing list
> Engineering(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/engineering
We've recently begun trialing a few frontend performance monitoring
services - Keynote, Gomez, trying to get the most out of Watchmouse. They
have their individual pros and cons and when they report sporadic issues,
it can be difficult to correlate to actual user experiences (how many users
were affected, where, and to what extent?). The dearth of data around
end-user page load times (and things like domComplete) is a major blind
spot.
Now that /event messages are flowing from bits to both kraken and vanadium,
I think an initial in house system to analyze page load times as measured
by actual users could be rapidly prototyped, and trump the above trials.
This may already be an eventual deliverable for kraken, but given the drive
behind the current trials, why wait?
The client side would be simple js - for n% of page views from a supported
browser (IE >= 9, Chrome >= 6, FF >= 6, Android >= 4.0), fire off an event
request containing everything relevant from the window.performance.timing
object (https://developer.mozilla.org/en-US/docs/Navigation_timing).
On the backend, perhaps some frequent, periodic processing around geoip
lookups and ISP (or other network path) determination before going into a
data store from which we pull structured data for pretty numbers and
pictures. The end result should be able to help identify everything from
js/dom performance issues after a release, to who we should peer with and
where we should provision our next edge cache center.
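As a sketch of the backend side: given the raw millisecond epochs from window.performance.timing (the field names below are the real Navigation Timing ones; the event payload shape is hypothetical), deriving a few latency metrics is straightforward:

```python
# Derive latency metrics from a Navigation Timing payload. The timing
# field names (navigationStart, responseStart, domComplete, loadEventEnd)
# come from the Navigation Timing spec; the payload itself is made up.
def derive_timing_metrics(timing):
    nav_start = timing["navigationStart"]
    return {
        "ttfb_ms": timing["responseStart"] - nav_start,
        "dom_complete_ms": timing["domComplete"] - nav_start,
        "full_load_ms": timing["loadEventEnd"] - nav_start,
    }

sample = {
    "navigationStart": 1354000000000,
    "responseStart":   1354000000320,
    "domComplete":     1354000001150,
    "loadEventEnd":    1354000001200,
}
print(derive_timing_metrics(sample))
# {'ttfb_ms': 320, 'dom_complete_ms': 1150, 'full_load_ms': 1200}
```

Aggregating these per geoip region / ISP bucket would give the "pretty numbers and pictures" described above.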
My main questions right now:
- Would vanadium or kraken be better suited for building this sooner rather
than later (within a few weeks)?
- Would anyone like to help? (David, your guidance around coding the
frontend visualization would be highly valued even if you don't have a day
or two to personally throw at it)
Asher
Hi err'body.
In the fullness of time, and with considerable help from my friends, the roadmap page has been updated with the steps taken in our forced-march on Progress. As I am tremendously fallible, you are strongly encouraged to review and correct the contents therein via some sort of mysterious collaborative editing process:
https://www.mediawiki.org/wiki/Analytics/Roadmap
Cheers,
--
David Schoonover
dsc(a)wikimedia.org
Hi guys!
So, we've had a Todo on our list for a while now to make a couple of tweaks to the web access log format coming from squid, varnish and nginx.
1. Append Accept-Language and X-Carrier headers.
This brings the field count from 14 up to 16. udp-filter has already been modified to handle this. I've already got a change in for this: https://gerrit.wikimedia.org/r/#/c/12188/
2. Change field separator from space to tab.
User-Agent and Content-Type headers (and possibly others) sometimes contain spaces. Some sources (e.g. varnish) properly URL encode the fields before they are sent out, but others don't. Using tab as the field separator in web access logs will avoid many of these issues.
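A quick Python illustration of why the tab separator matters (the field values here are made up): a raw, non-URL-encoded User-Agent breaks space-delimited parsing but survives tab delimiting.

```python
# A User-Agent with embedded spaces, as some non-varnish sources emit it.
ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1"

space_line = " ".join(["/event/e3", "-", ua, "8"])
tab_line = "\t".join(["/event/e3", "-", ua, "8"])

print(len(space_line.split(" ")))  # 8 -- the UA's spaces inflate the field count
print(len(tab_line.split("\t")))   # 4 -- one field per column, as intended
```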
We have wanted to do this for a while, but haven't because we were worried about breaking Erik Zachte's wikistats scripts. Stefan Petrea is now working with Diederik on wikistats (and other things), and has dealt with this issue. So! We are ready! We'd like to make this change before we start real consumption of the web access logs into the Kraken cluster, which hopefully will be relatively soon.
Would these changes cause Fundraising any foreseeable problems? Can we go ahead and work with ops to push this through?
Thanks!
-Andrew Otto
Woooweeeee!
Now that we've got all of our servers up and running, let's take a minute to assign them all their official roles.
Summary of what we've got:
analytics1001 - analytics1010:
Cisco UCS C250 M1
192G RAM
8 x 300G = 2.4T
24 core X5650 @ 2.67 GHz
analytics1011 - analytics1022:
Dell Poweredge R720
48G RAM
12 * 2T = 24T
12 core E5-2620 @ 2.00GHz
analytics1023 - analytics1027:
Dell PowerEdge R310
8G RAM
2 * 1G = 2G
4 core X3430 @ 2.40GHz
an11 - an22 are easy. They should be Hadoop Worker (HDFS) nodes, since they have so much storage space!
an23 - an27 are relative weaklings and should not be used for compute or data needs. I've currently got Zookeepers running on an23, an24 and an25 (we need 3 for a quorum), and I think we should keep it that way.
The remaining assignments require more discussion. Our NameNode is currently on an01, but as Diederik pointed out last week, this is a bit of a waste of a node, since it is so beefy. I'd like to suggest that we use an26 and an27 for NameNode and backup NameNode.
My rudimentary Snappy compression test reduces web access log files to about 33% of their original size. According to the unsampled file we saved back in August, uncompressed web request logs generate about 100 GB / hour. Rounded (way) up, that's 20 TB / week.
If we snappy compress and do more rounding up, that's 7 TB / week*.
an11, an12: Kafka Brokers
I had wanted to use all of the R720s as hadoop workers, but we'd like to be able to store a week's worth of Kafka log buffer. There isn't enough storage space on the other machines to do this, so I think we should use two of these as Kafka brokers. If we RAID 1 the buffer drives (which we probably should), that makes the Kafka buffer 20 TB (2 nodes * 10 drives * 2 TB / 2 for RAID), which should be enough to cover us for a while.
an01 - an10: Storm/ETL
These are beefy (tons of RAM, 24 core), so these will be good for hefty realtime stuff. We could also take a few of these and use them as Hadoop workers, but since they don't really have that much space to add to the HDFS pool, I'm not sure if it is worth it.
an23 - an25: ZooKeepers
As I said above, let's keep the ZKs here.
an26, an27: Hadoop Masters
Move the NameNodes (primary and secondary/failover) here.
an13 - an22: Hadoop Workers
We need to use the first 2 drives in RAID 1 for the OS, so really we only have 10 drives for HDFS space. Still, that gives us 200 TB. With an HDFS replication factor of 3, that's 67 TB HDFS.
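A back-of-the-envelope check of the capacity numbers above (all inputs taken from this email; the rounding matches the prose):

```python
# HDFS capacity for the proposed an13 - an22 worker pool.
worker_nodes = 10     # an13 - an22
drives_per_node = 10  # 12 drives minus the 2-drive RAID 1 OS pair
drive_tb = 2
replication = 3       # HDFS replication factor

raw_tb = worker_nodes * drives_per_node * drive_tb
usable_tb = raw_tb / replication

print(raw_tb)            # 200
print(round(usable_tb))  # 67
```

At the ~7 TB / week snappy-compressed ingest rate estimated above, that 67 TB would hold roughly two months of web request logs before anything else lands in the pool.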
Thoughts? Since we'll def want to use an13-an22 as workers, I'll start spawning those up and adding them to the cluster today. Yeehaw!
-Ao
*For the sake of simplicity, I'm not counting other input sources (event log, sqoop, etc.), and instead hoping that rounding up as much as I did will cover these needs.
Howdy Andre,
The ClickTracking extension has a bit of a complicated history -- afaik there are three or four forks of it, and it's sorta maintained by several different teams. I don't personally know that much about it, so I've cc'd the Analytics list and some of the blokes I believe know more about it.
Cheers!
--
David Schoonover
dsc(a)wikimedia.org
On Friday, 12 October 2012 at 10:10 AM, Trevor Parscal wrote:
> Nimish doesn't work for WMF anymore, and I don't know where his @wikimedia.org email messages end up.
>
> This is a dependency for ClickTracking (and nothing else afaik), and should probably be merged together with it (in both software and bugs).
>
> Generally, it's to do with stats, so David Schoonover (cc'd) is a better person to ask about this.
>
> - Trevor
>
> On Fri, Oct 12, 2012 at 6:37 AM, Andre Klapper <aklapper(a)wikimedia.org> wrote:
> > Hi,
> >
> > contacting you as you are listed as maintainers on
> > https://www.mediawiki.org/wiki/Extension:UserDailyContribs
> >
> > According to
> > https://www.mediawiki.org/wiki/Category:Extensions_used_on_Wikimedia
> > this extension is deployed on Wikimedia, but I cannot find a good place
> > where to report bugs.
> >
> > Would it be useful if I created a dedicated component for this extension
> > in Bugzilla under the "MediaWiki extensions" product, and set you as the
> > default assignee for bug reports filed under it?
> >
> > Currently many reports get filed in the "[other]" component of the
> > "MediaWiki extensions" product in Bugzilla where they are hard to find
> > for maintainers.
> > A dedicated component would make it easier to report and get aware of
> > issues for this specific extension.
> >
> > Thanks,
> > andre
> > --
> > Andre Klapper | Wikimedia Bugwrangler
> > http://blogs.gnome.org/aklapper/
> >
> >
>
https://gerrit.wikimedia.org/r/31603
This patchset, which has not yet been merged into MediaWiki, adds an
edit history graph to InfoAction, per
https://bugzilla.wikimedia.org/show_bug.cgi?id=41329 .
"Added the jQuery extension jqPlot to make graphs. Then added a section
for analytics to InfoAction, in which is a graph of monthly edits."
I thought people interested in analytics would want to know about this.
:-) Thanks for the patchset, Tyler Romeo.
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
https://github.com/embr/userstats
"We're pleased to release version 0.1.0 of the userstats Python library
and command-line tool for computing user-centric metrics on Wikipedia
users. The goal of the software is to make it easy for project owners to
track the contributions and status of users involved in their project.
It is also intended to be easily extensible so that custom metrics can
be added using only a few lines of Python code."
From the "Global Learning and Grantmaking" section of the September WMF
report:
https://blog.wikimedia.org/2012/10/31/wikimedia-foundation-report-september…
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation