Wooo hooooo! Thanks Asher!
I now have Kafka producers tailing this stream and producing logs into Kafka. I've still got the same ol' hacky Kafka Hadoop consumer iterating over event topics and writing logs hourly into Hadoop. I plan on making this less hacky soon (right now the Kafka Hadoop consumer is just running in a screen).
So, currently, most /event.gif requests will be imported hourly into Hadoop at /user/otto/event/logs/event-unknown. The 'event-unknown' topic is for events without a known product_code. If a request goes to /event.gif/<product_code> AND <product_code> is one of the known product codes listed at http://www.mediawiki.org/wiki/Analytics/Product_Codes, then the event will be consumed via the topic 'event-<product_code>' and written into the corresponding Hadoop directory. (Again, I'll put these logs in a more appropriate location once I get rid of the Kafka Hadoop consumer.)
Ori, wouldn't you just LOVE to get rid of that vestigial '.gif' on the end of the URL? Why not just /event? Ehhhhh? Asher is willing to do this, but we need to coordinate to make sure that doing so wouldn't break your stuff. Your parsing probably wouldn't change (since I don't think you are dealing with topics or product codes right now), but I betcha whatever extensions you have hitting the /event.gif URL will need to change. Whatcha think? There's no technical reason why we should do this; it's purely that it bugs me to request a .gif file that is not a gif.
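For anyone following along, here is a rough Python sketch of the topic routing described above. It is not the actual producer code; KNOWN_PRODUCT_CODES and topic_for are made-up names, and the two codes in the set are placeholders for whatever is listed on the Product_Codes wiki page.

```python
from urllib.parse import urlsplit

# Placeholder values only; the real list lives on the Product_Codes wiki page.
KNOWN_PRODUCT_CODES = {"e3", "mobile"}


def topic_for(url: str) -> str:
    """Pick the Kafka topic for a request URL (hypothetical helper)."""
    path = urlsplit(url).path
    prefix = "/event.gif/"
    if path.startswith(prefix):
        # Take the first path segment after /event.gif/ as the product code.
        code = path[len(prefix):].split("/", 1)[0]
        if code in KNOWN_PRODUCT_CODES:
            return f"event-{code}"
    # Bare /event.gif requests and unrecognized codes share one topic.
    return "event-unknown"
```

With this sketch, topic_for("/event.gif/e3?foo=bar") gives 'event-e3', while a bare /event.gif request or an unlisted code falls through to 'event-unknown'.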
On Nov 28, 2012, at 9:49 PM, Ori Livneh ori@wikimedia.org wrote:
Hey Asher,
Separate works for me -- I don't need all those fields. I do need client IP, though. Submitted a patch here: https://gerrit.wikimedia.org/r/#/c/35848/
Thanks, O
-- Ori Livneh ori@wikimedia.org
On Wednesday, November 28, 2012 at 3:57 PM, Asher Feldman wrote:
I've decided to send independent streams of event log data from the bits servers, at least for the time being. One will go directly to the analytics cluster via their public-IP'd frontend server, from whence it may be multicast at analytics' discretion; the other will continue as-is (direct to vanadium, or via oxygen to vanadium if in esams).
I'm going to configure the analytics stream with the log format defined by Andrew below. Ori, this allows the log format to vanadium to remain unchanged if you'd like. Let me know if you'd like the same format as analytics, or to stick with what's already in place.
On Mon, Nov 5, 2012 at 1:52 PM, Andrew Otto <otto@wikimedia.org> wrote:
(Moving this thread to Analytics list.)
I just finished a discussion with David and Diederik about the log format for this thing. Here's what we got right now:
- Request path
- Query params
- HTTP host (aka request hostname)
- Timestamp
- Client IP (aka remote address/host)
- X-Forwarded-For
- Referer
- Accept-Language
- Cookie
- X-WAP-Profile
- User-Agent
- Server-Hostname
- Sequence Number
The corresponding varnishncsa log format string is:
'%U %q %{Host}i %t %h %{X-Forwarded-For}i %{Referer}i %{Accept-Language}i %{Cookie}i %{X-WAP-Profile}i %{User-agent}i %l %n'
(Note the literal tabs in that string. varnishncsa doesn't translate "\t", afaict.)
I've tested this on my log1 labs instance via curl + varnish + varnishncsa.
This curl command:
curl --cookie 'uid=deadbeef; pageload_id=3;' -H "x-wap-profile: http://nds1.nds.nokia.com/uaprof/N6230ir200.xml" -H "X-Forwarded-For: 192.168.0.123" -H "Referer: http://www.google.com" -H "Accept-Language: en-US" "http://localhost:6081/event/e3?lol=dongs&foo(bar/baz)=*&this=<that/>"
Results in this log line:
/event/e3 ?lol=dongs&foo(bar/baz)=*&this=<that/> localhost:6081 2012-11-05T21:24:15 127.0.0.1 192.168.0.123 http://www.google.com en-US uid=deadbeef;%20pageload_id=3; http://nds1.nds.nokia.com/uaprof/N6230ir200.xml curl/7.19.7%20(x86_64-pc-linux-gnu)%20libcurl/7.19.7%20OpenSSL/0.9.8k%20zlib/1.2.3.3%20libidn/1.15 i-00000239.pmtpa.wmflabs 8
Note that header fields (like User-Agent) are URL-encoded, whereas the query params are not.
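As a rough illustration of how a consumer might read one of these lines, here is a Python sketch. It assumes the fields are tab-separated (per the note about literal tabs in the format string) and arrive in the order listed above. The field names and parse_event_line are made up for the example, and decoding only Cookie and User-Agent is a guess based on the sample line.

```python
from urllib.parse import unquote

# Assumed field order, following the list above; the names are illustrative.
FIELDS = [
    "path", "query", "host", "timestamp", "client_ip", "x_forwarded_for",
    "referer", "accept_language", "cookie", "x_wap_profile", "user_agent",
    "server_hostname", "sequence",
]

# Fields that show URL encoding in the sample line (an assumption).
ENCODED_FIELDS = ("cookie", "user_agent")


def parse_event_line(line: str) -> dict:
    """Split one tab-separated log line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in ENCODED_FIELDS:
        record[name] = unquote(record[name])
    record["sequence"] = int(record["sequence"])
    # The query string is left untouched, since it is logged unencoded.
    return record
```

The length check is there so that a line with a stray tab or a missing field fails loudly instead of silently shifting every column.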
Ori and others, thoughts thus far? If we are fine with this, Asher can move forward with making this stream available.
Also, I think we are still waiting on this RT ticket, right? https://rt.wikimedia.org/Ticket/Display.html?id=3760
-Ao
On Oct 31, 2012, at 4:58 PM, Andrew Otto <otto@wikimedia.org> wrote:
Hi guys!
I wanted to write an email to summarize some of the chats I just had with a few of you. We were all talking about how to set up a single /event data stream from varnish that we could all share. Here's what we got:
Asher will set up varnish to match for "^/event/.*". Any request that matches this will return a 204 response. A varnishncsa instance will then log this event to a shared stream.
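To make the matching rule concrete, here is a tiny Python sketch of which paths the "^/event/.*" pattern would catch. This is not varnish config, just the same regex run through Python's re module; status_for is a made-up helper that returns 204 for a match and None otherwise.

```python
import re

# Same pattern as quoted above, compiled with Python's regex engine.
EVENT_PATTERN = re.compile(r"^/event/.*")


def status_for(path: str):
    """Return 204 if the path would hit the event rule, else None."""
    return 204 if EVENT_PATTERN.match(path) else None
```

Worth noticing: the trailing slash is part of the pattern, so /event/e3 and /event/ match, but a bare /event or /event.gif does not.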
The URL will be expected to contain a product_code, as in /event/<product_code>. Consumers of this stream can pick out their relevant events by matching against their product code. The URL and query params will be the first fields in each generated event, to allow for easy filtering. The rest of the log line will contain useful request data (client IPs, hostnames, seq numbers, etc.). We're still working out the exact log format, but it will contain all of the data that E3 needs, plus more that other consumers will find useful. Here's a preliminary list of fields:
- URL (not including requested hostname. e.g. /event/<product_code>/ )
- Query params
- Timestamp
- Client IP (aka remote host)
- X-Forwarded-For
- Referer
- Server Hostname
- Sequence number
- Request service time in ms
- Accept-Language (?)
- Cookie (?)
- User-Agent
Obviously, this format still needs some work. We'll talk more about this tomorrow, so if you've got thoughts let us know.
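Here is a rough Python sketch of the consumer-side filtering by product code mentioned above. Since the format isn't settled, it assumes only that the URL path is the first field and that fields are tab-separated; events_for is a made-up name.

```python
from typing import Iterable, Iterator


def events_for(product_code: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines whose URL belongs to product_code."""
    prefix = f"/event/{product_code}"
    for line in lines:
        # The URL path is assumed to be the first tab-separated field.
        path = line.split("\t", 1)[0]
        # Require an exact match or a following slash, so that "e3"
        # does not also pick up a hypothetical "e30" product code.
        if path == prefix or path.startswith(prefix + "/"):
            yield line
```

Because the URL comes first, a consumer only has to look at the start of each line and can skip everything else without parsing the rest.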
Thanks to all for chatting with me and working this out today! Asher, I will get you a varnishncsa format string soon.
-AO
P.S. Apologies if this email is rambley, I did not proofread it. Ok byeeeeee I gotta go move a piano!