The main purpose of this message is to add a log of this thread (below) to a mailing list for archiving.
A secondary purpose is to outline some of what has been discussed. Here is what I've gathered:
In the short term, packets to Vanadium will continue through the flow agreed on in the request ticket:
(agreed): <client side> -(clicktracking)-> api.php -(udp2log)-> vanadium
(alternate): <client side> -(clicktracking)-> api.php -(0mq?)-> vanadium
[NB: (alternate) would break existing users of clicktracking, of which there should be none, but that can be addressed in a different thread].
Ori is moving for this change:
(intermediate): <client side> -(clicktracking/E3 extension)-> bits.wikimedia.org -(0mq)-> vanadium
[NB: this likewise breaks existing users of clicktracking]
Bikeshedding:
Because cross-dc redundant tunneling is not in place, vanadium is not reachable by everything. Fixing this may take 1-2 months, or longer. (intermediate) is thus modified to target a specific bits host in eqiad rather than all of bits. We can revisit moving the varnish rule up to cover all of bits at a later date (as far as I'm concerned, I'm happy with deferring this until Kraken needs its pixelserver, but whatever).
Mark has also requested that this be properly packaged and puppetized. Ori will be using labs as a testbed for this setup, a la the way Patrick is handling a similar request for Wikipedia Zero right now.
Asher has requested that the pub/sub model proposed by Ori be reversed. This seems reasonable.
0mq allows for different queue configurations than pub-sub. There is some consideration of using UDP multicast instead. This should probably be revisited when Kraken goes online.
Current actions (so CT has a map):
At some point, nothing gets done without Mark since he wrote the puppet manifests on varnish. However, it's reasonable to get as much done as we can with Mark's approval before he actually gets hands-on with this config. Mark, if you are not okay with any of this, tell me. No point in continuing if it isn't going to happen. ;-)
I guess technically we could punt on the whole thing for 1-2 months. However, since at some point something like this needs to be tested on varnish on the cluster, we should probably take the opportunity to get this running on a single varnish machine while we have an engineer willing to do the lifting, packaging and puppetization. Analytics is way too busy on other parts to worry about collection.
If we punt for longer than 1-2 months, then I guess ori can't be held accountable when he takes down the cluster again with too many calls to api.php. :-D
If we get it on a single instance now, I'm inclined to let ops decide when/if they want to move the config to cover all of bits (and just inform Ori so he can update any extensions to point to the edges). The only hard deadline for rolling out to all of bits, including esams, would be when Kraken goes online and needs the config to point the pub-sub to publish to their scribe servers instead of vanadium.
Other than the above, I consider the whole thing settled. Last I checked, none of you report to me, so I'm not involved at all. :-P
P.S. Today is SysAdmin Appreciation day. There are three bottles of whiskey (probably already added to Ryan Lane's stash) and 2 dozen cookies/baked goods on CT's desk. I bought them for ops to dispose of how they see fit.
Take care,
terry
Hey Mark, Asher,
For event-tracking, could we add a VCL hook to bits.wikimedia.org that rewrites a specific URL to hit vanadium.eqiad.wmnet:8000?
I have a simple HTTP server there that parses and stores query strings for all incoming requests. I mean to use it as a way of capturing events from JavaScript code (for AB testing features, for example.) It responds to all requests with HTTP status code 204 ("No Content") and an empty body. But vanadium isn't public-facing, so I need to expose a URL.
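For the curious, the server amounts to little more than this (an illustrative sketch, not the actual code -- the log path is invented, the port matches the one above):

    # Minimal collector sketch: record query strings, answer 204.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    class BeaconHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            event = parse_qs(urlparse(self.path).query)
            with open('/srv/events.log', 'a') as f:  # invented path
                f.write(repr(event) + '\n')
            self.send_response(204)  # "No Content", empty body
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # suppress per-request stderr noise

    HTTPServer(('0.0.0.0', 8000), BeaconHandler).serve_forever()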
Something like this should work, assuming vanadium is reachable from bits: http://p.defau.lt/?RhrkVPxrdhv0vPKvIwaRNQ
Very crude benchmarking (see http://p.defau.lt/?VRssDYUMq1djVFHzlyN_Yw) clocks the server at ~1,600 reqs/sec, which would add up to ~140 mil. / day. My plan is to be extremely conservative and limit ourselves to 200k reqs / day and ramp up very gradually iff it's stable enough. Although 200k sounds tiny, it can comfortably accommodate some interesting metrics -- enwiki averages 140k edits / day, for example.
Adding the URL to Varnish would complete this request: https://rt.wikimedia.org/Ticket/Display.html?id=3152
Let me know what you think.
Thanks,
Ori
--
Ori Livneh
ori@wikimedia.org
Great latency results for your collector! I don't think it matters much at the traffic rate you're talking about, but if use of this will be seriously ramped up in the future, I think we'd want to consider a different approach or a public endpoint other than bits. bits serves ~40k requests/sec via 4 servers in the US and 2 in Europe, with enough spare capacity for a couple of those hosts to die. The >99.6% cache hit rate is what makes that small server footprint possible, and a shift in the number of backend http requests varnish has to make could impact it. Additionally, bits servers in Europe can't hit private servers in eqiad and instead use the public eqiad bits IP as their backend, so an EU request would take a couple hundred ms due to network latency, hitting varnish in both the EU and the US.
I'm adding Patrick because we've discussed sending udp packets for a mobile analytics project directly from varnish via inline C. If progress is made there, perhaps your server could be modified to receive udp messages instead of http requests? It would be friendlier to EU users, since varnish could respond with a 204 immediately while the forwarding of the udp packet to eqiad happens behind the scenes.
Much obliged for the thoughtful response!
UDP might not be the best option because data integrity is important. IIRC most implementations will fragment datagrams greater than 1472 bytes (the 1500-byte Ethernet MTU minus 20 bytes of IP header and 8 bytes of UDP header) and will silently drop a datagram if any fragment is lost or delayed, which could easily skew our data if we're not super careful. Order and reliability count, and UDP is hard to reason about.
varnishlog might be a better option if you're willing to allow vanadium to maintain a persistent connection to the varnish caches (over SSH perhaps, with varnishlog instead of a login shell). Alternatively, the varnish caches could pipe varnishlog into some lightweight tool that sends things to vanadium. (Maybe this is the use-case for 0MQ that Terry has been itching for.) If I write it, would you be able to help with deployment / testing? (I think we could keep it pretty simple.)
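To give a feel for how lightweight that tool could be, here is a rough sketch (assumes pyzmq; the port is a placeholder):

    # Read filtered varnishlog output from stdin; publish each line on a
    # ZeroMQ PUB socket. If nothing is subscribed, messages are silently
    # dropped rather than queued -- exactly the fire-and-forget we want.
    import sys
    import zmq

    publisher = zmq.Context().socket(zmq.PUB)
    publisher.bind('tcp://*:8422')  # placeholder port

    for line in sys.stdin:
        publisher.send_string(line.rstrip('\n'))

It would just sit at the end of a pipe, fed by varnishlog with a suitable URL filter.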
--
Ori Livneh
ori@wikimedia.org
Hi Ori,
Besides Asher's response, which I fully agree with, let me add the following:
First of all, when we gave you that server vanadium a few weeks ago, you argued for it by saying that you wanted to reduce coupling with, dependencies on, and impact on production as much as possible. But then you didn't mention any of this, and your proposed change, using bits, does quite the opposite. Let's not do that.
Solutions around varnishlog and ssh/connections sound clunky. Sending udp packets from Varnish would be fine I think, but you don't want that.
Why don't we see if we can integrate your requirements with the plans the analytics team has with their Hadoop cluster? That would avoid duplication of effort as well.
Hi Mark,
Thanks for your note. The design (capturing event data from URLs) is the plan for Kraken, and my work on the public-facing part of the stack is in collaboration with the analytics team, whose efforts are currently invested in storage and computation. I'm looping in David Schoonover, with whom I've been working to coordinate efforts. Once data is piping into vanadium, I'm going to drop server-side work entirely and focus on growing a client-side event tracking library, and that's going to integrate directly with Kraken.
To state the obvious: any analytics solution is going to need a channel for incoming data if we hope to do anything more interesting than searching for patterns in /dev/random. There needs to be some endpoint that client-side JavaScript code can hit, or we'll have no way of tracking client-side state, which is increasingly AJAX-driven and therefore not easily gleaned by looking at bare request logs.
Serializing state into URL params (as opposed to tracking data by issuing POST requests with JSON body, for example) is how we get a system designed to crunch page views (Kraken) to fulfill UX/UI testing requirements. So there is no duplicated effort here. A client-side library that transparently captures and transmits state in AJAX request URLs is going to help Kraken along.
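(Concretely -- with invented parameter names -- an event might be serialized as a request for something like /beacon.gif?event=edit&wiki=enwiki&bucket=b&rev=42, which varnish can answer and log without any PHP involvement.)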
I don't think the change list on Gerrit is an inelegant solution. The coupling problem with the click tracking extension was that it was using MediaWiki to parse event data from incoming requests and to generate successful responses, which didn't scale. My proposed solution has Varnish doing nothing more than responding to /beacon.gif with an empty response. I can't think of a way of implementing a tracking endpoint that would scale better or that would be more lightweight.
Transferring tracking data over a persistent SSH connection sucks, I agree, and I didn't go that route. I chose to do something very close to UDP, which is to pipe tracking request URLs from varnishlog into an unbuffered ZeroMQ publisher socket. The implementation does not require anything to be listening on the other end -- if the client on Vanadium dies, data is dropped on the floor, and the connection would be reestablished transparently once it is back up. I don't think this is going to perform worse than UDP, but I am not particular about this point -- UDP would be fine as well.
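The vanadium end is equally small; a sketch (hostnames and port invented, assumes pyzmq):

    # SUB socket connecting out to each varnish publisher. If this
    # process dies, the publishers just drop messages; on restart,
    # ZeroMQ re-establishes the connections transparently.
    import zmq

    VARNISHES = ['cp1041.eqiad.wmnet', 'cp1042.eqiad.wmnet']  # invented

    subscriber = zmq.Context().socket(zmq.SUB)
    subscriber.setsockopt_string(zmq.SUBSCRIBE, '')  # no filtering
    for host in VARNISHES:
        subscriber.connect('tcp://%s:8422' % host)

    while True:
        line = subscriber.recv_string()
        # ...parse the query string and store the event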
Asher was going to test what impact running varnishlog with a URL pattern will have on load. If it's minimal, would this be OK?
Thanks,
--
Ori Livneh
ori@wikimedia.org
It's good to know that the work here is in fact to create the public injection point for Kraken and not a duplicate effort. That likely means the total request rate will be much greater than what's driven by editor engagement tests, possibly up to a request per pageview.
I will test varnishlog on a bits server with a regex to capture /beacon requests to get a feel for the resulting resource utilization. It still requires inspection of every bits request from shared memory (significantly more data per request than what goes into an access log) to pick out a few, so it may not be the most efficient solution.
If varnish can send udp packets for specific requests, there's also the option of having it send one for /beacon requests to something listening on localhost, which could itself use 0mq or another reliable transport to pass messages on to kraken. That would probably address most concerns over udp, while also eliminating out of band processing of every bits request in order to find beacons.
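Such a localhost bridge could be tiny; a hypothetical sketch (ports invented, assumes pyzmq):

    # Receive UDP datagrams from varnish on loopback and republish them
    # on a ZeroMQ PUB socket, so the hop to eqiad rides on TCP. Any loss
    # between varnish and this process is confined to localhost.
    import socket
    import zmq

    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(('127.0.0.1', 8421))  # invented port

    out = zmq.Context().socket(zmq.PUB)
    out.bind('tcp://*:8422')  # invented port

    while True:
        datagram, _ = udp.recvfrom(4096)
        out.send(datagram)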
Yet another option would be to build a new beacon.wikimedia.org endpoint. You could have much greater flexibility over implementation choices if not piggybacking on bits, but with an operational and capital cost that would also delay release.
It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex, and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient, and has no risk of runaway failure cases that might impact varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex, and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient, and has no risk of runaway failure cases that might impact varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
Yeah, this seems reasonable, but:
a) needs to be set up in a clean way (puppet configuration management, packaging of software used), and
b) we need a way to transfer data from esams to the private collector (in eqiad). esams can't talk to it directly.
--
Mark Bergsma <mark@wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex, and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient, and has no risk of runaway failure cases that might impact varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
Can't we use scribe for this, as is already the plan for kraken (as far as I understand it)? That would probably also solve the problem of esams contacting pmtpa/eqiad internal hosts...
--
Mark Bergsma <mark@wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
That's my cue.
So I actually think this is a really elegant solution to the question of "how do you get Varnish (or whoever) to talk to scribe?" ZMQ is fucking fantastic -- super stable, super efficient, and with a lot of care in the little bits. For those not in the know: zmq is a messaging library layered over ordinary sockets (TCP, Unix domain sockets, multicast). It's like Super IPC. In the case where you're using it for plain IPC, it's merely a nice interface with almost zero overhead, but one that also provides some convenient features. One of those, importantly, is that writing to a dangling ZMQ socket doesn't vomit all over syslog with errors -- the bits just quietly end up in /dev/null. (You can configure it to yell, if you really want, iirc.)
In the short-term, I'm not precisely sure what Ori plans on using as the consumer, but it would be great to have our own toolbox of connectors to, say, File, UDP, Scribe, etc. Then we'd have one interface that we could plug anything into. (We could theoretically upgrade our other custom connectors in nginx, etc with something like that, and have one universal backend, but I digress.)
When Kraken comes online, we'd swap out that short-term backend with a Scribe connector. Easy and elegant.
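Sketching the idea (all names invented), the toolbox is just one interface with interchangeable sinks:

    # Any object with a send(msg) method is a connector; swapping the
    # short-term backend for Scribe later is a one-line change.
    class FileSink(object):
        def __init__(self, path):
            self.fh = open(path, 'a')
        def send(self, msg):
            self.fh.write(msg + '\n')

    class ZmqSink(object):
        def __init__(self, endpoint):
            import zmq
            self.sock = zmq.Context().socket(zmq.PUB)
            self.sock.bind(endpoint)
        def send(self, msg):
            self.sock.send_string(msg)

    # Later: a ScribeSink exposing the same send() method.
    def pump(lines, sink):
        for line in lines:
            sink.send(line.rstrip('\n'))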
+1
--
David Schoonover
dsc@wikimedia.org
I think where this stands is that Ori needs to finalize a transport method for moving data off of varnish servers, and it sounds like ZMQ is appropriate and compatible with future kraken plans.
That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp as transports. Mark, do you have a preference / could you provide Ori some guidance?
Update: I packaged this and put it up on a ppa on launchpad. Binaries are available for Ubuntu Precise, which is what I _think_ the Varnish machines are running. To install:
apt-add-repository ppa:ori-livneh/e3
apt-get update
apt-get install zpubsub
It comes replete with a man page: zpubsub(1).
--
Ori Livneh
ori@wikimedia.org
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 in eqiad, 4 in pmtpa, 2 in esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
ZeroMQ pub/sub sockets support multicast over pgm or epgm, but I think that adds a layer of complexity (vs. unicast) that isn't needed or wanted for tracking events from A/B tests with fractional roll-outs.
If you're squeamish about this -- which I understand! -- just remember: all these calls are currently hitting api.php, which entails failed cache lookups on the Squids*, followed by work for the Mediawiki instances, which generate UDP packets, which end up on emery. This setup is capable of knocking out the site, as I found out in June.
* See:
$ curl -is --data "action=clicktracking" http://en.wikipedia.org/w/api.php | grep X-Cache
X-Cache: MISS from cp1004.eqiad.wmnet
X-Cache-Lookup: MISS from cp1004.eqiad.wmnet:3128
X-Cache: MISS from cp1017.eqiad.wmnet
X-Cache-Lookup: MISS from cp1017.eqiad.wmnet:80
--
Ori Livneh
ori@wikimedia.org
On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
Err, no you can't. vanadium is on the eqiad internal network, and has a private address. Since there's no NAT and no tunneling over the Internet, you can't reach esams currently. Sure you didn't test from another host? :)
I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
2 more in esams soon, BTW.
ZeroMQ pub/sub sockets support multicast over pgm or epgm, but I think that adds a layer of complexity (vs. unicast) that isn't needed or wanted for tracking events from A/B tests with fractional roll-outs.
If you're squeamish about this -- which I understand! -- just remember: all these calls are currently hitting api.php, which entails failed cache lookups on the Squids*, followed by work for the Mediawiki instances, which generate UDP packets, which end up on emery. This setup is capable of knocking out the site, as I found out in June.
On Tuesday, July 24, 2012 at 1:06 PM, Asher Feldman wrote:
I think where this stands is that Ori needs to finalize a transport method for moving data off of varnish servers, and it sounds like ZMQ is appropriate and compatible with future kraken plans.
That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp as transports. Mark, do you have a preference / could you provide Ori some guidance?
We're actually working on connecting the internal subnets of pmtpa/eqiad and esams via redundant tunnels. That would allow direct unicast and multicast connectivity with no proxying or other hacks. Some experiments were already done a while back, but it won't be available and reliable until we finish a router migration, which is 1-2 months out. I think that would be the cleanest and nicest solution, but it's an open question whether this can wait for that.
--
Mark Bergsma <mark@wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
I want to pull Gabriel for a couple ticks tomorrow to see if we can get this unstuck a bit. I'm not sure I want to wait 1-2 months with E3 clicktracking stuff going to api.php and risking another outage. Let's see if we can find a solution that is feasible under the current infrastructure and switch to the router solution when that's available.
If someone reminds me tomorrow about this, I'll have Ori bring Gabriel up to speed on what this discussion is about… I might forget because I had a bad case of the insomnias last night.
On Jul 25, 2012, at 4:04 AM, Mark Bergsma wrote:
On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
Err, no you can't. vanadium is on the eqiad internal network, and has a private address. Since there's no NAT and no tunneling over the Internet, you can't reach esams currently. Sure you didn't test from another host? :)
I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
2 more in esams soon, BTW.
I guess we need a standard for when the machine count is high enough that multicast udp becomes better than pub/sub. I don't think 12 (4/dc) is it, though. ;-)
On Tue, Jul 24, 2012 at 10:15 PM, Ori Livneh <ori@wikimedia.org> wrote:
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
The number of varnish servers will change, data centers get failed over, etc. I think you'd want the publishers to establish the connection with vanadium, not the other way around.
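(ZeroMQ makes this easy: the bind/connect direction is independent of the PUB/SUB roles, so the stable host can be the one that binds. A sketch, with an invented port:)

    import zmq

    ctx = zmq.Context()

    # On vanadium -- bind a SUB socket once and never reconfigure it:
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, '')
    sub.bind('tcp://*:8422')

    # On each varnish host -- a PUB socket that connects out:
    pub = ctx.socket(zmq.PUB)
    pub.connect('tcp://vanadium.eqiad.wmnet:8422')

Varnish hosts can then come and go without vanadium's configuration changing.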
terry chay 최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment.”
p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
aim: terrychay