The main purpose is to add a log of this thread (below) to a mailing list for archiving.
A secondary purpose is to outline some of what has been discussed. Here is what I've
gathered:
In the short term, packets to Vanadium will continue through the flow agreed on in the
request ticket:
(old <http://wikitech.wikimedia.org/view/Squid_logging>): client -(clicktracking)-> api.php -(udp2log)-> emery -> pulled down from log files
(agreed): <client side> -(clicktracking)-> api.php -(udp2log)-> vanadium
(alternate): <client side> -(clicktracking)-> api.php -(0mq?)-> vanadium
[NB: (alternate) would break existing users of clicktracking, of which there should be
none, but that can be addressed in a different thread.]
Ori is moving for this change:
(intermediate): <client side> -(clicktracking/E3 extension)-> bits.wikimedia.org -(0mq)-> vanadium
(permanent): <client side> -(anything)-> bits.wikimedia.org -(0mq)-> Scribe -> Kraken
[NB: both break existing users of clicktracking]
[NB: for permanent, c.f.
<http://www.mediawiki.org/wiki/Analytics/Kraken/Pixel_Service>]
Bikeshedding:
Because cross-dc redundant tunneling is not in place, vanadium is not reachable from
everywhere. This may take 1-2 months, or longer. The intermediate plan is thus modified to
replace bits with a specific bits host in eqiad. We can revisit moving the varnish rule up
to cover all of bits at a later date (as far as I'm concerned, I'm happy to defer this
until Kraken needs its pixel service, but whatever).
Mark has also requested that this be properly packaged and puppetized. Ori will be using
labs as a testbed for this setup, much as Patrick is currently handling a similar request
for Wikipedia Zero.
Asher has requested that the pub/sub model proposed by Ori be reversed. This seems
reasonable.
0mq allows for different queue configurations than pub-sub. There is some consideration
of using UDP multicast instead. This should probably be revisited when Kraken goes
online.
Current actions (so CT has a map):
At some point, nothing gets done without Mark, since he wrote the puppet manifests for
varnish. However, it's reasonable to get as much done as possible with Mark's approval
before he actually gets hands-on with this config. Mark, if you are not okay with any of
this stuff, tell me. No point in continuing if it isn't going to happen. ;-)
I guess technically we could punt on the whole thing for 1-2 months. However, since at
some point something like this needs to be tested on varnish on the cluster, we should
probably take the opportunity to get this running on a single varnish machine while we
have an engineer willing to do the lifting, packaging and puppetization. Analytics is way
too busy on other parts to worry about collection.
If we punt for longer than 1-2 months, then I guess ori can't be held accountable
when he takes down the cluster again with too many calls to api.php. :-D
If we get it on a single instance now, I'm inclined to let ops decide when/if they
want to move the config to cover all of bits (and just inform Ori so he can update any
extensions to point to the edges). The only hard deadline on rolling out to all esams
would be when Kraken goes online and needs the config to point the pub-sub to publish to
their scribe servers instead of vanadium.
Other than the above, I consider the whole thing settled. Last I checked, none of you
report to me, so I'm not involved at all. :-P
P.S. Today is SysAdmin Appreciation day. There are three bottles of whiskey (probably
already added to Ryan Lane's stash) and 2 dozen cookies/baked goods on CT's desk.
I bought them for ops to dispose of how they see fit.
http://www.someecards.com/workplace-cards/my-job-is-to-annoy
Take care,
terry
On Jul 17, 2012, at 3:56 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
Hey Mark, Asher,
For event-tracking, could we add a VCL hook to bits.wikimedia.org that rewrites a
specific URL to hit vanadium.eqiad.wmnet:8000?
I have a simple HTTP server there that parses and stores query strings for all incoming
requests. I mean to use it as a way of capturing events from JavaScript code (for AB
testing features, for example.) It responds to all requests with HTTP status code 204
("No Content") and an empty body. But vanadium isn't public-facing, so I
need to expose a URL.
Something like this should work, assuming vanadium is reachable from bits:
http://p.defau.lt/?RhrkVPxrdhv0vPKvIwaRNQ
Very crude benchmarking (see
http://p.defau.lt/?VRssDYUMq1djVFHzlyN_Yw) clocks the server
at ~1,600 reqs/sec, which would add up to ~140 mil. / day. My plan is to be extremely
conservative and limit ourselves to 200k reqs / day and ramp up very gradually iff
it's stable enough. Although 200k sounds tiny, it can comfortably accommodate some
interesting metrics -- enwiki averages 140k edits / day, for example.
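For reference, a collector along the lines Ori describes (parse the query string of every request, record it, answer 204 with an empty body) can be sketched in a few lines of Python. This is a guess at the shape of the server, not the actual code; the in-memory store, names, and port are illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class EventHandler(BaseHTTPRequestHandler):
    """Parse the query string of each request, record it, answer 204."""

    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        self.server.events.append(params)  # stand-in for real storage
        self.send_response(204)  # "No Content": empty body, cheap to serve
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def make_server(port=8000):
    server = HTTPServer(("127.0.0.1", port), EventHandler)
    server.events = []  # hypothetical in-memory store
    return server
```

Any request like /event.gif?action=edit&user=anon would be recorded as a dict of query parameters while the client gets back an empty 204 response.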
Adding the URL to Varnish would complete this request:
https://rt.wikimedia.org/Ticket/Display.html?id=3152
Let me know what you think.
Thanks,
Ori
--
Ori Livneh
ori(a)wikimedia.org
On Jul 17, 2012, at 5:12 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
Great latency results for your collector! I don't
think it matters much at the traffic rate you're talking about, but I think we'd
want to consider a different approach or a public endpoint other than bits if use of this
will be seriously ramped up in the future. bits serves ~40k requests/sec via 4 servers in
the US and 2 in Europe, with enough spare capacity for a couple of those hosts to die.
The >99.6% cache hit rate is important to the small server footprint, and could be
impacted by a shift in the number of backend HTTP requests varnish has to make.
Additionally, bits servers in Europe can't hit private servers in eqiad and instead use
the public eqiad bits IP as their backend. An EU request would take a couple hundred
extra ms due to network latency, hitting varnish in both the EU and the US.
I'm adding Patrick because we've discussed sending udp packets for a mobile
analytics project directly from varnish via inline C. If progress is made there, perhaps
your server could be modified to receive udp messages instead of http requests? It would
be friendlier to EU users, since varnish could respond with a 204 immediately while
whatever forwards the udp packet to eqiad happens behind the scenes.
On Jul 17, 2012, at 10:13 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
Much obliged for the thoughtful response!
UDP might not be the best option because data integrity is important. IIRC most
implementations will fragment datagrams greater than 1472 bytes and will silently drop
datagrams if a fragment is lost or delayed, which could easily skew our data if we're
not super careful. Order and reliability count, and UDP is hard to reason about.
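The 1472-byte figure follows from typical Ethernet framing: a 1500-byte MTU minus the IPv4 and UDP headers leaves that much room for payload before fragmentation kicks in.

```python
MTU = 1500         # typical Ethernet MTU
IPV4_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8     # fixed UDP header size

max_payload = MTU - IPV4_HEADER - UDP_HEADER
assert max_payload == 1472  # datagrams larger than this get fragmented
```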
varnishlog might be a better option if you're willing to allow vanadium to maintain a
persistent connection to the varnish caches (over SSH perhaps, with varnishlog instead of
a login shell). Alternately the varnish caches could pipe varnishlog into some lightweight
tool that sends things to vanadium. (Maybe this is the use-case for 0MQ that Terry has
been itching for.) If I write it, would you be able to help with deployment / testing? (I
think we could keep it pretty simple..)
--
Ori Livneh
ori(a)wikimedia.org
On Jul 19, 2012, at 4:40 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
Hi Ori,
Besides Asher's response, which I fully agree with, let me add the following:
First of all, when we gave you that server vanadium a few weeks ago, you argued for it
by saying that you wanted to reduce the coupling with / dependencies on / impact on
production as much as possible. But then you didn't mention any of this, and your
proposed change, using bits, does quite the opposite. Let's not do that.
Solutions around varnishlog and ssh/connections sound clunky. Sending udp packets from
Varnish would be fine I think, but you don't want that.
Why don't we see if we can integrate your requirements with the plans the analytics
team has with their Hadoop cluster? That would avoid duplication of effort as well.
On Jul 19, 2012, at 10:16 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
Hi Mark,
Thanks for your note. The design (capturing event data from URLs) is the plan for Kraken,
and my work on the public-facing part of the stack is in collaboration with the analytics
team, whose efforts are currently invested in storage and computation. I'm looping in
David Schoonover, with whom I've been working to coordinate efforts. Once data is
piping into vanadium, I'm going to drop server-side work entirely and focus on growing
a client-side event tracking library, and that's going to integrate directly with
Kraken.
To state the obvious: any analytics solution is going to need a channel for incoming data
if we hope to do anything more interesting than searching for patterns in /dev/random.
There needs to be some endpoint that client-side JavaScript code can hit, or we'll have no
way of tracking client-side state, which is increasingly AJAX-driven and therefore not
easily gleaned from bare request logs.
Serializing state into URL params (as opposed to tracking data by issuing POST requests
with JSON body, for example) is how we get a system designed to crunch page views (Kraken)
to fulfill UX/UI testing requirements. So there is no duplicated effort here. A
client-side library that transparently captures and transmits state in AJAX request URLs
is going to help Kraken along.
I don't think the change list on Gerrit is an inelegant solution. The coupling
problem with the click tracking extension was that it was using MediaWiki to parse event
data from incoming requests and to generate successful responses, which didn't scale.
My proposed solution has Varnish doing nothing more than responding to /beacon.gif with an
empty response. I can't think of a way of implementing a tracking endpoint that would
scale better or that would be more lightweight.
Transferring tracking data over a persistent SSH connection sucks, I agree, and I
didn't go that route. I chose to do something very close to UDP, which is to pipe
tracking request URLs from varnishlog into an unbuffered ZeroMQ publisher socket. The
implementation does not require anything to be listening on the other end -- if the client
on Vanadium dies, data is dropped on the floor, and the connection would be reestablished
transparently once it is back up. I don't think this is going to perform worse than
UDP, but I am not particular about this point -- UDP would be fine as well.
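A sketch of that pipe-reader, separating the URL filtering (which is testable on its own) from the ZeroMQ publishing. The varnishlog output format here is an assumption, and the port and endpoint are made up; the PUB socket matches Ori's description in that messages are simply dropped if no subscriber is connected:

```python
import re
import sys

BEACON_RE = re.compile(r"^/event\.gif")  # pattern from the varnishlog filter

def beacon_urls(lines):
    """Pick out beacon request URLs from varnishlog RxURL output.

    Assumes lines of the form (whitespace-separated, URL last):
        '   17 RxURL        c /event.gif?action=edit'
    """
    for line in lines:
        parts = line.split()
        if parts and BEACON_RE.match(parts[-1]):
            yield parts[-1]

def main():
    # Hypothetical publishing side (requires pyzmq; port is an assumption).
    # A PUB socket with no connected subscriber silently drops messages,
    # so a vanadium outage doesn't back up the varnish host.
    import zmq
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:8001")
    for url in beacon_urls(sys.stdin):
        pub.send_string(url)

if __name__ == "__main__":
    main()
```

It would be fed by something like `varnishlog -c -m 'RxURL:"^/event.gif"' -i RxURL`, the command Asher later benchmarks.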
Asher was going to test what impact running varnishlog with a URL pattern will have on
load. If it's minimal, would this be OK?
Thanks,
--
Ori Livneh
ori(a)wikimedia.org
On Jul 19, 2012, at 10:54 AM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
It's good to know that the work here is in fact to
create the public injection point for Kraken and not a duplicate effort. That likely means
the total request rate will be much greater than what's driven by editor engagement
tests, possibly up to a request per pageview.
I will test varnishlog on a bits server with a regex to capture /beacon requests to get a
feel for the resulting resource utilization. It still requires inspection of every bits
request from shared memory (significantly more data per request than what goes into an
access log) to pick out a few, so it may not be the most efficient solution.
If varnish can send udp packets for specific requests, there's also the option of
having it send one for /beacon requests to something listening on localhost, which could
itself use 0mq or another reliable transport to pass messages on to kraken. That would
probably address most concerns over udp, while also eliminating out of band processing of
every bits request in order to find beacons.
Yet another option would be to build a new
beacon.wikimedia.org endpoint. You could have
much greater flexibility over implementation choices if not piggybacking on bits, but with
an operational and capital cost that would also delay release.
On Jul 20, 2012, at 1:38 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
It looks like varnishlog is actually quite efficient
at finding specific requests based on a field regex, and fetching one of the many log
fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i
RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k
reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible,
provided that whatever process reads stdout from varnishlog (or directly accesses varnish
shm) is similarly efficient, and has no risk of runaway failure cases that might impact
varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon
messages from user requests (varnish returns an immediate 204 no matter what), and
decoupling failures of the event reader or vanadium from users and varnish. Mark, what do
you think?
On Jul 20, 2012, at 2:07 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
Thanks a bunch for testing this, Asher!
--
Ori Livneh
ori(a)wikimedia.org
On Jul 23, 2012, at 4:01 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
It looks like varnishlog is actually quite
efficient at finding specific requests based on a field regex, and fetching one of the
many log fields from matching requests. 'varnishlog -c -m
RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits
server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog
process. So this seems feasible, provided that whatever process reads stdout from
varnishlog (or directly accesses varnish shm) is similarly efficient, and has no risk of
runaway failure cases that might impact varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon
messages from user requests (varnish returns an immediate 204 no matter what), and
decoupling failures of the event reader or vanadium from users and varnish. Mark, what do
you think?
Yeah, this seems reasonable, but:
a) needs to be setup in a clean way (puppet configuration management, packaging of
software used), and
b) we need a way to transfer data from esams to the private collector (in eqiad). esams
can't talk to it directly.
--
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
On Jul 23, 2012, at 9:13 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
It looks like varnishlog is actually quite
efficient at finding specific requests based on a field regex, and fetching one of the
many log fields from matching requests. 'varnishlog -c -m
RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits
server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog
process. So this seems feasible, provided that whatever process reads stdout from
varnishlog (or directly accesses varnish shm) is similarly efficient, and has no risk of
runaway failure cases that might impact varnish performance.
This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon
messages from user requests (varnish returns an immediate 204 no matter what), and
decoupling failures of the event reader or vanadium from users and varnish. Mark, what do
you think?
Can't we use scribe for this, as is already the plan for kraken (as far as I
understand it)? That would probably also solve the problem of esams contacting pmtpa/eqiad
internal hosts...
--
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
On Jul 23, 2012, at 9:58 AM, David Schoonover <dschoonover(a)wikimedia.org> wrote:
That's my cue.
So I actually think this is a really elegant solution to the question of "how do you
get Varnish (or whoever) to talk to scribe?" ZMQ is fucking fantastic -- super
stable, super efficient, and with a lot of care in the little bits. For those not in the
know: zmq is a wrapper around Unix domain sockets. It's like Super IPC. In the case
where you're using it for plain IPC, it's merely a nice interface with almost zero
overhead, but also providing some convenient features. One of those, importantly, is that
writing to a dangling ZMQ socket doesn't vomit all over syslog with errors -- the bits
just quietly end up in /dev/null. (You can configure it to yell, if you really want,
iirc.)
In the short-term, I'm not precisely sure what Ori plans on using as the consumer,
but it would be great to have our own toolbox of connectors to, say, File, UDP, Scribe,
etc. Then we'd have one interface that we could plug anything into. (We could
theoretically upgrade our other custom connectors in nginx, etc with something like that,
and have one universal backend, but I digress.)
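A minimal sketch of what that connector toolbox could look like. All names here are illustrative, not an existing API; the NullConnector mirrors the dangling-ZMQ-socket behavior described above (data quietly dropped), and a ScribeConnector would slot in the same way when Kraken comes online:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """One interface for every backend (File, UDP, Scribe, ...)."""

    @abstractmethod
    def send(self, message: str) -> None:
        ...

class FileConnector(Connector):
    """Append each event line to a local file."""

    def __init__(self, path):
        self.path = path

    def send(self, message):
        with open(self.path, "a") as f:
            f.write(message + "\n")

class NullConnector(Connector):
    """Drop everything, like writing to a dangling ZMQ socket."""

    def send(self, message):
        pass

def pipe(lines, connector):
    """Feed a stream of event lines into whichever backend is plugged in."""
    for line in lines:
        connector.send(line)
```

Swapping backends then means changing one constructor call, which is the "easy and elegant" part.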
When Kraken comes online, we'd swap out that short-term backend with a Scribe
connector. Easy and elegant.
+1
--
David Schoonover
dsc(a)wikimedia.org
On Jul 24, 2012, at 1:06 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
I think where this stands is that Ori needs to
finalize a transport method for moving data off of varnish servers, and it sounds like ZMQ
is appropriate and compatible with future kraken plans.
That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports
multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp
as transports. Mark, do you have a preference / could you provide Ori some guidance?
On Jul 24, 2012, at 4:40 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
Update: I packaged this and put it up on a PPA on Launchpad. Binaries are available for
Ubuntu Precise, which is what I _think_ the Varnish machines are running. To install:
apt-add-repository ppa:ori-livneh/e3
apt-get update
apt-get install zpubsub
replete with a man page -- zpubsub(1)
--
Ori Livneh
ori(a)wikimedia.org
On Jul 24, 2012, at 10:15 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
From vanadium (eqiad), I can connect to port 8649 on
cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional
port (bound to a zmq publisher socket that makes the filtered log stream available for
vanadium to subscribe to), that would work.
I'm not sure multicast makes sense because the flow of communication is many-to-one,
not one-to-many. The way I see it, vanadium could persist a connection to each varnish
machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that
if vanadium crashes, the varnishes don't care, and just let the log data drop.
ZeroMQ pub/sub sockets support multicast over pgm or epgm, but I think that adds a layer
of complexity (vs. unicast) that isn't needed or wanted for tracking events from A/B
tests with fractional roll-outs.
If you're squeamish about this -- which I understand! -- just remember: all these
calls are currently hitting api.php, which entails failed cache lookups on the Squids*,
followed by work for the Mediawiki instances, which generate UDP packets, which end up on
emery. This setup is capable of knocking out the site, as I found out in June.
* See:
$ curl -is --data "action=clicktracking"
http://en.wikipedia.org/w/api.php |
grep X-Cache
X-Cache: MISS from cp1004.eqiad.wmnet
X-Cache-Lookup: MISS from cp1004.eqiad.wmnet:3128
X-Cache: MISS from cp1017.eqiad.wmnet
X-Cache-Lookup: MISS from cp1017.eqiad.wmnet:80
--
Ori Livneh
ori(a)wikimedia.org
On Jul 25, 2012, at 4:04 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org,
which I presume is gmond. If we could open an additional port (bound to a zmq publisher
socket that makes the filtered log stream available for vanadium to subscribe to), that
would work.
Err, no you can't. vanadium is on the eqiad internal network, and has a private
address. Since there's no NAT and no tunneling over the Internet, you can't reach
esams currently. Sure you didn't test from another host? :)
I'm not sure multicast makes sense because
the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium
could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams =
10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't
care, and just let the log data drop.
2 more in esams soon, BTW.
ZeroMQ pub/sub sockets support multicast over pgm
or epgm, but I think that adds a layer of complexity (vs. unicast) that isn't needed
or wanted for tracking events from A/B tests with fractional roll-outs.
If you're squeamish about this -- which I understand! -- just remember: all these
calls are currently hitting api.php, which entails failed cache lookups on the Squids*,
followed by work for the Mediawiki instances, which generate UDP packets, which end up on
emery. This setup is capable of knocking out the site, as I found out in June.
On Tuesday, July 24, 2012 at 1:06 PM, Asher
Feldman wrote:
> I think where this stands is that Ori needs to finalize a transport method for moving
data off of varnish servers, and it sounds like ZMQ is appropriate and compatible with
future kraken plans.
>
> That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports
multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp
as transports. Mark, do you have a preference / could you provide Ori some guidance?
We're actually working on connecting the internal subnets of pmtpa/eqiad and esams,
via redundant tunnels. That would allow direct unicast and multicast connectivity with no
proxying or other hacks. Some experiments have already been done a while back, but it
won't be available and reliable until we finish a router migration, which is 1-2
months out. I think that would be the cleanest and nicest solution, but it's a
question whether this can wait for that.
--
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect
Wikimedia Foundation
On Jul 25, 2012, at 5:03 AM, Terry Chay <tchay(a)wikimedia.org> wrote:
I want to pull Gabriel for a couple ticks tomorrow to
see if we can get this unstuck a bit. I'm not sure I want to wait 1-2 months with E3
clicktracking stuff going to api.php and risking another outage. Let's see if we can
find a solution that is feasible under the current infrastructure and switch to the router
solution when that's available.
If someone reminds me tomorrow about this, I'll have Ori bring Gabriel up to speed
on what this discussion is about… I might forget because I had a bad case of the
insomnias last night.
On Jul 25, 2012, at 4:04 AM, Mark Bergsma wrote:
On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org,
which I presume is gmond. If we could open an additional port (bound to a zmq publisher
socket that makes the filtered log stream available for vanadium to subscribe to), that
would work.
Err, no you can't. vanadium is on the eqiad internal network, and has a private
address. Since there's no NAT and no tunneling over the Internet, you can't reach
esams currently. Sure you didn't test from another host? :)
I'm not sure multicast makes sense because
the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium
could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams =
10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't
care, and just let the log data drop.
2 more in esams soon, BTW.
I guess we need a standard for how high the machine count has to be before multicast udp
beats pub/sub. I don't think 12 (4/dc) is it, though. ;-)
On Jul 25, 2012, at 3:36 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
On Tue, Jul 24, 2012 at 10:15 PM, Ori Livneh
<ori(a)wikimedia.org> wrote:
From vanadium (eqiad), I can connect to port 8649 on
cp300[1-2].esams.wikimedia.org,
which I presume is gmond. If we could open an additional port (bound to a zmq publisher
socket that makes the filtered log stream available for vanadium to subscribe to), that
would work.
The number of varnish servers will change, data centers get failed over, etc. I think
you'd want the publishers to establish the connection with vanadium, not the other way
around.
terry chay 최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum of all
knowledge. That's our commitment.”
p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tchay(a)wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay