On Tue, Feb 12, 2013 at 10:46 PM, Matthew Walker <mwalker(a)wikimedia.org> wrote:
Asher,
Fundraising has some stuff that's partially managed by Jeff -- i.e., we
consume a UDP log stream, aggregate it, and load it into a database. Is
this something that's also under your scheme? Or are you currently only
looking at hadoop data flows?
Regardless of whether the ideas presented in the RFC are acted upon or not
(it's extremely unlikely they won't be), the existing logging infrastructure
relied upon by fundraising will continue as-is for the near/medium term. I
do expect all of the legacy udplog + stream filter infrastructure to be
replaced by the distributed infrastructure, but only after a period of
coexistence and incremental migration of functionality.
Some additional questions:
1) Right now if I want an additional UDP log stream I ask Jeff and he does
magic. Is this flexibility going to change? If so, how?
Eventually, yes. The request for a new log stream in this context really
means "start saving / processing a portion of the log stream matching
pattern X" that would otherwise for the most part be dropped on the floor.
The actual udp log stream going across the network comprises every
request to all of our domains except for bits.wikimedia.org. From that,
0.01% are logged to disk, plus anything a team specifically requests, be it
100% of requests containing "action=edit", 10% from IPs originating from a
specific region of the world, or the things fundraising requests from Jeff,
such as making sure requests related to banner impressions and clickthroughs
are specifically captured.
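To make the sampling scheme above concrete, here's a minimal sketch of what such a stream filter might do, assuming a line-oriented log and a 0.01% baseline sample. The pattern list and log lines are purely illustrative, not the actual filter configuration.

```python
import random

SAMPLE_RATE = 0.0001  # the 0.01% baseline written to disk
PATTERNS = ["action=edit"]  # team-requested patterns (illustrative)

def keep(line, rng=random.random):
    """Decide whether a request log line gets written to disk."""
    if any(p in line for p in PATTERNS):
        return True  # 100% of specifically requested matches are kept
    return rng() < SAMPLE_RATE  # everything else falls under the sample

lines = [
    "GET /w/index.php?action=edit HTTP/1.1",
    "GET /wiki/Main_Page HTTP/1.1",
]
# Deterministic rng for the demo: 0.5 >= SAMPLE_RATE, so only the
# pattern match survives.
kept = [l for l in lines if keep(l, rng=lambda: 0.5)]
```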
The goal of the distributed infrastructure is to write everything to disk
unsampled. There isn't a parallel to asking for a new banner log stream as
that data will be saved and available for analysis by default. Instead
you'd make sure that queries only examine relevant requests.
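In other words, the selection moves from ingest time to query time. A hedged sketch of the difference, using hypothetical record fields and URL paths rather than the real log schema:

```python
# With everything stored unsampled, "give me a banner log stream" becomes
# a query-time filter over the full data set. Field names and the
# /banner/ path convention here are hypothetical stand-ins.
records = [
    {"url": "/banner/B13_fundraiser?click=1", "status": 200},
    {"url": "/wiki/Main_Page", "status": 200},
    {"url": "/banner/B13_fundraiser", "status": 200},
]

# A query examines only the relevant requests; nothing had to be
# provisioned ahead of time for this to work.
banner_hits = [r for r in records if r["url"].startswith("/banner/")]
```

In a Hadoop setting the same idea would be a Pig or map-reduce filter over the stored logs rather than an in-memory list comprehension.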
2) Something that analytics doesn't offer right now is regularly scheduled
big jobs, but they seemed to be working towards it -- your point 3 seems to
preclude this; or was it specifically just data transfer jobs that you're
against?
Job scheduling would definitely be an offered service before the system
could be considered feature complete. Preferably via a system more suited
for distributed compute environments than cron. Data transfer jobs are
fine too, so long as a scheduled transfer is appropriate for the type of
data. It isn't for the request log stream, but I can imagine a regularly
scheduled job importing data from the recentchanges table of various wikis
for example.
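A sketch of the kind of scheduled transfer described above: a job body that a scheduler (cron, or a distributed equivalent) would invoke periodically to pull rows from a wiki's recentchanges table. Both helper names are hypothetical stand-ins, and the real job would write its batch to HDFS rather than return a string.

```python
import json

def fetch_recentchanges(wiki, since):
    # Placeholder for a real query against the wiki's recentchanges
    # table; returns rows newer than the given timestamp.
    return [{"wiki": wiki, "rc_timestamp": since, "rc_title": "Example"}]

def run_import(wiki, since):
    """One scheduled run: fetch a batch and serialize it for storage."""
    rows = fetch_recentchanges(wiki, since)
    # A real job would append this to a file in HDFS; here we just
    # produce the newline-delimited JSON batch.
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)

batch = run_import("enwiki", "20130212000000")
```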
3) With regards to my original question about
fundraising's current
workflow -- I was hoping that in the future I could actually expose the
aggregate banner/landing page counts to the external world so that at the
very least other fundraising chapters could use the data. Is that something
that operations would be able to support? (obviously not the software side
of it, but the slave DB load)
I think this is out of scope, and it's unclear what the most efficient
infrastructure might look like in order to offer this in the future. Slave
DBs might not be in the picture. The data would more likely be regularly
generated via map reduce jobs and the output perhaps temporarily persisted
in a datastore powering a webapp. But regardless, I would expect this to
be supportable.
Thanks,
~Matt Walker
On Tue, Feb 12, 2013 at 4:22 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
Howdy,
After having spent some time reviewing the analytics github repo and
playing observer to the quarterly review last December, and today's
security/architecture mixup, I have a few opinions and suggestions that I'd
like to share. They may upset some or step on toes. Sorry about that.
Main suggestion - all logging, etl, storage, and compute infrastructure
should be owned, implemented, and maintained by the operations team. There
should be a clear set of deliverables for ops: the entirety of the current
udp stream ingested, processed via an extensible etl layer with a minimum
of IP anonymization in place, and stored in hdfs in a standardized format
with logical access controls. Technology and implementation choices should
ultimately rest with ops so long as all deliverables are met, though
external advice and assistance (including from industry experts outside of
wmf) will be welcome and solicited.
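Of the deliverables above, the "minimum of IP anonymization" step in the ETL layer is the most self-contained, so here is a minimal sketch of one possible approach: a salted keyed hash of the address. The actual scheme (truncation vs. hashing, salt rotation policy) is a policy choice the message doesn't specify; everything here is illustrative.

```python
import hashlib

def anonymize_ip(ip, salt=b"rotating-secret"):
    """Replace an IP with a salted hash so records remain correlatable
    within a salt period but the raw address is never stored."""
    return hashlib.sha256(salt + ip.encode()).hexdigest()[:16]

def etl(record):
    """One ETL transform: copy the record, anonymizing the IP field."""
    record = dict(record)
    record["ip"] = anonymize_ip(record["ip"])
    return record

out = etl({"ip": "203.0.113.42", "url": "/wiki/Main_Page"})
```

Other transforms in the pipeline would compose the same way: pure functions over a record, applied before anything lands in HDFS.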
The analytics team owns everything above this. Is pig the best tool to
analyze log data in hdfs? Does hive make sense for some things? Want to
add and analyze wiki revisions via map reduce jobs? Visualize everything
imaginable? Add more sophisticated transforms to the etl pipeline? Go,
go, analytics!
I see the work accomplished to date under the heading of kraken as
falling into three categories:
1) Data querying. This includes pig integration, repeatable queries run
via pig, and ad hoc map reduce jobs meant to analyze data written by folks
like Diederik. While modifications may be needed if there are changes to
how data is stored in hdfs (such as file name conventions or format) or to
access controls, this category of work isn't tied to infrastructure details
and should be reusable on any generic hadoop implementation containing wmf
log data.
2) Devops work. This includes everything Andrew Otto has done to
puppetize various pieces of the existing infrastructure. I'd consider all
of this experimental. Some might be reusable, some may need refactoring,
some should be chalked up as a learning exercise and abandoned. Even if
the majority was to fall under that last category, this has undoubtedly
been a valuable learning experience. Were Andrew to join the ops team and
collaborate with others on a from-scratch implementation (let's say I'd
prefer us using the beta branch of actual apache hadoop instead of
cloudera), I'm sure the experience he's gained to date will be of use to
all.
3) Bound for mordor. Never happened, never to be spoken of again. This
includes things like the map reduce job executed via cron to transfer data
from kafka to hdfs, and... oh wait, never happened, never to be spoken of
again.
Unless I'm missing anything major, I don't see any reasons not to pursue
this new approach, nor does it appear that any significant amount of work
would be lost. Instead, the most useful bits (category 1) should still be
useful. And since that seems to be where analytics has been most
successful, perhaps it makes sense to let them focus fully on this sort of
thing instead of infrastructure.
-Asher
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics