Howdy,
Having spent some time reviewing the analytics GitHub repo, observing the quarterly review last December, and sitting through today's security/architecture mixup, I have a few opinions and suggestions I'd like to share. They may upset some people or step on some toes; sorry about that.
Main suggestion - all logging, ETL, storage, and compute infrastructure should be owned, implemented, and maintained by the operations team. There should be a clear set of deliverables for ops: the entirety of the current UDP stream ingested, processed via an extensible ETL layer with, at a minimum, IP anonymization in place, and stored in HDFS in a standardized format with logical access controls. Technology and implementation choices should ultimately rest with ops so long as all deliverables are met, though external advice and assistance (including from industry experts outside of WMF) will be welcome and solicited.
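For concreteness, here's a rough sketch of what the minimal anonymization step in that ETL layer could look like (Python, illustration only; the tab-separated layout, the position of the client IP field, and the keyed-hash scheme are placeholders, not a spec):

    # Illustrative sketch only -- not actual Kraken/udp2log code. Assumes a
    # tab-separated request log with the client IP in the 5th field; the real
    # layout and anonymization policy would be up to ops to define.
    import hashlib
    import hmac
    import sys

    SALT = b"rotate-me-regularly"  # hypothetical secret, rotated on a schedule

    def anonymize_ip(ip: str) -> str:
        """Replace a raw IP with a keyed hash so it can't be trivially reversed."""
        return hmac.new(SALT, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

    def transform(line: str, ip_field: int = 4) -> str:
        """One ETL step: swap the raw client IP for its anonymized form."""
        fields = line.rstrip("\n").split("\t")
        if len(fields) > ip_field:
            fields[ip_field] = anonymize_ip(fields[ip_field])
        return "\t".join(fields)

    if __name__ == "__main__":
        for raw in sys.stdin:
            sys.stdout.write(transform(raw) + "\n")

Whatever ops actually builds, the point is that the transform layer stays pluggable so stricter anonymization can be swapped in later without touching ingestion or storage.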
The analytics team owns everything above that layer. Is Pig the best tool to analyze log data in HDFS? Does Hive make sense for some things? Want to add and analyze wiki revisions via MapReduce jobs? Visualize everything imaginable? Add more sophisticated transforms to the ETL pipeline? Go, go, analytics!
I see the work accomplished to date under the heading of Kraken as falling into three categories:
1) Data querying. This includes Pig integration, repeatable queries run via Pig, and ad hoc MapReduce jobs written by folks like Diederik to analyze the data (a rough sketch of this kind of job follows the list). While modifications may be needed if there are changes to how data is stored in HDFS (such as file name conventions or format) or to access controls, this category of work isn't tied to infrastructure details and should be reusable on any generic Hadoop implementation containing WMF log data.
2) Devops work. This includes everything Andrew Otto has done to puppetize various pieces of the existing infrastructure. I'd consider all of this experimental. Some might be reusable, some may need refactoring, and some should be chalked up as a learning exercise and abandoned. Even if the majority were to fall under that last category, this has undoubtedly been a valuable learning experience. Were Andrew to join the ops team and collaborate with others on a from-scratch implementation (say, I'd prefer we use the beta branch of upstream Apache Hadoop instead of Cloudera's distribution), I'm sure the experience he's gained to date will be of use to all.
3) Bound for Mordor. Never happened, never to be spoken of again. This includes things like the MapReduce job executed via cron to transfer data from Kafka to HDFS, and... oh wait, never happened, never to be spoken of again.
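As an illustration of category 1, here's the shape of a repeatable query expressed as a Hadoop Streaming job in Python rather than Pig (the work to date has been Pig-based); the log layout, with the requested URL in the 9th tab-separated field, is a placeholder, not the real schema:

    # Sketch of a category-1 "repeatable query": count requests per hostname.
    # Run as a Hadoop Streaming job with this file as both mapper ("map" arg)
    # and reducer (no arg). Field positions are illustrative assumptions.
    import sys
    from urllib.parse import urlsplit

    URL_FIELD = 8  # hypothetical position of the request URL

    def mapper():
        # Emit "hostname<TAB>1" for every request line on stdin.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > URL_FIELD:
                host = urlsplit(fields[URL_FIELD]).hostname or "unknown"
                print(f"{host}\t1")

    def reducer():
        # Streaming delivers mapper output sorted by key; sum per hostname.
        current, total = None, 0
        for line in sys.stdin:
            host, _, count = line.rstrip("\n").partition("\t")
            if host != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = host, 0
            total += int(count or 0)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The same logic is a few lines of Pig; the point is only that none of it depends on how the cluster underneath is provisioned.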
Unless I'm missing something major, I don't see any reason not to pursue this new approach, nor does it appear that a significant amount of work would be lost. The most useful bits (category 1) should carry over intact, and since that seems to be where analytics has been most successful, perhaps it makes sense to let the team focus fully on that sort of work instead of on infrastructure.
-Asher
Asher,
Fundraising has some stuff that's partially managed by Jeff -- i.e., we consume a UDP log stream, aggregate it, and load it into a database. Is this something that's also covered by your scheme? Or are you currently only looking at Hadoop data flows?
Some additional questions:
1) Right now if I want an additional UDP log stream, I ask Jeff and he does magic. Is this flexibility going to change? If so, how?
2) Something that analytics doesn't offer right now is regularly scheduled big jobs, but they seemed to be working towards it -- your point 3 seems to preclude this; or was it specifically just data transfer jobs that you're against?
3) With regard to my original question about fundraising's current workflow -- I was hoping that in the future I could expose the aggregate banner/landing page counts to the external world so that, at the very least, other fundraising chapters could use the data. Is that something that operations would be able to support? (Obviously not the software side of it, but the slave DB load.)
Thanks,
~Matt Walker
On Tue, Feb 12, 2013 at 10:46 PM, Matthew Walker mwalker@wikimedia.org wrote:
Asher,
Fundraising has some stuff that's partially managed by Jeff -- i.e., we consume a UDP log stream, aggregate it, and load it into a database. Is this something that's also covered by your scheme? Or are you currently only looking at Hadoop data flows?
Regardless of whether the ideas presented in the RFC are acted upon or not (it's extremely, extremely unlikely), the existing logging infrastructure relied upon by fundraising will continue as-is for the near/medium term. I do expect all of the legacy udplog + stream filter infrastructure to be replaced by the distributed infrastructure, but only after a period of coexistence and incremental migration of functionality.
Some additional questions:
1) Right now if I want an additional UDP log stream, I ask Jeff and he does magic. Is this flexibility going to change? If so, how?
Eventually, yes. The request for a new log stream in this context really means "start saving / processing the portion of the log stream matching pattern X" that would otherwise, for the most part, be dropped on the floor. The actual UDP log stream going across the network comprises every request to all of our domains except for bits.wikimedia.org. From that, 0.01% of requests are logged to disk, plus anything a team specifically requests, be it 100% of requests containing "action=edit", 10% of requests from IPs originating in a specific region of the world, or the things fundraising requests from Jeff, such as making sure requests related to banner impressions and clickthroughs are specifically captured.
The goal of the distributed infrastructure is to write everything to disk unsampled. There isn't a parallel to asking for a new banner log stream, because that data will be saved and available for analysis by default. Instead, you'd make sure that your queries only examine the relevant requests.
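To illustrate the difference (sketch only; the URL field position and the patterns matched below are illustrative assumptions, not the actual log schema or filter rules): instead of asking for a dedicated banner stream, a job over the unsampled logs would simply filter for what it cares about, e.g. in its map phase:

    # Sketch: with unsampled logs in HDFS there's no separate "banner stream"
    # to request; a job just filters for the requests it needs. The URL field
    # position and patterns here are illustrative assumptions.
    import sys

    URL_FIELD = 8  # hypothetical position of the request URL in each log line
    BANNER_MARKERS = ("Special:BannerLoader", "Special:RecordImpression")

    def is_banner_request(url: str) -> bool:
        return any(marker in url for marker in BANNER_MARKERS)

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > URL_FIELD and is_banner_request(fields[URL_FIELD]):
            sys.stdout.write(line)

The same filter expressed in Pig or Hive would be a one-line FILTER/WHERE clause.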
2) Something that analytics doesn't offer right now is regularly scheduled big jobs, but they seemed to be working towards it -- your point 3 seems to preclude this; or was it specifically just data transfer jobs that you're against?
Job scheduling would definitely be an offered service before the system could be considered feature complete, preferably via a system better suited to distributed compute environments than cron. Data transfer jobs are fine too, so long as a scheduled transfer is appropriate for the type of data. It isn't for the request log stream, but I can imagine, for example, a regularly scheduled job importing data from the recentchanges table of various wikis.
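As a sketch of that kind of scheduled import (not a proposal for specific tooling; the connection details, column list, and HDFS path are all placeholders), a job run periodically by whatever scheduler we settle on might look like:

    # Hypothetical scheduled import: dump yesterday's recentchanges rows to a
    # local TSV and push them into HDFS. Connection details, columns, and
    # paths are placeholders; the real job would be defined by its owner.
    import csv
    import os
    import subprocess
    from datetime import date, timedelta

    import pymysql  # assumes a MySQL driver is available on the job runner

    def export_recentchanges(host, db, outfile):
        since = (date.today() - timedelta(days=1)).strftime("%Y%m%d000000")
        conn = pymysql.connect(host=host, db=db,
                               read_default_file=os.path.expanduser("~/.my.cnf"))
        try:
            with conn.cursor() as cur, open(outfile, "w", newline="") as fh:
                cur.execute(
                    "SELECT rc_timestamp, rc_namespace, rc_title, rc_type "
                    "FROM recentchanges WHERE rc_timestamp >= %s", (since,))
                csv.writer(fh, delimiter="\t").writerows(cur.fetchall())
        finally:
            conn.close()

    def push_to_hdfs(outfile, hdfs_dir):
        subprocess.run(["hdfs", "dfs", "-put", "-f", outfile, hdfs_dir],
                       check=True)

    if __name__ == "__main__":
        export_recentchanges("db-host.example", "enwiki", "/tmp/rc.tsv")
        push_to_hdfs("/tmp/rc.tsv", "/wmf/raw/recentchanges/enwiki/")

Something like Oozie could own the schedule and retries rather than cron, but that's exactly the kind of choice the infrastructure owner should make.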
3) With regard to my original question about fundraising's current workflow -- I was hoping that in the future I could expose the aggregate banner/landing page counts to the external world so that, at the very least, other fundraising chapters could use the data. Is that something that operations would be able to support? (Obviously not the software side of it, but the slave DB load.)
I think this is out of scope for now, and it's unclear what the most efficient infrastructure might look like in order to offer this in the future. Slave DBs might not be in the picture; the data would more likely be regularly generated via MapReduce jobs, with the output perhaps temporarily persisted in a datastore powering a webapp. But regardless, I would expect this to be supportable.
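Just to make that concrete (sketch only; the TSV layout, file path, and the use of a flat file rather than a real datastore are placeholder assumptions), the externally visible piece could be as small as something that serves the pre-aggregated counts a periodic job produces:

    # Sketch of the "output persisted somewhere powering a webapp" idea: serve
    # pre-aggregated banner counts (as produced by a periodic MapReduce job)
    # as JSON. The TSV layout and file path are hypothetical; a real service
    # would likely read from a proper datastore instead of a flat file.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    AGGREGATE_FILE = "/srv/aggregates/banner_counts.tsv"  # placeholder path

    def load_counts(path):
        counts = {}
        with open(path) as fh:
            for line in fh:
                banner, _, count = line.rstrip("\n").partition("\t")
                counts[banner] = int(count or 0)
        return counts

    class CountsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(load_counts(AGGREGATE_FILE)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), CountsHandler).serve_forever()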
Initial thoughts -
At a high level, we need to make sure that our analytics engineering work is organized in a manner that is a) architecturally sane; b) iterative; and c) customer-driven.
The relatively high level of detachment of the analytics cluster work from ops (partially a function of how the team is situated in the org, partially the result of a strong desire for expediency that has led to some bad habits) is clearly not sustainable and won't satisfy a) or b).
So there's no question that the structure of the work needs to change going forward. I doubt anyone in analytics would disagree with that. :-)
Moreover, we've already identified a clear near-term deliverable: a minimally viable Hadoop cluster setup that passes architectural muster, with availability of the data currently in HDFS, so that existing research (mobile PV + Wikipedia Zero analysis) can continue and be expanded upon.
My opening view is that we should strive to empower a cross-functional (i.e. ops + analytics) team to provide this deliverable as it sees fit, and ensure that said team is staffed so that its decisions will generally be trusted and accepted by both ops and analytics. This, IMO, is the best way to satisfy all three criteria above. If that approach sounds viable, then we'll just need to decide who's on that team and trust them to do the work -- they would decide, ideally by consensus, how much of the work done so far to reuse versus reboot.
I want to emphasize that the Hadoop cluster and related infrastructure are just a small part of the analytics engineering work we need to do. So we shouldn't agonize too much about who's part of that effort, as long as we think it's a viable team that will get the job done with minimal drama and at a satisfactory level of quality.
Oddly enough, I think all of the above is consistent with the outcomes of the meeting today. :-)
As for the long-term ownership and support of the system, again, I think we just need to break down the responsibilities a bit more. Clearly we'll want ops to be in a position to maintain analytics services like any other service. There may be larger changes that again require more cross-functional work. And there may be parts of cluster administration and policy that can be handled entirely by analytics.
Erik