On Tue, Feb 12, 2013 at 10:46 PM, Matthew Walker <mwalker@wikimedia.org> wrote:
Asher,

Fundraising has some stuff that's partially managed by Jeff -- i.e., we consume a UDP log stream, aggregate it, and load it into a database. Is this something that's also under your scheme? Or are you currently only looking at Hadoop data flows?

Regardless of whether the ideas presented in the RFC are acted upon or not (it's extremely unlikely), the existing logging infrastructure relied upon by fundraising will continue as-is for the near/medium term. I do expect all of the legacy udplog + stream filter infrastructure to be replaced by the distributed infrastructure, but only after a period of coexistence and incremental migration of functionality.
 
Some additional questions:
1) Right now if I want an additional UDP log stream I ask Jeff and he does magic. Is this flexibility going to change? If so, how?

Eventually, yes. The request for a new log stream in this context really means "start saving / processing a portion of the log stream matching pattern X" that would otherwise, for the most part, be dropped on the floor. The actual UDP log stream going across the network comprises every request to all of our domains except for bits.wikimedia.org. From that, 0.01% is logged to disk, plus anything a team specifically requests, be it 100% of requests containing "action=edit", 10% of requests from IPs originating in a specific region of the world, or the things fundraising requests from Jeff, such as making sure requests related to banner impressions and clickthroughs are specifically captured.
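
To make that concrete, here is a rough Python sketch of what the per-team filtering boils down to. This isn't the actual filter code; the sample file names and the patterns are made up for illustration:

    #!/usr/bin/env python
    """Illustrative sketch only -- not the real stream filter implementation.
    Reads request log lines from stdin and emulates the current scheme:
    keep a 1-in-10,000 baseline sample of everything, plus 100% of lines
    matching patterns that teams have specifically asked for."""
    import random
    import sys

    SAMPLE_RATE = 1.0 / 10000      # the 0.01% baseline sample
    REQUESTED_PATTERNS = [         # hypothetical per-team capture rules
        "action=edit",
        "BannerLoader",            # placeholder for fundraising's banner requests
    ]

    sampled = open("sampled.log", "a")
    requested = open("requested.log", "a")
    for line in sys.stdin:
        if random.random() < SAMPLE_RATE:
            sampled.write(line)
        if any(p in line for p in REQUESTED_PATTERNS):
            requested.write(line)

The baseline sample and each requested slice end up in their own files on disk, and everything else is dropped.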

The goal of the distributed infrastructure is to write everything to disk unsampled. There isn't a parallel to asking for a new banner log stream, as that data will be saved and available for analysis by default. Instead, you'd make sure that queries only examine the relevant requests.
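
As a sketch of what "queries only examine the relevant requests" could look like over the full, unsampled logs in HDFS, here is a toy Hadoop streaming mapper in Python. The field layout and the banner marker are assumptions for the example, not the real log format; pair it with a reducer that sums the counts per key:

    #!/usr/bin/env python
    """Toy Hadoop streaming mapper -- field positions and the banner marker
    are assumptions, not the real log format.  Emits one (url, 1) pair per
    banner-related request found in the unsampled request logs."""
    import sys

    BANNER_MARKER = "BannerLoader"              # hypothetical banner-request marker

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")  # assuming tab-delimited log lines
        if len(fields) <= 8:
            continue
        url = fields[8]                         # assumed position of the request URL
        if BANNER_MARKER in url:
            print("%s\t1" % url)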
 
2) Something that analytics doesn't offer right now is regularly scheduled big jobs, but they seemed to be working towards it -- your point 3 seems to preclude this; or was it specifically just data transfer jobs that you're against?

Job scheduling would definitely be an offered service before the system could be considered feature complete, preferably via a system better suited to distributed compute environments than cron. Data transfer jobs are fine too, so long as a scheduled transfer is appropriate for the type of data. It isn't for the request log stream, but I can imagine a regularly scheduled job importing data from the recentchanges table of various wikis, for example.
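
As a sketch of the kind of scheduled import job I have in mind (whatever ends up driving the schedule), something along these lines would pull a day of recentchanges per wiki and push it into HDFS. The wiki list, query, and paths are all placeholders:

    #!/usr/bin/env python
    """Illustrative scheduled import -- wiki names, query, and HDFS paths are
    placeholders; the scheduler itself (cron, Oozie, whatever) lives outside
    this script."""
    import datetime
    import subprocess
    import tempfile

    WIKIS = ["enwiki", "dewiki"]              # placeholder database names
    HDFS_BASE = "/wmf/raw/recentchanges"      # placeholder HDFS location

    def import_wiki(wiki, day):
        query = ("SELECT rc_timestamp, rc_namespace, rc_title, rc_type "
                 "FROM recentchanges WHERE rc_timestamp LIKE '%s%%'"
                 % day.strftime("%Y%m%d"))
        with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
            # Dump one day of recentchanges to a local TSV file...
            subprocess.check_call(["mysql", "--batch", "--raw", wiki, "-e", query],
                                  stdout=tmp)
            local_path = tmp.name
        # ...then push it into HDFS under a per-wiki, per-day path.
        dest = "%s/%s/%s" % (HDFS_BASE, wiki, day.strftime("%Y-%m-%d"))
        subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", dest])
        subprocess.check_call(["hadoop", "fs", "-put", local_path, dest])

    if __name__ == "__main__":
        yesterday = datetime.date.today() - datetime.timedelta(days=1)
        for wiki in WIKIS:
            import_wiki(wiki, yesterday)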
 
3) With regards to my original question about fundraising's current workflow -- I was hoping that in the future I could actually expose the aggregate banner/landing page counts to the external world so that, at the very least, other fundraising chapters could use the data. Is that something that operations would be able to support? (Obviously not the software side of it, but the slave DB load.)

I think this is out of scope, and it's unclear what the most efficient infrastructure might look like in order to offer this in the future. Slave DBs might not be in the picture; the data would more likely be regularly generated via map reduce jobs, with the output perhaps temporarily persisted in a datastore powering a webapp. But regardless, I would expect this to be supportable.
 

Thanks,

~Matt Walker


On Tue, Feb 12, 2013 at 4:22 PM, Asher Feldman <afeldman@wikimedia.org> wrote:
Howdy,

After having spent some time reviewing the analytics GitHub repo and playing observer to the quarterly review last December and to today's security/architecture mixup, I have a few opinions and suggestions that I'd like to share. They may upset some or step on toes. Sorry about that.

Main suggestion: all logging, ETL, storage, and compute infrastructure should be owned, implemented, and maintained by the operations team. There should be a clear set of deliverables for ops: the entirety of the current UDP stream ingested, processed via an extensible ETL layer with a minimum of IP anonymization in place, and stored in HDFS in a standardized format with logical access controls. Technology and implementation choices should ultimately rest with ops so long as all deliverables are met, though external advice and assistance (including from industry experts outside of WMF) will be welcome and solicited.
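
Just to illustrate the "minimum of IP anonymization" piece of that ETL layer, a transform step could be as small as the following Python sketch. The field position and the choice of a keyed, truncated hash are placeholders, not a concrete proposal:

    #!/usr/bin/env python
    """Illustrative ETL transform -- the field position and the keyed,
    truncated hash are placeholder choices, not a proposal."""
    import hashlib
    import hmac
    import sys

    SECRET_KEY = b"rotate-me-regularly"   # placeholder; a real key would be managed by ops
    IP_FIELD = 4                          # assumed position of the client IP in each record

    def anonymize_ip(ip):
        # A keyed hash keeps records joinable per client without storing the raw address.
        return hmac.new(SECRET_KEY, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > IP_FIELD:
            fields[IP_FIELD] = anonymize_ip(fields[IP_FIELD])
        sys.stdout.write("\t".join(fields) + "\n")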

The analytics team owns everything above this. Is Pig the best tool to analyze log data in HDFS? Does Hive make sense for some things? Want to add and analyze wiki revisions via map reduce jobs? Visualize everything imaginable? Add more sophisticated transforms to the ETL pipeline? Go, go, analytics!

I see the work accomplished to date under the heading of Kraken as falling into three categories:

1) Data querying. This includes Pig integration, repeatable queries run via Pig, and ad hoc map reduce jobs written by folks like Diederik to analyze the data. While modifications may be needed if there are changes to how data is stored in HDFS (such as file name conventions or format) or to access controls, this category of work isn't tied to infrastructure details and should be reusable on any generic Hadoop implementation containing WMF log data.

2) Devops work. This includes everything Andrew Otto has done to puppetize various pieces of the existing infrastructure. I'd consider all of this experimental. Some might be reusable, some may need refactoring, and some should be chalked up as a learning exercise and abandoned. Even if the majority were to fall under that last category, this has undoubtedly been a valuable learning experience. Were Andrew to join the ops team and collaborate with others on a from-scratch implementation (let's say I'd prefer us using the beta branch of actual Apache Hadoop instead of Cloudera), I'm sure the experience he's gained to date will be of use to all.

3) Bound for Mordor. Never happened, never to be spoken of again. This includes things like the map reduce job executed via cron to transfer data from Kafka to HDFS, and... oh wait, never happened, never to be spoken of again.

Unless I'm missing something major, I don't see any reason not to pursue this new approach, nor does it appear that any significant amount of work would be lost. Instead, the most useful bits (category 1) should still be useful. And since that seems to be where analytics has been most successful, perhaps it makes sense to let them focus fully on this sort of thing instead of on infrastructure.

-Asher


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics