Well, no, HDFS is a means to the end of storing data in a form that can be cleaned with ETL processes so that /then/ it can go to the somewhere/something - which covers a lot of use cases, but most prominently our dashboards and ad-hoc research tasks.
Let me be clear here that this isn't a theoretical exercise existing in a vacuum; we do not want /an/ answer that can be hooked up to the dashboards. That's easy. That's a hideous shell script that scps nginx files over. We want an answer that can be hooked up to the dashboards for many, many, many things, because we don't just want metrics and analytics for WDQS; we also want them for the production API and for user events and for the cirrus logs and for high-level KPIs, and those are just the things we've wanted this month.
I can't be building out an entirely new pipeline every single time someone builds a thing. That's not an efficient use of our analysts' time and it massively increases the chance that something will go wrong. I'm not asking for an alternative to HDFS, because I don't want to be doing that. I'm asking for HDFS because then we don't need to reinvent the wheel every time we build a thing. If we can't do HDFS and going to production isn't going to work, then let's talk about what the alternatives are. Until then the use case is "the data being in HDFS so that analysts can consume it", and higher-level use cases are overthinking.
On 17 June 2015 at 03:22, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
The problem, as we've gone back and forth about for a while on phabricator, is that labs has absolutely zero inbuilt infrastructure for analytics.
If things are in production they go through the frontend varnishes, which are hooked up to HDFS, and all is fine. We have the request logs. If things are in labs...nothing. There is no access to HDFS, there is no consistent varnish setup that pipes things there, and analytics engineering has pretty much no plans to set up that sort of infrastructure.
Right. What I am still missing is that HDFS, varnish, etc. are means to an end, the end being delivering info (in this case, usage logs) somewhere, and then doing something. So right now I do not have a clear picture of what that somewhere/something is, and what data it consumes in what form. Maybe if I were more up to speed on this - or at least understood what inputs are required and which forms of these inputs are acceptable - I would have a better picture.
--
Stas Malyshev smalyshev@wikimedia.org
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search