On 17 June 2015 at 16:33, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
Well, no, HDFS is a means to an end of storing data in a form that can be cleaned with ETL processes so that /then/ it can go to the somewhere/something - which is a lot of use cases but most prominently our dashboards and ad-hoc research tasks.
Thanks for explaining more! I think I understand your concern better now. With the renewed attention to WDQS productization, the point may be moot soon, but in case it isn't, I just wanted to explore the possibility of using the same infrastructure but with different inputs - or maybe the possibility of building a bridge between HDFS and whatever we have in labs. I'm not saying this necessarily makes sense, but if it doesn't, I'd like to know why.
Thanks! It does make sense; at the moment the blocker is somewhere wooly around analytics and ops. So, on the same infrastructure: Analytics Engineering have indicated (iirc) they're comfortable standing up an HDFS instance but not so keen on maintaining it indefinitely. This makes total sense with their priorities. On building the bridge: at the moment the HDFS cluster is very deliberately firewalled. We'd need to deal with that (perhaps, as suggested, making highly specific and authenticated holes in the firewall?) before it was possible, and that seems to be an Analytics/Opsen thing.
reinvent the wheel every time we build a thing. If we can't do HDFS and going to production isn't going to work, then let's talk about what the alternatives are. Until then the use case is "the data being in HDFS so that analysts can consume it" and higher-level use cases are overthinking.
OK. Then if we go to production soon (hopefully) I assume we have an existing workflow allowing us to get stuff to HDFS. If not, we _may_ (again, if that doesn't make sense, fine, but would like to hear the reasons) explore the possibility of some process that would allow us to get data from whatever we have now (which can be rather flexible) into HDFS.
We do! If the WDQS queries are going through Production's varnish caches (an existing cluster) they go in automatically. If they go through a new frontend cluster, the cost of switching them in is fairly small.
-- Stas Malyshev smalyshev@wikimedia.org
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search