On 17 June 2015 at 16:33, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
Hi!
Well, no, HDFS is a means to an end of storing data in a form that
can be cleaned with ETL processes so that /then/ they can go to the
somewhere/something - which is a lot of use cases but most prominently
our dashboards and ad-hoc research tasks.
Thanks for explaining more! I think I understand your concern better now.
With the renewed attention to WDQS productization, the point may be moot
soon, but in case it isn't, I just wanted to explore the possibility of
using the same infrastructure but with different inputs - or maybe the
possibility of building a bridge between HDFS and whatever we have in
labs. I'm not saying this necessarily makes sense, but if it doesn't,
I'd like to know why.
Thanks! It does make sense; at the moment the blocker is somewhere
wooly around analytics and ops. So, on the same infrastructure,
Analytics Engineering have indicated (iirc) they're comfortable
standing up an HDFS instance but not so keen on maintaining it
indefinitely. This makes total sense with their priorities. On
building the bridge; at the moment the HDFS cluster is very
deliberately firewalled. We'd need to deal with that (perhaps, as
suggested, making highly specific and authenticated holes in the
firewall?) before it was possible, and that seems to be an
Analytics/Opsen thing.
reinvent the wheel every time we build a thing.
If we can't do HDFS
and going to production isn't going to work, then let's talk about
what the alternatives are. Until then the use case is "the data being
in HDFS so that analysts can consume it" and higher-level use cases
are overthinking.
OK. Then if we go to production soon (hopefully) I assume we have an
existing workflow allowing us to get stuff to HDFS. If not, we _may_
(again, if that doesn't make sense, fine, but would like to hear the
reasons) explore the possibility of some process that would allow us to
get data from whatever we have now (which can be rather flexible) into
HDFS.
We do! If the WDQS queries are going through Production's varnish
caches (an existing cluster) they go in automatically. If they go
through a new frontend cluster, the cost of switching them in is
fairly small.
--
Stas Malyshev
smalyshev(a)wikimedia.org
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
--
Oliver Keyes
Research Analyst
Wikimedia Foundation