Re: [Analytics] db1047 & one box to rule them all

30 Apr 2014

Not quite there yet - just pointing to it as a potential blocker to the
"let's move everything to Hadoop!" idea (which I fully support). If the
goal is to enable research using unified data, but the unified data is more
difficult to access than the non-unified data, we probably haven't moved
the needle enough to justify it. "A sane way to access this stuff from
Python and R" should probably be considered a pretty firm prerequisite,
because without that, the utility isn't tremendously increased.

On 30 April 2014 09:42, Toby Negrin <tnegrin(a)wikimedia.org> wrote:

...
 I think we'll put everything on Hadoop at some point but we're focusing on
 the page views now.

 Regarding the bug - if you're ready to use it I can see if Andrew can
 install the java package.

 -Toby

 On Apr 30, 2014, at 9:34 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:

 On 30 April 2014 06:59, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:

  This is awesome, thank you Sean

  This is probably my bad, but I understood the goal to be having a
  single db containing unified, core tables. So, we'd have one db, with one
  revision table, that'd have an extra column of "wiki" that denoted the
  project the entry referred to. This would let us perform global queries
  without the complex UNIONs mentioned above. Is this still the goal, or...?

 No, that wasn't the goal. Sorry if there was miscommunication. The
 actual data will remain in separate wikis using regular replication.

 However, it's quite possible to create one or more unified databases
 with (for example) SQL VIEWs that union all tables from a set of
 pre-defined wikis, with 'wiki' columns, just as you describe. Same thing,
 really. We could even allow ad-hoc creation of unified views for whatever
 .dblist is appropriate for the project. I don't think anything need be
 ruled out yet -- that's the whole point of SQL, right? Slow, but flexible.
 :-)
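
As a rough illustration of the unified-view idea described above, here is a
minimal Python sketch that builds the CREATE VIEW statement for one core
table across a list of wikis. The function name unified_view_sql, the
unified_<table> view naming, and the assumption that each wiki lives in a
database named after its dbname (enwiki, dewiki, ...) are all hypothetical
and not taken from this thread; the wiki list would presumably come from
whichever .dblist is appropriate.

```python
import re

# Hypothetical helper: nothing here is an existing Wikimedia tool.
_IDENT = re.compile(r"[A-Za-z0-9_]+")


def unified_view_sql(table, wikis, view_name=None):
    """Return a CREATE VIEW statement that unions `table` across `wikis`.

    Each branch of the union adds a literal 'wiki' column naming the
    database the row came from, so global queries can filter or group
    by project without hand-written UNIONs.
    """
    if not wikis:
        raise ValueError("need at least one wiki")
    view_name = view_name or f"unified_{table}"
    # Reject anything that is not a plain identifier before it is
    # interpolated into SQL text.
    for name in [table, view_name, *wikis]:
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    selects = [
        f"SELECT '{wiki}' AS wiki, t.* FROM `{wiki}`.`{table}` AS t"
        for wiki in wikis
    ]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW `{view_name}` AS\n{body};"


if __name__ == "__main__":
    # Assumed example wikis; in practice read these from a .dblist file.
    print(unified_view_sql("revision", ["enwiki", "dewiki"]))
```

UNION ALL is used rather than UNION because rows from different wikis are
distinct by construction, so there is no need to pay for de-duplication.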

 That would work; Oliver is right that creating views for core tables in
 pre-defined wikis (say, all wikipedias) would be valuable. Sean, how about
 we create a page on wikitech with requirements for these views and we take
 it from there?

 Union-ified views sound great here.  Let's see how they perform.  I bet
 they'll be fine but if they're not, maybe we can throw them into Hadoop?
  Using the views to do the MySQL -> Hadoop replication would be so much
 easier than going to each database individually.

 Totally down for that, but... 
https://bugzilla.wikimedia.org/show_bug.cgi?id=64262

  _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
