On Apr 30, 2014, at 8:40 AM, Sean Pringle <springle(a)wikimedia.org> wrote:
On Wed, Apr 30, 2014 at 12:44 PM, Oliver Keyes
<okeyes(a)wikimedia.org> wrote:
Okay, so, have tested (to a limited degree. The work I'm doing that involves the dbs
involves eventlogging, so this is mostly me making up excuses to run queries). Thoughts:
*We should probably put in some kind of restrictions around what we care about. For
example, I see the tables relating to the Wikimania and Arbcom wikis in there. This is not
data I think we're ever going to care about, but it is data, which means we'll
either have to write really complex UNIONs to gather global data, with a
constantly-maintained list of dbs-we-don't-care-about, or accept inaccuracies in our
data. My suggestion would be for these dbs to be removed and excluded from replication,
using the noc dblists to identify the ones we don't care about; generally, "deleted",
"closed", "special", and "wikimedia" wikis aren't things we want to be running queries over.
If there are wikis you guys know for sure nobody using the 'research' user will ever
want, then they can simply be hidden by modifying the account grants.
Oliver, I am not sure how we define “data we’re [n]ever going to care about”. I do expect
we will receive occasional requests for data related to closed or special wikis (see
https://office.wikimedia.org/wiki/File:Officewiki_ae.png just to mention a recent
example).
The point about global queries is well taken, but I think it should be handled differently
(see below). Since we’re not talking about privacy here (uncensored data can be obtained
by anyone with access to the production DBs), but usability, I’d avoid making assumptions
about which wikis should *always* be excluded. We should have an equivalent of the API’s
sitematrix with project metadata to allow flexible filtering.
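
To make the filtering idea concrete, here is a minimal Python sketch of selecting wikis from project metadata rather than from a hard-coded exclusion list. The `WikiInfo` record, its fields, and the catalog entries are all hypothetical; they are not the sitematrix schema or real dblist contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiInfo:
    """Hypothetical per-wiki metadata record (illustrative fields only)."""
    dbname: str
    family: str          # e.g. "wikipedia" or "special"
    closed: bool = False


def select_wikis(catalog, families=None, include_closed=False):
    """Return sorted database names matching the requested filters.

    families=None means "any family"; closed wikis are skipped unless
    include_closed is True.
    """
    selected = []
    for wiki in catalog:
        if families is not None and wiki.family not in families:
            continue
        if wiki.closed and not include_closed:
            continue
        selected.append(wiki.dbname)
    return sorted(selected)


# Made-up catalog for illustration; real entries would come from
# whatever metadata source is eventually agreed on.
catalog = [
    WikiInfo("enwiki", "wikipedia"),
    WikiInfo("dewiki", "wikipedia"),
    WikiInfo("wikimania2014wiki", "special"),
    WikiInfo("closedexamplewiki", "wikipedia", closed=True),
]

wikipedias = select_wikis(catalog, families={"wikipedia"})
everything_open = select_wikis(catalog)
```

The point of the design is that nothing is excluded permanently: a request about a closed or special wiki only needs different filter arguments, not a change to replication.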
*This is probably my bad, but I understood the goal to
be having a single db containing unified, core tables. So, we'd have one db, with one
revision table, that'd have an extra column of "wiki" that denoted the
project the entry referred to. This would let us perform global queries without the
complex UNIONs mentioned above. Is this still the goal, or...?
No, that wasn't the goal. Sorry if there was miscommunication. The actual data will
remain in separate wikis using regular replication.
However, it's quite possible to create one or more unified databases with (for
example) SQL VIEWs that union all tables from a set of pre-defined wikis, with
'wiki' columns, just as you describe. Same thing, really. We could even allow
ad-hoc creation of unified views for whatever .dblist is appropriate for the project. I
don't think anything need be ruled out yet -- that's the whole point of SQL,
right? Slow, but flexible. :-)
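
A rough sketch of the unified-view idea, assuming nothing about the eventual implementation: the Python below uses the standard `sqlite3` module with attached in-memory databases as a stand-in for the per-wiki databases on the replicas. The helper `unified_view_sql`, the view name `all_revision`, and the two-column `revision` table are illustrative assumptions; the SQL it generates is a plain `UNION ALL` with a literal `wiki` column.

```python
import sqlite3


def unified_view_sql(view_name, table, columns, wikis):
    """Build a CREATE VIEW statement that UNION ALLs `table` across `wikis`,
    adding a literal 'wiki' column naming the source database.

    Names are interpolated into SQL, so they are checked to be plain
    identifiers; they should only ever come from a trusted list.
    """
    for name in [view_name, table, *columns, *wikis]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    col_list = ", ".join(columns)
    selects = [
        f"SELECT '{wiki}' AS wiki, {col_list} FROM {wiki}.{table}"
        for wiki in wikis
    ]
    # TEMP is an SQLite detail: only temp views may span attached databases.
    return f"CREATE TEMP VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)


# Autocommit mode, because SQLite refuses ATTACH inside a transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
wikis = ["enwiki", "dewiki"]  # stand-in for the contents of a .dblist
for wiki in wikis:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {wiki}")
    conn.execute(f"CREATE TABLE {wiki}.revision (rev_id INTEGER, rev_page INTEGER)")

# Made-up rows, just enough to show the view working.
conn.execute("INSERT INTO enwiki.revision VALUES (1, 10), (2, 11)")
conn.execute("INSERT INTO dewiki.revision VALUES (1, 20)")

sql = unified_view_sql("all_revision", "revision", ["rev_id", "rev_page"], wikis)
conn.execute(sql)

# A "global" query that needs no hand-written UNION.
rows = conn.execute(
    "SELECT wiki, COUNT(*) FROM all_revision GROUP BY wiki ORDER BY wiki"
).fetchall()
print(rows)
```

Generating the statement from a list of wikis is what would make ad-hoc views per .dblist cheap to create; on MySQL the same `SELECT ... UNION ALL ...` body would go into an ordinary `CREATE VIEW`.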
That would work. Oliver is right that creating views for core tables in pre-defined wikis
(say, all Wikipedias) would be valuable. Sean, how about we create a page on wikitech with
requirements for these views and we take it from there?
Dario