On Wed, Apr 30, 2014 at 12:44 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
Okay, so, have tested (to a limited degree. The work I'm doing that
involves the dbs involves eventlogging, so this is mostly me making up
excuses to run queries). Thoughts:
*We should probably put in some kind of restrictions around what we care
about. For example, I see the tables relating to the Wikimania and Arbcom
wikis in there. This is not data I think we're ever going to care about,
but it is *data*, which means we'll either have to write really complex
UNIONs to gather global data, with a constantly-maintained list of
dbs-we-don't-care-about, or accept inaccuracies in our data. My suggestion
would be for these dbs to be removed and excluded from replication, using
the noc dblists to identify the ones we don't care about; generally
"deleted", "closed", "special", "wikimedia" wikis aren't things we want to
be running queries over.
If there are wikis you guys know for sure nobody using the 'research' user will
ever want, then they can simply be hidden by modifying the account grants.
*This is probably my bad, but I understood the goal to be having a single
db containing unified, core tables. So, we'd have one db, with one
revision table, that'd have an extra column of "wiki" that denoted the
project the entry referred to. This would let us perform global queries
without the complex UNIONs mentioned above. Is this still the goal, or...?
No, that wasn't the goal. Sorry if there was miscommunication. The actual
data will remain in separate wikis using regular replication.
However, it's quite possible to create one or more unified databases with
(for example) SQL VIEWs that union all tables from a set of pre-defined
wikis, with 'wiki' columns, just as you describe. Same thing, really. We
could even allow ad-hoc creation of unified views for whatever .dblist is
appropriate for the project. I don't think anything need be ruled out yet
-- that's the whole point of SQL, right? Slow, but flexible. :-)
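
A rough sketch of how such a unified view could be generated from a list of
wikis. This is only an illustration, not what is deployed: the helper names
(parse_dblist, build_unified_view_sql), the view name unified_revision, and
the assumption that a dblist is one database name per line with optional '#'
comments are all hypothetical. The output is MySQL-style SQL text that adds a
'wiki' column to each SELECT and joins them with UNION ALL.

```python
import re

# Conservative identifier check so names can be interpolated into SQL text.
_IDENT = re.compile(r"[A-Za-z0-9_]+")


def parse_dblist(text):
    """Return database names from dblist-style text, one name per line.

    Blank lines and lines starting with '#' are skipped. That format is an
    assumption made for this sketch.
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def build_unified_view_sql(wikis, table, columns, view_name):
    """Build a CREATE VIEW statement that unions `table` across `wikis`.

    Each branch selects the wiki's database name as a literal 'wiki' column,
    so rows stay attributable to their project.
    """
    if not wikis:
        raise ValueError("need at least one wiki")
    for ident in [*wikis, table, *columns, view_name]:
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    col_list = ", ".join(columns)
    selects = [
        f"SELECT '{wiki}' AS wiki, {col_list} FROM {wiki}.{table}"
        for wiki in wikis
    ]
    return f"CREATE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)


# Example: two wikis, two columns of the revision table.
example_wikis = parse_dblist("# hypothetical dblist\nenwiki\ndewiki\n")
example_sql = build_unified_view_sql(
    example_wikis, "revision", ["rev_id", "rev_timestamp"], "unified_revision"
)
```

With a view like that in place, a global query becomes a plain SELECT with a
GROUP BY on wiki, at the cost of scanning every underlying table.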
Sean
--
DBA @ WMF