On Wed, Apr 30, 2014 at 12:44 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
> Okay, so, I have tested this (to a limited degree; the work I'm doing that involves the dbs is on EventLogging, so this is mostly me making up excuses to run queries). Thoughts:
> * We should probably put in some kind of restriction around what we care about. For example, I see the tables relating to the Wikimania and Arbcom wikis in there. This is not data I think we're ever going to care about, but it is *data*, which means we'll either have to write really complex UNIONs to gather global data, with a constantly-maintained list of dbs-we-don't-care-about, or accept inaccuracies in our data. My suggestion would be for these dbs to be removed and excluded from replication, using the noc dblists to identify the ones we don't care about; generally "deleted", "closed", "special" and "wikimedia" wikis aren't things we want to be running queries over.
If there are wikis you guys know for sure nobody using the 'research' user will ever want, then they can simply be hidden by modifying the account grants.
> * This is probably my bad, but I understood the goal to be having a single db containing unified, core tables. So, we'd have one db, with one revision table, that'd have an extra column of "wiki" that denoted the project the entry referred to. This would let us perform global queries without the complex UNIONs mentioned above. Is this still the goal, or...?
No, that wasn't the goal. Sorry if there was miscommunication. The actual data will remain in separate wikis using regular replication.
However, it's quite possible to create one or more unified databases with (for example) SQL VIEWs that union all tables from a set of pre-defined wikis, with 'wiki' columns, just as you describe. Same thing, really. We could even allow ad-hoc creation of unified views for whatever .dblist is appropriate for the project. I don't think anything need be ruled out yet -- that's the whole point of SQL, right? Slow, but flexible. :-)
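To make the view idea concrete, here is a rough sketch of a unified view with a 'wiki' column. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with attached in-memory databases standing in for the per-wiki MySQL databases, the helper name create_unified_view and the view name unified_revision are made up for the example, and the revision table is cut down to two columns. On the real replicas the equivalent would be a MySQL CREATE VIEW over `enwiki.revision`, `dewiki.revision` and so on.

```python
import sqlite3


def create_unified_view(conn, wikis, table, columns):
    """Create a TEMP view that UNION ALLs `table` from every wiki in `wikis`,
    adding a leading 'wiki' column that names the source database.

    Hypothetical helper for illustration only; not part of any existing tool.
    """
    # Names are interpolated into SQL, so only accept plain identifiers.
    for name in list(wikis) + [table] + list(columns):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    col_list = ", ".join(columns)
    selects = [
        f"SELECT '{wiki}' AS wiki, {col_list} FROM {wiki}.{table}"
        for wiki in wikis
    ]
    # SQLite only lets TEMP views span attached databases.
    conn.execute(
        f"CREATE TEMP VIEW unified_{table} AS " + " UNION ALL ".join(selects)
    )


# Autocommit mode, so ATTACH never runs inside an implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Stand-in for a .dblist: one database name per wiki we care about.
wikis = ["enwiki", "dewiki"]
for wiki in wikis:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {wiki}")
    conn.execute(
        f"CREATE TABLE {wiki}.revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)"
    )

conn.executemany("INSERT INTO enwiki.revision VALUES (?, ?)", [(1, 10), (2, 11)])
conn.execute("INSERT INTO dewiki.revision VALUES (1, 20)")

create_unified_view(conn, wikis, "revision", ["rev_id", "rev_page"])

# A "global" query no longer needs a hand-written UNION.
rows = conn.execute(
    "SELECT wiki, COUNT(*) FROM unified_revision GROUP BY wiki ORDER BY wiki"
).fetchall()
print(rows)
```

Since the list of wikis is just an input, the same helper could be pointed at whatever .dblist suits a given project, which is the ad-hoc creation mentioned above.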
Sean