On Apr 30, 2014, at 8:40 AM, Sean Pringle springle@wikimedia.org wrote:
On Wed, Apr 30, 2014 at 12:44 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Okay, so, have tested (to a limited degree. The work I'm doing that involves the dbs involves eventlogging, so this is mostly me making up excuses to run queries). Thoughts:
* We should probably put in some kind of restrictions around what we care about. For example, I see the tables relating to the Wikimania and Arbcom wikis in there. This is not data I think we're ever going to care about, but it is data, which means we'll either have to write really complex UNIONs to gather global data, with a constantly-maintained list of dbs-we-don't-care-about, or accept inaccuracies in our data. My suggestion would be for these dbs to be removed and excluded from replication, using the noc dblists to identify the ones we don't care about; generally "deleted", "closed", "special", "wikimedia" wikis aren't things we want to be running queries over.
If there are wikis you guys know for sure nobody using the 'research' user will ever want, then they can simply be hidden by modifying the account grants.
Oliver, I am not sure how we define “data we’re [n]ever going to care about”. I do expect we will receive occasional requests for data related to closed or special wikis (see https://office.wikimedia.org/wiki/File:Officewiki_ae.png just to mention a recent example).
The point about global queries is well taken, but I think it should be handled differently (see below). Since we’re not talking about privacy here (uncensored data can be obtained by anyone with access to the production DBs), but usability, I’d avoid making assumptions about which wikis should *always* be excluded. We should have an equivalent of the API’s sitematrix with project metadata to allow flexible filtering.
* This is probably my bad, but I understood the goal to be having a single db containing unified, core tables. So, we'd have one db, with one revision table, that'd have an extra column of "wiki" that denoted the project the entry referred to. This would let us perform global queries without the complex UNIONs mentioned above. Is this still the goal, or...?
No, that wasn't the goal. Sorry if there was miscommunication. The actual data will remain in separate wikis using regular replication.
However, it's quite possible to create one or more unified databases with (for example) SQL VIEWs that union all tables from a set of pre-defined wikis, with 'wiki' columns, just as you describe. Same thing, really. We could even allow ad-hoc creation of unified views for whatever .dblist is appropriate for the project. I don't think anything need be ruled out yet -- that's the whole point of SQL, right? Slow, but flexible. :-)
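A minimal sketch of the unified-view idea, in Python, for illustration only. It assumes a .dblist is plain text with one database name per line (with `#` comment lines), and that the unified views live in a database called `unified`; the function names, the `unified` database, and the column list are all hypothetical, not anything that exists on the servers. It only builds the SQL text and does not connect to a database.

```python
import re

# Conservative check so that names taken from a dblist cannot inject SQL.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def read_dblist(text, exclude=()):
    """Parse dblist-style text (assumed: one db name per line, '#' comments)."""
    names = []
    for line in text.splitlines():
        name = line.strip()
        if not name or name.startswith("#") or name in exclude:
            continue
        names.append(name)
    return names


def build_unified_view(table, wikis, columns, view_db="unified"):
    """Return a CREATE VIEW statement that unions `table` across `wikis`,
    adding a literal 'wiki' column that records the source database."""
    if not wikis:
        raise ValueError("need at least one wiki")
    for ident in [table, view_db, *wikis, *columns]:
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    col_list = ", ".join(f"`{c}`" for c in columns)
    selects = [
        f"SELECT '{w}' AS wiki, {col_list} FROM `{w}`.`{table}`"
        for w in wikis
    ]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW `{view_db}`.`{table}` AS\n{body};"


if __name__ == "__main__":
    dblist = "enwiki\n# private wiki, skipped below\narbcom_enwiki\ndewiki\n"
    wikis = read_dblist(dblist, exclude={"arbcom_enwiki"})
    print(build_unified_view("revision", wikis, ["rev_id", "rev_page"]))
```

Because the view is just a UNION ALL over the per-wiki tables, a query against it still touches every underlying table, which is the "slow, but flexible" trade-off described above. Swapping in a different dblist or exclusion set produces a different view without changing the replicated data.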
That would work. Oliver is right that creating views for core tables in pre-defined wikis (say, all Wikipedias) would be valuable. Sean, how about we create a page on wikitech with requirements for these views and we take it from there?
Dario