Right now, I am experimenting with importing revision history from XML dumps into an easier-to-use format, Avro. This new format requires a schema definition. We are weighing the pros and cons of sticking close to older schemas versus creating new, cleaner ones. For the most part these are just discussions around field names, but there are also times when flattening fields makes more sense (e.g. redirect_title vs redirect.title, since <redirect title="blah"/> is how the field looks in XML). Data structure changes aren't out of the question.
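To make the flattening question concrete, here is a minimal sketch of the two layouts written as Avro schema fragments in plain Python dicts. The record name `DumpRecord`, the `id` field, and the `flatten` helper are all made up for illustration; only redirect_title, redirect.title and <redirect title="blah"/> come from the discussion above. It is not the schema we have settled on.

```python
import json

# Hypothetical nested layout: mirrors <redirect title="blah"/> from the XML.
NESTED_SCHEMA = {
    "type": "record",
    "name": "DumpRecord",  # assumed name, for illustration only
    "fields": [
        {"name": "id", "type": "long"},
        {
            "name": "redirect",
            "type": ["null", {
                "type": "record",
                "name": "Redirect",
                "fields": [{"name": "title", "type": "string"}],
            }],
            "default": None,
        },
    ],
}

# Hypothetical flat layout: the same information as one top-level field.
FLAT_SCHEMA = {
    "type": "record",
    "name": "DumpRecord",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "redirect_title", "type": ["null", "string"], "default": None},
    ],
}


def flatten(record, sep="_"):
    """Collapse nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in flatten(value, sep).items():
                flat[key + sep + sub_key] = sub_value
        else:
            flat[key] = value
    return flat


nested_row = {"id": 42, "redirect": {"title": "blah"}}
flat_row = flatten(nested_row)
print(json.dumps(flat_row))
```

Note that a missing redirect (None) stays under the key `redirect` in this sketch, so a real job would still need to decide how absent sub-records map onto the flat field.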
There isn’t a card, because on my end this is still experimentation. I’m trying to come up with something that Aaron can use easily, so my stuff has to work with his code. Hence the collaboration.
But! If we settle on this, then I will create cards for productionizing xmldump -> avro jobs. Those will certainly cover this issue.
Also: YEAH FOR GOOD NAMING! GO DAN! Don’t listen to those bikeshedhaters!
-Ao
On Dec 11, 2014, at 17:23, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Good question. I don't know if there is a desired outcome of this conversation. My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.
I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.
-Aaron
On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman <ggellerman@wikimedia.org> wrote:
I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new. Could someone help summarize the context and what we are trying to solve?
Also, would this go into Research, Eng or Refinery backlog?
Thanks!
On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).
So, I'm not sure that is necessary for the term "identifier", which I assume that "id" abbreviates. Regardless, it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases&format=jsonfm
"1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
},
Fair enough, namespace_id seems like a good name for a property of a page entity then.
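As a rough sketch of what that could look like, the snippet below reads the quoted siteinfo entry and keys everything by the numeric id, keeping the local and canonical names as plain attributes. The output field names `local_name` and `canonical_name` and the helper `namespaces_by_id` are assumptions for illustration, not an agreed schema; it only parses the quoted JSON and makes no API call.

```python
import json

# The namespace entry quoted above, wrapped in braces so it parses on its own.
SITEINFO_SNIPPET = r"""
{
  "1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
  }
}
"""


def namespaces_by_id(namespaces):
    """Index namespace entries by numeric id; names become attributes."""
    result = {}
    for entry in namespaces.values():
        result[entry["id"]] = {
            "namespace_id": entry["id"],
            "local_name": entry["*"],  # localized name on this wiki
            "canonical_name": entry.get("canonical", ""),  # may be absent
        }
    return result


namespaces = namespaces_by_id(json.loads(SITEINFO_SNIPPET))
print(namespaces[1])
```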
I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in json. If I was new to this world, it also seems more surprising. If I was an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".
However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. That said, I'm skeptical that we'll ever be able to change any production DB field names.
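A mapping like that could start out as small as a lookup table applied when rows are exported. In the sketch below the left-hand names are existing MediaWiki column names, while the right-hand "canonical" names and the `to_canonical` helper are purely hypothetical placeholders, not names anyone has agreed on.

```python
# Legacy production column name -> hypothetical canonical name.
LEGACY_TO_CANONICAL = {
    "page_namespace": "namespace_id",
    "page_title": "title",
    "rev_id": "revision_id",
}


def to_canonical(row, mapping=LEGACY_TO_CANONICAL):
    """Rename known legacy keys; keys without a mapping pass through."""
    renamed = {}
    for name, value in row.items():
        new_name = mapping.get(name, name)
        if new_name in renamed:
            # Two source columns landing on one name would silently lose data.
            raise ValueError("duplicate canonical name: " + new_name)
        renamed[new_name] = value
    return renamed


legacy_row = {"page_namespace": 1, "page_title": "Example", "rev_id": 7}
canonical_row = to_canonical(legacy_row)
print(canonical_row)
```

Keeping the table in one place means the production names never have to change; only the exported data uses the new ones.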
We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics