I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new. Could someone help summarize the context and what we are trying to solve?
Also, would this go into Research, Eng or Refinery backlog?
Thanks!
On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
> Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).
> So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
>
>     "1": {
>         "id": 1,
>         "case": "first-letter",
>         "*": "Discusi\u00f3n",
>         "subpages": "",
>         "canonical": "Talk"
>     },
Fair enough, namespace_id seems like a good name for a property of a page entity then.
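To make the "one id, many names" point concrete, here is a minimal sketch that indexes every name of a namespace by its numeric id. It assumes the siteinfo response has the shape shown in the quoted snippet; the function name `names_to_namespace_id` and the `SAMPLE` constant are made up for illustration and are not part of any MediaWiki library.

```python
import json

# Sample entry copied from the siteinfo snippet quoted in this thread
# (es.wikipedia). A raw string keeps the \u00f3 escape for the JSON parser.
SAMPLE = r'''
{
  "1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
  }
}
'''


def names_to_namespace_id(namespaces):
    """Map every known name (local and canonical) to its numeric namespace id.

    Hypothetical helper: assumes each entry carries "id", and optionally
    "*" (local name) and "canonical".
    """
    index = {}
    for entry in namespaces.values():
        for key in ("*", "canonical"):
            name = entry.get(key)
            if name:  # skip missing or empty names
                index[name] = entry["id"]
    return index


namespaces = json.loads(SAMPLE)
index = names_to_namespace_id(namespaces)
```

Both the localized name and the canonical name resolve to the same id here, which is the argument for treating the number as the stable identifier of the namespace.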
> I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in JSON. If I were new to this world, it would also seem more surprising. If I were an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".
> However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.
We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
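The "mapping of canonical names" idea raised above could be sketched roughly as follows. The left-hand keys are existing MediaWiki `page` table columns; the right-hand names are purely hypothetical placeholders (only `namespace_id` was floated in this thread), and `to_canonical` is an illustrative helper, not an existing tool.

```python
# Hypothetical mapping from legacy production column names to canonical names.
# The canonical names are illustrative only; nothing here was agreed on the list.
CANONICAL_NAMES = {
    "page_id": "id",
    "page_namespace": "namespace_id",
    "page_title": "title",
}


def to_canonical(row, mapping=CANONICAL_NAMES):
    """Return a copy of row with legacy keys renamed.

    Keys without an entry in the mapping pass through unchanged, so the
    production schema itself never has to be altered.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


legacy_row = {
    "page_id": 42,
    "page_namespace": 1,
    "page_title": "Example",
    "page_len": 1234,
}
canonical_row = to_canonical(legacy_row)
```

Keeping the mapping in one place means the production field names stay as they are while the warehouse and JSON schemas expose the cleaner names.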
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics