Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?!  :D :D


On Dec 10, 2014, at 16:07, Dan Andreescu <dandreescu@wikimedia.org> wrote:

I think naming things like projects and repositories and folders can be tricky.  I don't think naming schema fields should be very tricky.  Problems with names in schemas usually reflect limitations of the technologies involved.  From your example:

The database has page.page_namespace.  This is mostly for clarity in SQL statements.  The name of the table is duplicated in the name of the field so you can tell which table a field comes from across joins and complicated subqueries.
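To make the join argument concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two toy tables carry only a handful of MediaWiki-style columns and the rows are made up for illustration; this is not the real schema.

```python
import sqlite3

# Toy in-memory tables; only a few MediaWiki-style columns are sketched here,
# and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
    CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER);
    INSERT INTO page VALUES (1, 0, 'Example'), (2, 1, 'Example');
    INSERT INTO revision VALUES (10, 1), (11, 1), (12, 2);
""")

# No table aliases are needed: each column prefix already says which table it belongs to.
rows = conn.execute("""
    SELECT rev_id, page_namespace, page_title
    FROM revision
    JOIN page ON rev_page = page_id
    ORDER BY rev_id
""").fetchall()

print(rows)
```

The prefixes do the disambiguation that a nested document gets for free from its structure, which is why the same convention is less useful outside SQL.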

JavaScript has wgNamespaceNumber.  Looks like a convention dictated this, but luckily it's fairly isolated from research work, so we can ignore it.

XML has <page><ns>.  This is the closest to free of idiosyncrasy, but ns should really be namespace; it was probably shortened to conserve space in dumps (which can get large).

Finally, we're considering page_namespace_id.  I disagree, and I can make an objective argument.  We're going to use a JSON object to represent this data.  It should therefore be:

{ page: { namespace: 0 } }

There is no namespace table, and so the namespace is not an id.  It's a number that means different things based on configuration in different wikis.  If we decide to make a namespace entity with (wiki, number, description) properties, then it would be ok to have:

{ page: { namespace_id: 0 } }
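As a rough sketch of the distinction, assuming a hypothetical namespace entity keyed by (wiki, number), the nested document and the lookup might look like this in Python. The wiki names, the describe_namespace helper, and the descriptions are made up for illustration.

```python
import json

# Hypothetical record: the nesting supplies the "page" context,
# so the field itself can simply be called "namespace".
doc = {"page": {"namespace": 0}}

# Hypothetical namespace entities keyed by (wiki, number).
# The descriptions are illustrative, not pulled from any real configuration.
namespaces = {
    ("enwiki", 0): "Main",
    ("enwiki", 1): "Talk",
}

def describe_namespace(wiki, document):
    """Look up what the page's namespace number means on a given wiki."""
    number = document["page"]["namespace"]
    return namespaces.get((wiki, number), "unknown")

print(json.dumps(doc))
print(describe_namespace("enwiki", doc))
```

Only once something like the namespaces mapping exists does the number act as a key into another entity, which is the case where an _id suffix would earn its place.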


As a side note, naming matters for our data warehouse as well.  I say we don't limit ourselves with tool idiosyncrasies.  Instead, let's come up with names that make sense.  Veteran researchers can rid themselves of the pain of old names, but new researchers shouldn't have to deal with legacy naming.  And hopefully for the veterans out there, the structure of the JSON document is enough to make up for the new approach.

On Wed, Dec 10, 2014 at 1:22 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Hey folks,

I was talking to ottomata today about developing a schema for processing revisions in Hadoop.  We came across a deep problem with field names that I'd like to discuss because I want people to be aware of the problem.  

To explain this, I'll use an example.  Let's say you want to get the namespace of this page:

In JavaScript, this is represented as the variable wgNamespaceNumber.

In the database, this is represented as page.page_namespace

In the XML database dump, this is represented as the value at <page><ns> or <namespaces><namespace.key> depending where you are.
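For the XML case, a minimal Python sketch of reading both locations might look like the following. The fragment is hand-written and heavily simplified (real dumps declare an XML namespace and carry many more elements), so treat the element layout as an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment; not copied from a real dump.
fragment = """
<mediawiki>
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Example</title>
    <ns>0</ns>
  </page>
</mediawiki>
""".strip()

root = ET.fromstring(fragment)

# The page's own namespace number lives at <page><ns>.
page_ns = int(root.find("page/ns").text)

# The namespaces a wiki declares are listed under <namespaces>, keyed by the "key" attribute.
declared = [int(n.get("key")) for n in root.findall("siteinfo/namespaces/namespace")]

print(page_ns, declared)
```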

Right now, ottomata and I are considering the more descriptive name page_namespace_id, since the value of all of these variables/fields is an identifier, not a name.  I think that this is a *good* name if we consider it in a vacuum, but if we choose it, we'll add yet another name for wiki devs & analysts to be aware of.

Given the context of this decision, my instinct is to choose the least surprising name.  Since I mostly work with the database, that would mean I'd choose page_namespace.

This is just one example of such nonsense.  The decisions we make in formats that we produce now can have immeasurable effects on the sanity of others.  I hope that the decisions we make today will minimize such pain, but it's hard to know for sure.  

-Aaron

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

