The state of field names in MediaWiki data

Aaron Halfaker

10 Dec 2014

7:22 p.m.

Hey folks,

I was talking to ottomata today about developing a schema for processing revisions in Hadoop. We came across a deep problem with field names that I'd like to discuss because I want people to be aware of the problem.

To explain this, I'll use an example. Let's say you want to get the namespace of this page: https://en.wikipedia.org/wiki/Biology

In javascript, this is represented as the variable *wgNamespaceNumber*.

In the database, this is represented as *page.page_namespace*

In the XML database dump, this is represented as the value at *<page><ns>* or *<namespaces><namespace.key>*, depending on where you are.

Right now, ottomata and I are considering the more descriptive name *page_namespace_id* since the value of all of these variables/fields is an identifier -- not a name. I think that this is a *good* name if we consider it in a vacuum, but if we choose it, we'll add yet another name for wiki devs & analysts to be aware of.

Given the context of this decision, my instinct is to choose the least surprising name. Since I mostly work with the database, that would mean I'd choose *page_namespace*.

This is just one example of such nonsense. The decisions we make in formats that we produce now can have immeasurable effects on the sanity of others. I hope that the decisions we make today will minimize such pain, but it's hard to know for sure.

-Aaron
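
To make the situation above concrete, here is a minimal Python sketch that gathers the names from this message into one lookup table. The dictionary `NAMESPACE_FIELD_NAMES`, its keys, and the helper `name_for` are hypothetical and exist only for illustration; the field names themselves are the ones listed in the message.

```python
# Hypothetical lookup table: one concept (a page's namespace number)
# under the name each data source in the message uses for it.
NAMESPACE_FIELD_NAMES = {
    "javascript": "wgNamespaceNumber",
    "database": "page.page_namespace",
    "xml_dump": "<page><ns>",
    "proposed_schema": "page_namespace_id",
}


def name_for(source: str) -> str:
    """Return the name a given source uses for the page namespace."""
    try:
        return NAMESPACE_FIELD_NAMES[source]
    except KeyError:
        raise ValueError(f"unknown source: {source!r}") from None


# Print every source next to the name it uses.
for source, field in NAMESPACE_FIELD_NAMES.items():
    print(f"{source:>16}: {field}")
```

Each new format adds one more row to a table like this, which is the cost being weighed against a more descriptive name.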

Dan Andreescu

10 Dec

10:07 p.m.

I think naming things like projects and repositories and folders can be tricky. I don't think naming schema fields should be very tricky. Problems with names in schemas usually reflect limitations of the technologies involved. From your example:

database has *page.page_namespace*. This is mostly for clarity in SQL statements. The name of the table is duplicated in the name of the field so you can make sense of fields across joins and complicated subqueries.

javascript has *wgNamespaceNumber*. Looks like a convention dictated this, but luckily it's fairly isolated from research work so we can ignore such things.

XML has *<page><ns>*. This is the closest to free of idiosyncrasy, but ns should be namespace and it probably isn't to conserve space in dumps (which can get large)

Finally we're considering page_namespace_id. I disagree and I can make an objective argument. We're going to use a json object to represent this data. It should therefore be:

{ page: { namespace: 0 } }

There is no namespace table, and so the namespace is not an id. It's a number that means different things based on configuration in different wikis. If we decide to make a namespace entity with (wiki, number, description) properties, then it would be ok to have:

{ page: { namespace_id: 0 } }

As a side note, naming matters for our data warehouse as well. I say we don't limit ourselves with tool idiosyncrasies. Instead, let's come up with names that make sense. Veteran researchers can rid themselves of the pain of old names, but new researchers shouldn't have to deal with legacy naming. And hopefully for the veterans out there, the structure of the json document is enough to make up for the new approach.
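
As a rough illustration of the difference between the database convention and the nested JSON shape proposed above, the following Python sketch turns a row with table-prefixed column names into a nested object. The helper `nest_row` is hypothetical, and the `page_title` column is included only as an assumed extra field for the example; neither comes from this thread.

```python
def nest_row(table: str, row: dict) -> dict:
    """Strip the '<table>_' prefix from each column and nest under the table name."""
    prefix = table + "_"
    fields = {}
    for column, value in row.items():
        # Columns without the expected prefix keep their original name.
        key = column[len(prefix):] if column.startswith(prefix) else column
        fields[key] = value
    return {table: fields}


# A row as it might come out of a SQL query against the page table.
db_row = {"page_namespace": 0, "page_title": "Biology"}

print(nest_row("page", db_row))
```

In the nested form the enclosing object supplies the context that the `page_` prefix supplies in SQL, so the prefix becomes redundant.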

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Andrew Otto

10:17 p.m.

Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?! :D :D

Dan Andreescu

10:20 p.m.

> Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?! :D :D

Are you implying I'm too verbose? If so - you're right. And I like how you put it. Yes. Just because many people have tried it different ways doesn't mean they had the liberty to think of good, clear names that make researchers happy. But that's exactly what our mission is here - so let's make researchers happy.

Andrew Otto

10:37 p.m.

Oh I would never imply TOO verbose. I am a verbose kinda guy!

Aaron Halfaker

11 Dec

8 p.m.

Woo! Bike sheds. So.

> There is no namespace table, and so the namespace is not an id.

So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases&format=jsonfm

"1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
},

Yay more names!

> Veteran researchers can rid themselves of the pain of old names, but new researchers shouldn't have to deal with legacy naming.

I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.

However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.

-Aaron
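
The point about one identifier carrying several names can be sketched in Python using only the snippet quoted in this message. The helper `names_by_id` and its output keys (`local`, `canonical`) are hypothetical; the snippet is wrapped in an outer pair of braces here only so that it parses as a standalone JSON document.

```python
import json

# The snippet from the message, wrapped in braces so it is valid JSON on its own.
SNIPPET = r'''
{
  "1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
  }
}
'''


def names_by_id(namespaces: dict) -> dict:
    """Map each namespace id to its local and canonical names."""
    result = {}
    for entry in namespaces.values():
        result[entry["id"]] = {
            "local": entry["*"],
            # Use .get() in case an entry lacks a canonical name.
            "canonical": entry.get("canonical"),
        }
    return result


print(names_by_id(json.loads(SNIPPET)))
```

The number stays fixed while the human-readable names vary by wiki, which is the sense in which it behaves like an identifier.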

Toby Negrin

10:04 p.m.

Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.

Dan Andreescu

10:52 p.m.

> Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.

I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).

> So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
>
> "1": {
>     "id": 1,
>     "case": "first-letter",
>     "*": "Discusi\u00f3n",
>     "subpages": "",
>     "canonical": "Talk"
> },

Fair enough, namespace_id seems like a good name for a property of a page entity then.

> I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.

I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in json. If I was new to this world, it also seems more surprising. If I was an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".

> However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.

We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.

Grace Gellerman

11:13 p.m.

I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new. Could someone help summarize the context and what we are trying to solve?

Also, would this go into Research, Eng or Refinery backlog?

Thanks!

Aaron Halfaker

11:23 p.m.

Good question. I don't know if there is a desired outcome of this conversation. My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.

I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.

-Aaron

Andrew Otto

11:48 p.m.

Right now, I am working on experimenting with importing Revision history from XML dumps into an easier to use format, Avro. This new format requires a schema definition. We are considering the pros and cons of sticking close to older schemas, or creating new cleaner ones. For the most part these are just discussions around field names, but there are also times when flattening fields makes more sense (e.g. redirect_title vs redirect.title, since <redirect title="blah"/> is how the field looks in XML). Data structure changes aren’t out of the question.

There isn’t a card, because on my end this is still experimentation. I’m trying to come up with something that Aaron can use easily, so my stuff has to work with his code. Hence the collaboration.

But! If we settle on this, then I will create cards for productionizing xmldump -> avro jobs. Those will certainly cover this issue.

Also: YEAH FOR GOOD NAMING! GO DAN! Don’t listen to those bikeshedhaters!

-Ao
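
The flattening question above (redirect_title vs redirect.title) can be sketched in Python with only the standard library. Everything here is an assumption made for illustration: the one-element `PAGE_XML` fragment, the helpers `redirect_nested` and `redirect_flat`, and the record name `Page` in the schema. The schema is written as a plain dict in Avro's JSON schema notation and is not the schema actually used in this work.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical fragment; only <redirect title="blah"/> comes from the thread.
PAGE_XML = '<page><redirect title="blah"/></page>'


def redirect_nested(page: ET.Element) -> dict:
    """Mirror the XML structure: redirect.title."""
    node = page.find("redirect")
    return {"redirect": None if node is None else {"title": node.get("title")}}


def redirect_flat(page: ET.Element) -> dict:
    """Flatten the attribute into a single field: redirect_title."""
    node = page.find("redirect")
    return {"redirect_title": None if node is None else node.get("title")}


# Avro schema for the flattened variant, as a plain dict.
# A union with "null" listed first lets the default be null for non-redirects.
FLAT_SCHEMA = {
    "type": "record",
    "name": "Page",
    "fields": [
        {"name": "redirect_title", "type": ["null", "string"], "default": None},
    ],
}

page = ET.fromstring(PAGE_XML)
print(redirect_nested(page))
print(redirect_flat(page))
print(json.dumps(FLAT_SCHEMA, indent=2))
```

The flat variant needs one nullable string field, while the nested variant would need a nullable sub-record holding a single field, which is one reason flattening can be attractive here.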

Grace Gellerman

12 Dec

6:22 p.m.

I appreciate Dan's passion and tenacity. I barely know what y'all are talking about, but I can tell that I support his commitment to good naming. Thanks, Dan!

For everything else, I created tickets to capture what you are working on now and what should go in the backlog.

Feel free to correct or wordsmith any errors.

For Aaron:

1. added to in progress on R & D Trello board: https://trello.com/c/Yuki0FBE/574-specifying-a-schema-for-revisions-in-hadoo...

2. added inelegantly worded card to new ideas lane of R&D backlog board: https://trello.com/c/TocTUcD7/206-solve-problems-similiar-to-ones-surfaced-w...

For Andrew:

3. productionizing xmldump -> avro jobs: https://phabricator.wikimedia.org/T78404

4. for experimentation part, I created this and called it out as spike: https://phabricator.wikimedia.org/T78405

On Thu, Dec 11, 2014 at 2:48 PM, Andrew Otto aotto@wikimedia.org wrote:

...

Right now, I am working on experimenting with importing Revision history from XML dumps into an easier to use format, Avro. This new format requires a schema definition. We are considering the pros and cons of sticking close to older schemas, or creating new cleaner ones. For the most part these are just discussions around field names, but there are also times when flattening fields makes more sense (e.g. redirect_title vs redirect.title, since <redirect title=“blah”/> is how the field looks in XML). Data structure changes aren’t out of the question.

There isn’t a card, because on my end this is still experimentation. I’m trying to come up with something that Aaron can use easily, so my stuff has to work with his code. Hence the collaboration.

But! If we settle on this, then I will create cards for productionizing xmldump -> avro jobs. Those will certainly cover this issue.

Also: YEAH FOR GOOD NAMING! GO DAN! Don’t listen to those bikeshedhaters!

-Ao

On Dec 11, 2014, at 17:23, Aaron Halfaker ahalfaker@wikimedia.org wrote:

Good question. I don't know if there is a desired outcome of this conversation. My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.

I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.

-Aaron

On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman ggellerman@wikimedia.org wrote:

...
I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new. Could someone help summarize the context and what we are trying to solve?

Also, would this go into Research, Eng or Refinery backlog?

Thanks!

On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.

I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).

...
So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...

"1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
},
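To make the point about one identifier carrying several names concrete, here is a small Python sketch that indexes the names in a snippet like the one above by numeric id. The helper name namespace_names and the output keys "local" and "canonical" are assumptions for illustration only. The input is just the single entry quoted in this message, not a full API response.

```python
import json

# The single namespace entry quoted above, wrapped in an object so it parses.
siteinfo_namespaces = json.loads(r'''
{"1": {"id": 1, "case": "first-letter", "*": "Discusi\u00f3n",
       "subpages": "", "canonical": "Talk"}}
''')


def namespace_names(namespaces: dict) -> dict:
    """Index the local and canonical names of each namespace by numeric id."""
    return {
        entry["id"]: {
            "local": entry["*"],
            # .get() is used defensively in case an entry lacks a canonical name.
            "canonical": entry.get("canonical"),
        }
        for entry in namespaces.values()
    }


names_by_id = namespace_names(siteinfo_namespaces)
```

Here the stable key is the integer id, while both the local name and the canonical name are attributes looked up from it.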

Fair enough, namespace_id seems like a good name for a property of a page entity then.

...
I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.

I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in json. If I was new to this world, it also seems more surprising. If I was an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".

...
However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.

We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
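One way to picture the "mapping of canonical names" idea from the quoted message is a plain lookup table from each source's existing field name to a single agreed name. The sketch below is only an assumption about what such a table might look like: the source labels, the "page/ns" path notation, the helper canonical_name, and the choice of page_namespace_id as the target are all hypothetical. The three source names themselves come from the first message in this thread.

```python
# Hypothetical table: (source, existing field name) -> canonical name.
CANONICAL_FIELD_NAMES = {
    ("javascript", "wgNamespaceNumber"): "page_namespace_id",
    ("database", "page.page_namespace"): "page_namespace_id",
    ("xml_dump", "page/ns"): "page_namespace_id",
}


def canonical_name(source: str, field: str) -> str:
    """Return the canonical name for a field, or the field itself if unmapped."""
    return CANONICAL_FIELD_NAMES.get((source, field), field)


mapped = canonical_name("database", "page.page_namespace")
unmapped = canonical_name("database", "page.page_title")
```

A table like this leaves the production names alone and only documents how they correspond to the new ones.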

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics


analytics@lists.wikimedia.org

participants (5)

Aaron Halfaker
Andrew Otto
Dan Andreescu
Grace Gellerman
Toby Negrin