Hey folks,
I was talking to ottomata today about developing a schema for processing revisions in Hadoop. We came across a deep problem with field names that I'd like to discuss because I want people to be aware of the problem.
To explain this, I'll use an example. Let's say you want to get the namespace of this page: https://en.wikipedia.org/wiki/Biology
In JavaScript, this is represented as the variable *wgNamespaceNumber*.
In the database, this is represented as *page.page_namespace*.
In the XML database dump, this is represented as the value at *<page><ns>* or *<namespaces><namespace.key>*, depending on where you are.
Right now, ottomata and I are considering the more descriptive name *page_namespace_id*, since the value of all of these variables/fields is an identifier -- not a name. I think that this is a *good* name if we consider it in a vacuum, but if we choose it, we'll add yet another name for wiki devs & analysts to be aware of.
Given the context of this decision, my instinct is to choose the least surprising name. Since I mostly work with the database, that would mean I'd choose *page_namespace*.
This is just one example of such nonsense. The decisions we make in formats that we produce now can have immeasurable effects on the sanity of others. I hope that the decisions we make today will minimize such pain, but it's hard to know for sure.
-Aaron
I think naming things like projects and repositories and folders can be tricky. I don't think naming schema fields should be very tricky. Problems with names in schemas usually reflect limitations of the technologies involved. From your example:
The database has *page.page_namespace*. This is mostly for clarity in SQL statements. The name of the table is duplicated in the name of the field so you can make sense of fields across joins and complicated subqueries.
JavaScript has *wgNamespaceNumber*. Looks like a convention dictated this, but luckily it's fairly isolated from research work so we can ignore such things.
XML has *<page><ns>*. This is the closest to free of idiosyncrasy, but *ns* should be *namespace*; it was probably shortened to conserve space in dumps (which can get large).
Finally, we're considering page_namespace_id. I disagree and I can make an objective argument. We're going to use a JSON object to represent this data. It should therefore be:
{ page: { namespace: 0 } }
There is no namespace table, and so the namespace is not an id. It's a number that means different things based on configuration in different wikis. If we decide to make a namespace entity with (wiki, number, description) properties, then it would be ok to have:
{ page: { namespace_id: 0 } }
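To make the two shapes concrete, here is a minimal Python sketch. The `flatten` helper is hypothetical and not part of any existing schema or tool; it only shows how a nested `page.namespace` property corresponds to a flat, database-style `page_namespace` key.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse a nested dict into a flat dict, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {"page": {"namespace": 0}} becomes "page_namespace".
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Strict JSON needs quoted keys, unlike the shorthand used in this message.
nested = json.loads('{"page": {"namespace": 0}}')
print(flatten(nested))  # {'page_namespace': 0}
```

Seen this way, the nested form carries the same information as the database-style name, with the entity name living in the structure rather than in a prefix.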
As a side note, naming matters for our data warehouse as well. I say we don't limit ourselves with tool idiosyncrasies. Instead, let's come up with names that make sense. Veteran researchers can rid themselves of the pain of old names, but new researchers shouldn't have to deal with legacy naming. And hopefully for the veterans out there, the structure of the JSON document is enough to make up for the new approach.
On Wed, Dec 10, 2014 at 1:22 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?! :D :D
On Dec 10, 2014, at 16:07, Dan Andreescu dandreescu@wikimedia.org wrote:
> Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?! :D :D
Are you implying I'm too verbose? If so - you're right. And I like how you put it. Yes. Just because many people have tried it different ways doesn't mean they had the liberty to think of good, clear names that make researchers happy. But that's exactly what our mission is here - so let's make researchers happy.
Oh I would never imply TOO verbose. I am a verbose kinda guy!
On Dec 10, 2014, at 16:20, Dan Andreescu dandreescu@wikimedia.org wrote:
Woo! Bike sheds. So.
> There is no namespace table, and so the namespace is not an id.
So, I'm not sure a table is necessary for something to count as an "identifier", which I assume "id" abbreviates. Regardless, it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases&format=jsonfm
"1": {
    "id": 1,
    "case": "first-letter",
    "*": "Discusi\u00f3n",
    "subpages": "",
    "canonical": "Talk"
},
Yay more names!
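As a rough illustration of the point, the Python sketch below groups the names the API reports under the numeric id. The JSON is hand-copied from the snippet above (a real response would hold one entry per namespace), and `names_by_id` is a hypothetical helper, not an existing API.

```python
import json

# Hand-copied from the snippet above; wrapped in braces to make it valid JSON.
response_fragment = r'''
{
    "1": {
        "id": 1,
        "case": "first-letter",
        "*": "Discusi\u00f3n",
        "subpages": "",
        "canonical": "Talk"
    }
}
'''


def names_by_id(namespaces):
    """Map each numeric namespace id to the names reported for it."""
    result = {}
    for entry in namespaces.values():
        names = [entry["*"]]  # local name on this wiki
        # Assumption: not every entry necessarily carries a "canonical" key.
        if "canonical" in entry:
            names.append(entry["canonical"])
        result[entry["id"]] = names
    return result


namespaces = json.loads(response_fragment)
print(names_by_id(namespaces))  # {1: ['Discusión', 'Talk']}
```

The number is the one stable handle here; the local and canonical names both hang off it.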
> Veteran researchers can rid themselves of the pain of old names, but new researchers shouldn't have to deal with legacy naming.
I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. Still, I'm skeptical that we'll ever be able to change any production DB field names.
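One way such a mapping could look, as a minimal Python sketch. The legacy names are the ones from this thread; using `page_namespace_id` as the canonical key is only an assumption for illustration, not a decision, and `to_canonical` is a hypothetical helper.

```python
# Canonical field name -> the name each existing source uses for the same value.
# The canonical key below is a placeholder, not an agreed-upon name.
CANONICAL_NAMES = {
    "page_namespace_id": {
        "javascript": "wgNamespaceNumber",
        "database": "page.page_namespace",
        "xml_dump": "<page><ns>",
    },
}


def to_canonical(source, legacy_name):
    """Return the canonical name for a legacy name, or None if it is unmapped."""
    for canonical, aliases in CANONICAL_NAMES.items():
        if aliases.get(source) == legacy_name:
            return canonical
    return None


print(to_canonical("database", "page.page_namespace"))  # page_namespace_id
```

A table like this would let new formats use one name while still documenting where each value lives in the legacy sources.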
-Aaron
On Wed, Dec 10, 2014 at 1:37 PM, Andrew Otto aotto@wikimedia.org wrote:
Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
On Thu, Dec 11, 2014 at 11:00 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
> Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).
> So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query: http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
> "1": { "id": 1, "case": "first-letter", "*": "Discusi\u00f3n", "subpages": "", "canonical": "Talk"
> },
Fair enough, namespace_id seems like a good name for a property of a page entity then.
> I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in JSON. If I were new to this world, database names would seem more surprising too. If I were an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".
> However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.
We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new... Could someone help summarize the context and what we are trying to solve?
Also, would this go into Research, Eng or Refinery backlog?
Thanks!
On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Good question. I don't know if there is a desired outcome of this conversation. My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.
I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.
-Aaron
On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman ggellerman@wikimedia.org wrote:
Right now, I am working on experimenting with importing Revision history from XML dumps into an easier to use format, Avro. This new format requires a schema definition. We are considering the pros and cons of sticking close to older schemas, or creating new cleaner ones. For the most part these are just discussions around field names, but there are also times when flattening fields makes more sense (e.g. redirect_title vs redirect.title, since <redirect title="blah"/> is how the field looks in XML). Data structure changes aren’t out of the question.
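A minimal Python sketch of the flattening case, using only the standard library. The XML fragment is hand-written and trimmed (real dumps have more fields), and the schema, its field names, and `page_to_record` are hypothetical illustrations rather than the schema actually being developed.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; not copied from a real dump.
PAGE_XML = '<page><title>Bio</title><ns>0</ns><redirect title="Biology"/></page>'

# Hypothetical Avro record schema using the flattened redirect_title name.
# A nullable field is a union with "null" listed first and a null default.
SCHEMA = {
    "type": "record",
    "name": "Page",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "namespace", "type": "int"},
        {"name": "redirect_title", "type": ["null", "string"], "default": None},
    ],
}


def page_to_record(xml_text):
    """Turn one <page> element into a flat dict matching SCHEMA's field names."""
    page = ET.fromstring(xml_text)
    redirect = page.find("redirect")
    return {
        "title": page.findtext("title"),
        "namespace": int(page.findtext("ns")),
        # Compare against None: an Element with no children is falsy.
        "redirect_title": redirect.get("title") if redirect is not None else None,
    }


print(page_to_record(PAGE_XML))
# {'title': 'Bio', 'namespace': 0, 'redirect_title': 'Biology'}
```

Flattening keeps the record a single level deep, at the cost of encoding the XML attribute's parent element in the field name.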
There isn’t a card, because on my end this is still experimentation. I’m trying to come up with something that Aaron can use easily, so my stuff has to work with his code. Hence the collaboration.
But! If we settle on this, then I will create cards for productionizing xmldump -> avro jobs. Those will certainly cover this issue.
Also: YEAH FOR GOOD NAMING! GO DAN! Don’t listen to those bikeshedhaters!
-Ao
On Dec 11, 2014, at 17:23, Aaron Halfaker ahalfaker@wikimedia.org wrote:
I appreciate Dan's passion and tenacity. I barely know what y'all are talking about, but I can tell that I support his commitment to good naming. Thanks, Dan!
For everything else, I created tickets to capture what you are working on now and what should go in the backlog.
Feel free to correct or wordsmith any errors.
For Aaron:
1. added to in progress on R & D Trello board:
https://trello.com/c/Yuki0FBE/574-specifying-a-schema-for-revisions-in-hadoo...
2. added inelegantly worded card to new ideas lane of R&D backlog board:
https://trello.com/c/TocTUcD7/206-solve-problems-similiar-to-ones-surfaced-w...
For Andrew:
3. productionizing xmldump -> avro jobs:
https://phabricator.wikimedia.org/T78404
4. for experimentation part, I created this and called it out as spike:
https://phabricator.wikimedia.org/T78405
On Thu, Dec 11, 2014 at 2:48 PM, Andrew Otto aotto@wikimedia.org wrote: