Re: [Wikidata-l] Question about wikipedia categories.


Paul A. Houle

6 May 2013

6:08 p.m.

From my viewpoint, biases are an issue of statistical sampling. Wikipedia is an encyclopedia by humans for humans, so of course it has an anthropocentric background, in which the mass of all the concepts swirling around the Earth like an atmosphere curves the graph, keeping the Sun in orbit around our world.

I find Wikipedia categories useful today, warts and all. They've got two things going for them:

(1) Class and out-of-class dichotomies are the atom of ontology. Well-designed categories have an operational definition that allows class members to be determined with practically perfect precision.

(2) They are densely populated. Look at the categories on this guy's web page

http://en.wikipedia.org/wiki/Arnold_Schwarzenegger

Each one of those categories states a useful and correct fact, even if the organization of those facts is entirely haphazard. For instance, it would be better if he was coded as an "American" and an "Austrian", "Californian", "Los Angelino", and also as a "Bodybuilder" and an "Actor" and a zillion other things, and we could then infer that he was an "American Bodybuilder", "Austrian Actor" and such. But it's not that easy, because he was an "Austrian soldier" but not an "American soldier", and I'd feel uncomfortable calling him an "Austrian Politician". A lot of nuance is encoded in that sticky mess.

It's very easy to analyze those categories and produce desired concepts like "Car" and "Bodybuilder" from junky categories like "Front-wheel drive vehicle", "General Motors Concept Cars", "Bodybuilder Actor" and "Actor Bodybuilder". In fact, that's exactly what the semantic web is for. There is so much rich and precise information in the categories that you get great results despite sampling error caused by low recall in the categories.

I'd love to see better structure, but not at the cost of fact density or precision. If we can take advantage of the knowledge in the graph to exert gentle pressure that improves categorization in Wikipedia, that would be great. It's definitely time for the social industry to move beyond "tags".
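The "Austrian soldier" versus "American soldier" point can be made concrete with a minimal Python sketch. It assumes a toy model in which nationalities and occupations are stored as separate atomic facts and compound categories are plain strings; the sets below are hand-picked for illustration, not real Wikipedia data, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical atomic facts for one person (illustrative, not real data).
nationalities = {"American", "Austrian"}
occupations = {"Bodybuilder", "Actor", "Soldier", "Politician"}

# Compound categories assumed to be asserted by editors (hand-picked subset).
asserted = {
    "American Bodybuilder", "Austrian Bodybuilder",
    "American Actor", "Austrian Actor",
    "Austrian Soldier", "American Politician",
}

def naive_compounds(nats, occs):
    """Combine every nationality with every occupation, with no further checks."""
    return {f"{n} {o}" for n, o in product(nats, occs)}

def unsupported(nats, occs, asserted):
    """Compounds the naive cross product yields that no editor asserted."""
    return sorted(naive_compounds(nats, occs) - asserted)

print(unsupported(nationalities, occupations, asserted))
```

Under these assumptions the cross product produces eight compounds, two of which ("American Soldier" and "Austrian Politician") are not supported by any asserted category. That is the nuance the compound categories carry and the flat atomic facts lose.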


Michael Hale

6 May 2013

6:32 p.m.


I agree they are extremely useful for many scenarios already. Earlier today I sorted the human proteins category by popularity, and by reading the articles for the most popular ones that I didn't know, I felt like I was browsing the table of contents of a live molecular biology book that was more comprehensive than any existing book in print.

I do think we are on track for undeniable improvements though. Arnold Schwarzenegger is in about 40 categories right now. His Wikidata item has about 20 statements. Eventually, at least all of the information you can glean from those categories will be contained in the statements on Wikidata. Then we could update the pages so that the links at the bottom aren't to relevant categories, but to relevant queries.

At first, it would look sort of the same. You can click on the 20th-century American actors category now, and you could click on the 20th-century American actors query in the future. But when you get to the query page, you can easily specialize or generalize the query with another click in many more directions than are currently supported in the category system. Right now, I can specialize the pages I see by going to the subcategory for American silent film actors. I can generalize the pages I see by going to a supercategory that drops the American requirement, the actor requirement, or the 20th-century requirement. But if your first click away from the article takes you not to a category but to a query page, you now have many more options. For example, you could delete the 20th-century requirement and add a politician requirement to the actor requirement. Then you are looking at Americans who are actors and politicians, which you can't do in the category system.
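As a rough illustration of the query idea, here is a minimal Python sketch that treats a query as a set of requirements and an item as a set of statements. The item names, statement labels, and the `run_query` helper are all hypothetical, chosen only for this example. They are not real Wikidata items, properties, or APIs.

```python
# Hypothetical items and their statements (illustrative only).
items = {
    "Person A": {"American", "actor", "20th-century"},
    "Person B": {"American", "actor", "politician", "20th-century"},
    "Person C": {"American", "politician"},
    "Person D": {"Austrian", "actor", "20th-century"},
}

def run_query(requirements, items):
    """Return the names of items whose statements include every requirement."""
    return sorted(name for name, facts in items.items() if requirements <= facts)

# The query that would replace the "20th-century American actors" category link.
query = frozenset({"20th-century", "American", "actor"})

# Generalize by dropping a requirement, specialize by adding one.
edited = (query - {"20th-century"}) | {"politician"}

print(run_query(query, items))   # 20th-century American actors
print(run_query(edited, items))  # Americans who are actors and politicians
```

In this toy model, each edit to the requirement set is one "click" away from the current query, whereas a category tree only offers the subcategories and supercategories someone has already created.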


Chris Maloney

7:21 p.m.


Michael, that's really closely in line with what I was thinking. Why don't you take a crack at improving http://meta.wikimedia.org/wiki/Talk:Beyond_categories? I am not sure if this is just a crazy pipe dream or not, but I can't help but be a little bit excited at the possibility that it might actually get done, and I think it would be a huge improvement.

Michael Hale

8:06 p.m.


I added a section for Wikidata query potential. I'd estimate we've imported about 1/4 to 1/3 of the data we'd need to start getting comparable results for a significant number of categories. I think we should iterate on the query prototype considering scenarios where you link to an existing query and want to modify it.


Mathieu Stumpf

7 May 2013

9:12 a.m.


On 2013-05-06 20:08, Paul A. Houle wrote:

> From my viewpoint, biases are an issue of statistical sampling.

It's not just the sampling that matters; how you process the samples to infer conclusions is just as important. Except in the case of a hypothetical clone army where each clone lived exactly the same experiences in the same order, I doubt humans will ever arrive at the very same biases. Also, even large statistical samples can't protect you from "black swan" occurrences.

> Wikipedia is an encyclopedia by humans for humans, so of course it has an anthropocentric background, in which the mass of all the concepts swirling around the Earth like an atmosphere curves the graph, keeping the Sun in orbit around our world.

I don't think humans can be "objective" in the sense of inferring assumptions from non-subjective, unbiased experiences. That said, it's clear to my mind that we are able to build conceptual models which can more or less successfully make predictive assumptions (provided that we trust our own memory and our ability to compare memory and sensory perceptions). So as I understand it, while we tend to be anthropocentric, this cognitive bias is the first epistemological barrier which, to my mind, must be overcome. Copernicus and Darwin, for example, are two important steps away from anthropocentric suppositions.

> I find Wikipedia categories useful today, warts and all. They've got two things going for them: (1) Class and out-of-class dichotomies are the atom of ontology. Well-designed categories have an operational definition that allows class members to be determined with practically perfect precision

To my mind, there's no such thing as perfection or any absolute phenomenon, or at least there is no way to establish the actual reality of such a thing through our subjective experiences. In my humble opinion, well-designed categories help you find useful information for your current situation.

> (2) They are densely populated. Look at the categories on this guy's web page http://en.wikipedia.org/wiki/Arnold_Schwarzenegger each one of those categories states a useful and correct fact, even if the organization of those facts is entirely haphazard. For instance, it would be better if he was coded as an "American" and an "Austrian", "Californian", "Los Angelino" and he is also a "Bodybuilder" and an "Actor" and a zillion other things and then infer that he was an "American Bodybuilder", "Austrian Actor" and such. But it's not that easy because he was an "Austrian soldier" but not an "American soldier" and I'd feel uncomfortable calling him an "Austrian Politician". A lot of nuance is encoded in that sticky mess.

Why would it be better to code it another way if that leads to a situation where you raise the chances of inferring invalid assertions? Just because a phrase is composed of several words doesn't mean you can get the same meaning from the separate words, just as you can't take each letter of a word on its own. Now, you will probably want the "Austrian Politician" to be categorized in the "Politician" category. What's wrong with that?

> It's very easy to analyze those categories and produce desired concepts like "Car" and "Bodybuilder" from junky categories like "Front-wheel drive vehicle," "General Motors Concept Cars", "Bodybuilder Actor" and "Actor Bodybuilder", in fact, that's exactly what the semantic web is for. There is so much rich and precise information in the categories that you get great results despite sampling error caused by low recall in the categories. I'd love to see better structure, but not at the cost of fact density or precision. If we can take advantage of the knowledge in the graph to exert gentle pressure that improves categorization in Wikipedia that would be great. It's definitely time for the social industry to move beyond "tags"

Do you have examples of what you would like to be able to do that you can't do in the current situation? I mean, you could even create a "categories that are valid for my special purpose" category.

--
Association Culture-Libre
http://www.culture-libre.org/
