Statistical methods can deal with black swans, but you've got to get away from normal distributions and also model the risk that your model is wrong.
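As a rough illustration of the point above (not a method taken from this thread), the sketch below compares the tail probability of a standard normal with that of a standard Cauchy distribution, then mixes the two using an assumed probability that the normal model is wrong. The choice of Cauchy as the heavy-tailed alternative, the function names, and the 1% figure are all assumptions made for the example.

```python
import math


def normal_tail(k: float) -> float:
    """P(X > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def cauchy_tail(k: float) -> float:
    """P(X > k) for a standard Cauchy variable (assumed heavy-tailed stand-in)."""
    return 0.5 - math.atan(k) / math.pi


def hedged_tail(k: float, p_model_wrong: float) -> float:
    """Mix both tails, weighted by the assumed chance the normal model is wrong."""
    return (1.0 - p_model_wrong) * normal_tail(k) + p_model_wrong * cauchy_tail(k)


if __name__ == "__main__":
    # Print the three tail estimates for increasingly extreme events.
    for k in (2.0, 4.0, 6.0):
        print(k, normal_tail(k), cauchy_tail(k), hedged_tail(k, 0.01))
```

Even a small weight on the heavy-tailed alternative dominates the estimate for extreme events, which is one way to read "model the risk that your model is wrong".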
Since training sets come from the same place sausage comes from, training sets in machine learning rarely teach the algorithm the correct prior distribution of the class. Punch a new prior into the system and it will perform much better.
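One common way to "punch in a new prior" is the standard Bayes-rule adjustment sketched below. It is only a sketch: it assumes the classifier outputs reasonably calibrated class probabilities and that only the class frequencies differ between the training set and real use. The function name and the numbers are made up for illustration.

```python
def reweight_posterior(posterior, train_prior, new_prior):
    """Rescale class probabilities from the training prior to a new prior.

    Each class probability is multiplied by new_prior / train_prior and the
    result is renormalized so the probabilities sum to 1 again.
    """
    unnormalized = {
        label: posterior[label] * new_prior[label] / train_prior[label]
        for label in posterior
    }
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Hypothetical example: a balanced training set, but the class is rare in practice.
adjusted = reweight_posterior(
    posterior={"match": 0.9, "no_match": 0.1},
    train_prior={"match": 0.5, "no_match": 0.5},
    new_prior={"match": 0.01, "no_match": 0.99},
)
print(adjusted)
```

With these invented numbers, a confident-looking 0.9 drops to well under 0.1 once the rarity of the class is taken into account.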
Some kinds of sampling bias can be partly overcome. Involving multiple people smooths out individual bias. (Kurzweil's project of stealing a human soul with a neural network is already being scooped by projects that are stealing statistical models of many souls.)
Language zone Wikipedias are obviously biased towards the viewpoint of people in that language zone. Mostly that's a good thing, because a Chinese knowledge base that reflected an Anglophone bias would seem unnatural to Chinese speakers.
And that's the point. Useful systems don't "eliminate bias" but they are given the bias that they need in order to do their job.
I agree categories are most useful when they are the categories you need. The toolbox above can help you estimate these with precision so high that it's difficult to measure.
Arnold S isn't the best case for categories because humans, bodybuilders, places, chemicals and such are well ontologized. Look at the collection that comes up for the word "Intersection":
http://en.wikipedia.org/wiki/Intersection
Most of these are connected to the larger mass through just a few categories that would be hard to express as restriction types. It is reasonable for Wikipedia to require concepts to have a category because really, if you want to assert something exists and can't find some category that this thing is a member of, I wouldn't be so sure that this thing exists.
I'm not sure if there is anything I can't do with the current situation, but bear in mind that I'm going to look at DBpedia, Wikidata and Freebase facts too, and I'm willing to do automated data cleaning and to hand-clean results that I cannot accept. It's a tricky and somewhat expensive process (though it's cheaper than conventional ontology construction), so cleaner data makes this process cheaper, quicker, and available to more end users, who can then define the categories they need for their own purposes.
Well, "Intersection" is just a disambiguation page, but the categories for intersection (set theory) are good starting points for queries. Maybe I want to look at all concepts that are in both set theory and calculus. Or maybe I want to see all mathematical concepts except those in set theory, sorted by the date of the first publication that described them. I'd argue that these are potentially common scenarios that we want to make easier for everyone.
I agree that there is no single perfect/master/universal ontology. Sure, our minds are all rooted in our perceptions. If I say "dad" it conjures different images in each of our heads. But if I say "Tom Cruise" our mental images are much more similar. So there are large portions of our internal ontologies and mental representations that we share, which are generally what we put in an encyclopedia for our culture.
Cultural differences are certainly fascinating and often follow linguistic barriers. The Pirahã people don't have numbers, just one, two, and many, and their language can be whistled. In standard psychology, a typical hurdle for self-awareness in children and animals is the ability to find a spot painted on one's head by using a mirror. To try and imagine an extreme, if I were the first bear to use Wikipedia I might want to make things that can and cannot be eaten the fundamental categories. Who knows?
You can certainly view the current category system as a graph with only one type of edge (a is a member of b, or equivalently, b contains a). Having loops just means, for example, that the graph can't be a tree, which isn't inherently bad. It just means that you have to alter the definition of some concepts, like root, to fit a broader variety of possible structures.
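To make the queries and the graph view above concrete, here is a small sketch. It treats categories as a graph with a single edge type, walks it with a visited set so that a loop does no harm, and then answers the two example queries with plain set operations. All category memberships, article names, and years below are invented toy data, not real Wikipedia content, and the helper name `articles_under` is hypothetical.

```python
from collections import deque

# Invented toy data: category -> direct subcategories (one deliberate loop).
SUBCATEGORIES = {
    "Mathematics": ["Set theory", "Calculus"],
    "Set theory": ["Set operations"],
    "Set operations": ["Set theory"],  # loop back to the parent
    "Calculus": [],
}

# Invented toy data: category -> {article: year of first publication}.
ARTICLES = {
    "Mathematics": {"Concept A": 1900},
    "Set theory": {"Concept B": 1874, "Concept C": 1850},
    "Set operations": {"Concept D": 1880},
    "Calculus": {"Concept C": 1850, "Concept E": 1700},
}


def articles_under(category):
    """Collect articles in a category and in everything reachable below it."""
    seen = {category}
    queue = deque([category])
    found = {}
    while queue:
        current = queue.popleft()
        found.update(ARTICLES.get(current, {}))
        for child in SUBCATEGORIES.get(current, []):
            if child not in seen:  # the visited set makes loops harmless
                seen.add(child)
                queue.append(child)
    return found


set_theory = articles_under("Set theory")
calculus = articles_under("Calculus")
mathematics = articles_under("Mathematics")

# Concepts that are in both set theory and calculus.
in_both = set(set_theory) & set(calculus)

# Mathematical concepts outside set theory, oldest publication first.
outside = sorted(set(mathematics) - set(set_theory), key=lambda name: mathematics[name])

print(in_both)
print(outside)
```

Note that the traversal never needs a unique root: any category can serve as the starting point, which is one way of relaxing the tree assumptions mentioned above.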
From: paul@ontology2.com
To: wikidata-l@lists.wikimedia.org
Date: Tue, 7 May 2013 13:50:10 +0000
Subject: Re: [Wikidata-l] Question about wikipedia categories.
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 2013-05-07 15:50, Paul A. Houle wrote:
Statistical methods can deal with black swans, but you've got to get away from normal distributions and also model the risk that your model is wrong.
Do you have recommendations for things I could read on this topic? To me it seems hard to evaluate the probability of something like "I exist", when probability is something that comes well after one's own existence in an "existential chain". Of course, let's suppose that I do not exist, but some "existential demon" fools me with the illusion that I do. Then I don't exist, so "I" can't worry about existence, since "I" don't exist in the first place. Now how would you evaluate the chance that "I" exist? 1/2? 1?
Since training sets come from the same place sausage comes from, training sets in machine learning rarely teach the algorithm the correct prior distribution of the class. Punch a new prior into the system and it will perform much better.
Once again, do you have recommendations for things I could read on this topic?
Some kinds of sampling bias can be partly overcome. Involving multiple people smooths out individual bias. (Kurzweil's project of stealing a human soul with a neural network is already being scooped by projects that are stealing statistical models of many souls.)
I'm afraid that I would need more references here too.
Language zone Wikipedias are obviously biased towards the viewpoint of people in that language zone. Mostly that's a good thing, because a Chinese knowledge base that reflected an Anglophone bias would seem unnatural to Chinese speakers. And that's the point. Useful systems don't "eliminate bias" but they are given the bias that they need in order to do their job.
While I agree with you on this point, I must admit that I don't feel at ease saying it. I mean, being satisfied with "it does the job" may itself already be a cultural bias.
Most of these are connected to the larger mass through just a few categories that would be hard to express as restriction types. It is reasonable for Wikipedia to require concepts to have a category because really, if you want to assert something exists and can't find some category that this thing is a member of, I wouldn't be so sure that this thing exists.
The concept exists independently of the ontological status of the object to which it refers. Take "nothingness"[1]: the English Wikipedia article does not give a great definition: "Nothingness is the state of being nothing". Well no, nothingness actually refers to nothing, and any statement which gives an existential attribute to nothingness is wrong. The only correct statements about nothingness are tautologies of "nothingness doesn't exist". But the concept of nothingness does exist, through the thought which sustains it. The thought exists, but that doesn't mean that what the thought refers to also exists. And as you can see with the "Nothing" article, you can find categories for the concept, even if no attribute would apply to what the concept denotes.
[1] https://en.wikipedia.org/wiki/Nothing