Markus,
I share your dissatisfaction with "part of" because that language construct hides many different conceptual relationships that should be teased apart; I think we'll have some community discussion work to do in that regard. One of its uses is: what is the relationship between a human and their behavior? I would say that the "human" <has been defined as having> "human behavior" (or the reverse). But if you have a better suggestion for expressing this concept, I would be really glad to hear it.
Now that you mention it, yes, I agree that in this context only a property called "corresponds with item" makes sense, not its inverse.
I would like to make a further distinction regarding constraints. The nature of constraints is not to set arbitrary limits but to reflect patterns that naturally appear in concepts. In that regard, I dislike the word "constraint", because it suggests that we are placing a "straitjacket" on reality, when it is the other way round: recurring patterns in the real world make us "expect" that a value will fall within certain bounds. I think we should seriously consider using the term "expectation" from now on, because we don't "constrain" the values per se, we "expect" them to have a certain value, and when a value departs from the expected one, it raises an alarm that may or may not reflect an error.
Having made that distinction: yes, you are right. Given that we are separating properties and items, our expectations do not belong to the data itself; they belong to the property.
However, I would like to bring the conversation to a deeper level. What is it that makes the concept of "addition (Q32043)" what it is? What is in "physical object (Q223557)" that we, sentient beings, can perceive and agree to treat as a concept? I mention those two because one is purely abstract and the other purely physical. I would say that "addition (Q32043)" <has been defined as having> "associativity (Q177251)", while "physical object (Q223557)" <has been repeatedly observed to have> "density (Q29539)". We can argue about whether the second is an expectation or not, but the first definitely is not: someone defined "addition" that way, and this information can be sourced. What's more, we could also say that "physical object (Q223557)" <has been defined as having> "density (Q29539)", and I guess we could find sources for that statement too.
With all this I want to make the point that there are two sources of expectations:
- from our experience of seeing repetitions and patterns in the values (male/female/etc., "between 10 and 50"), which belong to the property
- from the agreed definition of the concept itself, which belongs to the data
Cheers, Micru
PS: this is a re-post because my previous message was bounced back "for being too long" :)
Hi, for the behavior, I would say a behavior may be linked to a psychological trait. I'd say a behavior is defined by the person performing many acts belonging to a typical class of events.
Someone is said to be "aggressive" if he typically acts in a hostile way in many situations. I remember a theory about that: https://en.wikipedia.org/wiki/Trait_theory :)
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
David,
One of the uses is: what is the relationship between a human and his behavior?
This is an easy question once you are clear about what "human behaviour" is. According to enwiki, it is a "range of behaviours *exhibited by* humans". The bigger question for me is whether it is useful to record this relationship ("exhibited by") in Wikidata. What would anybody do with this data? In what application could it be of interest?
Moreover, as a great Icelandic ontologist once said: "There is definitely, definitely, definitely no logic, to human behaviour" ;-)
On that regard, I hate the word "constraint", because it means that we are placing a "straitjacket" on reality, when it is the other way round, recurring patterns in the real world make us "expect" that a value will fall within the bonds of our expectations.
I think "constraints" are already understood in this way. The name comes from databases, where a "constraint violation" is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered to be a softer form of modelling than (onto)logical axioms: a constraint can be violated while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, "constraint" is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention.
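The difference can be sketched in a few lines of Python (a toy model with invented data, not Wikidata's actual machinery): a constraint checker *reports* a violation, while an axiom-based reasoner silently *adds* the missing statement instead.

```python
# Toy knowledge base: statements as (subject, property, object) triples.
data = {("piano", "instance of", "keyboard instrument")}

# Hypothetical axiom: every keyboard instrument is a musical instrument.
axiom = ("keyboard instrument", "subclass of", "musical instrument")

def check(data):
    """Constraint view: flag keyboard instruments not recorded as musical instruments."""
    return [s for (s, p, o) in data
            if p == "instance of" and o == "keyboard instrument"
            and (s, "instance of", "musical instrument") not in data]

def infer(data, axiom):
    """Axiom view: derive the missing statement rather than complain about it."""
    sub, _, sup = axiom
    return data | {(s, "instance of", sup)
                   for (s, p, o) in data
                   if p == "instance of" and o == sub}

print(check(data))                                             # → ['piano']
print(("piano", "instance of", "musical instrument")
      in infer(data, axiom))                                   # → True
```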
However, I would like to go to bring the conversation to a deeper level.
...
With all this I want to make the point that there are two sources of expectations:
- from our experience seeing repetitions and patterns in the values
(male/female/etc "between 10 and 50"), which belong to the property
- from the agreed definition of the concept itself, which belong to the data
Yes. I agree with this as a basic dichotomy of things we may want to record in Wikidata. Some things are true by definition, while others are just "very likely" by observation. The exact population of Paris we will never know, but we are completely sure that a piano is an instrument. (Maybe somebody with a better philosophical background than me could give a better perspective of these notions -- "analytical" vs. "empirical" come to mind, but I am sure there is more.)
Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a "planet", and this can change even if the actual lumps in space are pretty much the same.
However, there is yet a deeper level here (you asked for it ;-). Wikidata is not about facts but about statements with references. We do not record "Pluto was a planet until 2006" but "Pluto was a planet until 2006 *according to the IAU*". Likewise, we don't say "Berlin has 3 million inhabitants" but "Berlin has 3 million inhabitants *according to the Amt fuer Statistik Berlin-Brandenburg*". If you compare these two statements, you can see that they are both "empirical", based on our observation of a particular reference. We do not have analytical knowledge of what the IAU or the Amt fuer Statistik might say. So in this sense constraints can only ever be rough guidelines. It does not make logical sense to say "if source A says X then source B must say Y" -- even if we know that X implies Y (maybe by definition), we don't know what sources A and B say. All we can do with constraints is to uncover possible contradictions between sources, which might then be looked into.
Now inferences are slightly different. If we know that X implies Y, then if "A says X" we can infer that (implicitly) "A says Y". That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that "X implies Y", which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of" in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.
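As a toy illustration (invented data and property names, not real Wikidata IDs), inference over "subclass of" statements amounts to computing a transitive closure:

```python
# Hypothetical subclass-of hierarchy, mapping each class to its direct superclasses.
SUBCLASS_OF = {
    "piano": ["keyboard instrument"],
    "keyboard instrument": ["musical instrument"],
    "musical instrument": [],
}

def all_superclasses(cls, hierarchy):
    """Collect every (transitive) superclass of cls via a simple graph walk."""
    result = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in hierarchy.get(current, []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

# "If X is a piano, then X is an instrument" falls out of the hierarchy:
print(all_superclasses("piano", SUBCLASS_OF))
# → {'keyboard instrument', 'musical instrument'}
```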
In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the "universal class hierarchy of the world" but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-)
Besides these two notions ("constraints" to uncover inconsistent references, and "logical axioms" to derive new statements from given ones), there is also a third type of constraint that is purely analytical. If we *define* that our property "birthdate" can only be used on humans (just for the example), then we know that, by our own definition/requirement, any item that has a birthdate must be a human. This is independent of whether some reference says "IBM was born on June 16, 1911" -- we would simply not translate this as "birthdate" in our encoding in Wikidata. So it is possible to have purely analytical knowledge on this level, and we will have complete control over defining it (since it's our choice what we mean by property "birthdate"). In this case, we will get hard constraints that should really never be violated (you mentioned "subject of" should only be used as qualifier -- this is another example of a hard constraint that comes from our own community definitions).
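Such a hard, purely definitional check could be sketched like this (the data layout, item IDs, and property names are invented for the example):

```python
# Toy item store: by our own definition, "birthdate" may only appear on humans.
items = {
    "Q_douglas_adams": {"instance of": ["human"], "birthdate": "1952-03-11"},
    "Q_ibm": {"instance of": ["business"], "birthdate": "1911-06-16"},  # encoding error
}

def violates_birthdate_constraint(item):
    """True if the item carries a birthdate without being an instance of human."""
    return "birthdate" in item and "human" not in item.get("instance of", [])

violations = [qid for qid, item in items.items()
              if violates_birthdate_constraint(item)]
print(violations)  # → ['Q_ibm']
```

Any hit here signals a translation error in the encoding, not a disagreement between sources.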
At the moment, hard constraints (from definitions) and soft constraints (expectations) are simply mixed, and maybe this is fine since we handle them in a similar fashion (humans need to look how to fix the situation). Most constraints, even those that refer to definitions, are rather soft anyway since we apply them to statements, not to hard facts. Hard constraints can only occur in cases where the *encoding* of a statement in Wikidata is wrong (not the intended statement as such, but how it was translated to data).
Markus
Markus,
On Thu, May 29, 2014 at 12:53 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
This is an easy question once you have been clear about what "human behaviour" is. According to enwiki, it is a "range of behaviours *exhibited by* humans".
Settled :) Let's leave it at <defined as a trait of>
What would anybody do with this data? In what application could it be of interest?
Well, our goal is to gather the whole of human knowledge, not to use it. I can think of several applications, but let's leave that open. Never underestimate human creativity ;-)
Moreover, as a great Icelandic ontologist once said: "There is definitely, definitely, definitely no logic, to human behaviour" ;-)
Definitely, that is why we spend so much time in front of flickering squares making them flicker even more. It makes total sense :P
I think "constraints" are already understood in this way. The name comes from databases, where a "constraint violation" is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered to be a softer form of modelling than (onto)logical axioms: a constraint can be violated while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, "constraint" is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention.
Ok, I will not fight traditional labels or conventions. I was interested in pointing out the inappropriateness of using a word inside our community with a definition that doesn't match its use, when there is another word that matches perfectly and conveys its meaning better to users.
Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a "planet", and this can change even if the actual lumps in space are pretty much the same.
Agreed. Better labels could be <defined as instance of>/<defined as subclass of>
Now inferences are slightly different. If we know that X implies Y, then if "A says X" we can infer that (implicitly) "A says Y". That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that "X implies Y", which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of" in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.
Nope, nope, nope. I was not referring to "hard" implications, but to heuristic ones.
Consider these properties in the item namespace: <defined as a trait of> <defined as having> <defined as instance of>
They would translate into these constraints in the property namespace: <likely to be a trait of> <likely to have> <likely to be an instance of>
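As a sketch of this translation (all property names here are hypothetical, following David's proposal rather than any existing Wikidata property):

```python
# Map item-namespace definitional properties to property-namespace expectations.
DEFINITION_TO_EXPECTATION = {
    "defined as a trait of": "likely to be a trait of",
    "defined as having": "likely to have",
    "defined as instance of": "likely to be an instance of",
}

def soften(definitional_statements):
    """Turn definitions on items into soft 'expectation' constraints on properties."""
    return [(s, DEFINITION_TO_EXPECTATION[p], o)
            for (s, p, o) in definitional_statements
            if p in DEFINITION_TO_EXPECTATION]

print(soften([("human behavior", "defined as a trait of", "human")]))
# → [('human behavior', 'likely to be a trait of', 'human')]
```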
In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the "universal class hierarchy of the world" but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-)
I think it is good to think about it and to consider options for dealing with it. For instance: <defined as instance of> "corresponds with item" <Wikimedia community concept>. We already have items that refer to concepts that only make sense for us, so no change in that regard.
At the moment, hard constraints (from definitions) and soft constraints (expectations) are simply mixed, and maybe this is fine since we handle them in a similar fashion (humans need to look how to fix the situation). Most constraints, even those that refer to definitions, are rather soft anyway since we apply them to statements, not to hard facts. Hard constraints can only occur in cases where the *encoding* of a statement in Wikidata is wrong (not the intended statement as such, but how it was translated to data).
As explained above, expectations inferred from definitions should not be treated as hard constraints, but as soft ones.
Micru
@David: I think you should have a look at fuzzy logic https://www.wikidata.org/wiki/Q224821 :)
On 29/05/14 12:41, Thomas Douillard wrote:
@David: I think you should have a look at fuzzy logic https://www.wikidata.org/wiki/Q224821 :)
Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-)
(The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that still might work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.)
Markus
hehe, maybe some kinds of inferences can lead to a good heuristic for suggesting properties and values in the entity suggester. As they naturally become "softer" and "softer" through the combination of uncertainties, this could also provide a natural limit for inference, by fixing a probability below which we don't add a fuzzy fact to the set of facts.
Maybe we could fix a heuristic starting fuzziness or probability score: e.g. one sourced claim -> high score; a disputed claim -> lower score; and so on, based on ranks.
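A very rough sketch of such a scoring scheme (the numbers are placeholders to illustrate the shape of the idea, not a calibrated proposal):

```python
def base_score(n_sources, disputed, rank):
    """Starting confidence for a claim, from its sourcing, dispute status, and rank."""
    score = min(0.5 + 0.2 * n_sources, 0.95)  # sourced claims start high, capped
    if disputed:
        score *= 0.5                           # disputed claims are penalised
    if rank == "deprecated":
        score = 0.0                            # deprecated claims carry no weight
    return score

def combine(*scores, threshold=0.3):
    """Multiply uncertainties along an inference chain; drop facts below threshold."""
    product = 1.0
    for s in scores:
        product *= s
    return product if product >= threshold else None

a = base_score(1, False, "normal")   # one source, undisputed: 0.7
b = base_score(2, True, "normal")    # two sources but disputed: 0.45
print(combine(a, b))       # kept: still above the threshold
print(combine(a, b, b))    # None: decayed below the threshold
```

Each inference step shrinks the score, so long chains of soft facts cut themselves off automatically.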
2014-05-29 13:43 GMT+02:00 Markus Krötzsch markus@semantic-mediawiki.org:
On 29/05/14 12:41, Thomas Douillard wrote:
@David: I think you should have a look to fuzzy logic https://www.wikidata.org/wiki/Q224821:)
Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-)
(The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that still might work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.)
Markus
2014-05-29 1:48 GMT+02:00 David Cuenca <dacuetu@gmail.com mailto:dacuetu@gmail.com>:
Markus, On Thu, May 29, 2014 at 12:53 AM, Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: This is an easy question once you have been clear about what "human behaviour" is. According to enwiki, it is a "range of behaviours *exhibited by* humans". Settled :) Let's leave it at <defined as a trait of> What would anybody do with this data? In what application could it be of interest? Well, our goal it to gather the whole human knowledge, not to use it. I can think of several applications, but let's leave that open. Never underestimate human creativity ;-) Moreover, as a great Icelandic ontologist once said: "There is definitely, definitely, definitely no logic, to human behaviour"
;-)
Definitely, that is why we spend so much time in front of flickering squares making them flicker even more. It makes total sense :P

I think "constraints" are already understood in this way. The name comes from databases, where a "constraint violation" is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered to be a softer form of modelling than (onto)logical axioms: a constraint can be violated while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, "constraint" is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention.

Ok, I will not fight traditional labels or conventions. I was interested in pointing out the inappropriateness of using a word inside our community with a definition that doesn't match its use, when there is another word that matches perfectly and conveys its meaning better to users.

Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a "planet", and this can change even if the actual lumps in space are pretty much the same.

Agreed. Better labels could be <defined as instance of>/<defined as subclass of>

Now inferences are slightly different. If we know that X implies Y, then if "A says X" we can infer that (implicitly) "A says Y". That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that "X implies Y", which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of" in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.

Nope, nope, nope. I was not referring to "hard" implications, but to heuristic ones. Consider that these properties in the item namespace:

<defined as a trait of>
<defined as having>
<defined as instance of>

would translate as these constraints in the property namespace:

<likely to be a trait of>
<likely to have>
<likely to be an instance of>

In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the "universal class hierarchy of the world" but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-)

I think it is good to think about it and to consider options to deal with it. Like for instance: <defined as instance of> "corresponds with item" <Wikimedia community concept>. We already have items that refer to concepts that only make sense for us, so no change in that regard.

At the moment, hard constraints (from definitions) and soft constraints (expectations) are simply mixed, and maybe this is fine since we handle them in a similar fashion (humans need to look how to fix the situation). Most constraints, even those that refer to definitions, are rather soft anyway since we apply them to statements, not to hard facts. Hard constraints can only occur in cases where the *encoding* of a statement in Wikidata is wrong (not the intended statement as such, but how it was translated to data).

As explained above, expectations inferred from definitions should not be treated as hard constraints, but as soft ones.
Micru

_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 29/05/14 13:53, Thomas Douillard wrote:
hehe, maybe some kind of inference can lead to a good heuristic to suggest properties and values in the entity suggester. As inferred facts naturally become "softer" and "softer" through the combination of uncertainties, this could also provide a kind of limit for inference, by fixing a probability below which we don't add a fuzzy fact to the set of facts.
Maybe we could fix a heuristic starting fuzziness or probability score: "1 sourced claim" -> high score; a disputed claim -> lower score; and so on, based on ranks.
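Thomas's idea of scores "softening" as inferences chain could be sketched roughly like this (the scores, the cutoff, the example facts, and the product combination are all illustrative assumptions, not a proposal for an actual implementation):

```python
# A minimal sketch: inferred facts inherit a score from their premises,
# scores shrink as rules chain, and facts below a cutoff are discarded.
# All names and numbers here are invented for illustration.

CUTOFF = 0.5  # hypothetical threshold below which we drop a fuzzy fact

def combine(premise_scores, rule_score):
    """Combine scores by product, so chained inferences get 'softer'."""
    score = rule_score
    for s in premise_scores:
        score *= s
    return score

def infer(facts, rules):
    """facts: {fact: score}; rules: [(premises, conclusion, rule_score)]."""
    changed = True
    while changed:
        changed = False
        for premises, conclusion, rule_score in rules:
            if all(p in facts for p in premises):
                score = combine([facts[p] for p in premises], rule_score)
                if score >= CUTOFF and score > facts.get(conclusion, 0.0):
                    facts[conclusion] = score
                    changed = True
    return facts

facts = {("piano", "subclass of", "keyboard instrument"): 0.9}
rules = [([("piano", "subclass of", "keyboard instrument")],
          ("piano", "subclass of", "instrument"), 0.8)]
print(infer(facts, rules))  # the conclusion is added with score 0.9 * 0.8 = 0.72
```

The product combination is exactly the kind of choice Markus questions below: a different combination rule would give different scores for the same inputs.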
Sorry, I have to expand on this a bit ...
My main point was that there are many fuzzy logics (depending on the t-norm you chose) and many probabilistic logics (depending on the stochastic assumptions you make). The meaning of a score crucially depends on which logic you are in. Moreover, at least in fuzzy logic, the scores only are relevant in comparison to other scores (there is no absolute meaning to "0.3") -- therefore you need to ensure that the scores are assigned in a globally consistent way (0.3 in Wikidata would have to mean exactly the same wherever it is used).
This makes it extremely hard to implement such an approach in practice in a large, distributed knowledge base like ours. What's more, you cannot find these scores in books or newspapers, so you somehow have to make them up in another way. You suggested to use this for statements that are not generally accepted, but how do you measure "how disputed" a statement is? If two thirds of references are for it and the rest is against it, do you assign 0.66 as a score? It's very tricky.
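To illustrate the point about t-norms: the same premise scores yield different combined scores under different fuzzy logics, so a number like "0.3" has no meaning outside the chosen logic. A toy comparison, with made-up scores:

```python
# Two common t-norms from fuzzy logic, applied to the same pair of
# scores. The inputs (0.7, 0.6) are made up purely for illustration.

def goedel_tnorm(a, b):
    """Gödel (minimum) t-norm: conjunction is as strong as its weakest part."""
    return min(a, b)

def product_tnorm(a, b):
    """Product t-norm: conjunction degrades with every uncertain premise."""
    return a * b

a, b = 0.7, 0.6
print(goedel_tnorm(a, b))   # 0.6
print(product_tnorm(a, b))  # roughly 0.42
```

A knowledge base that mixes scores computed under different t-norms would be comparing incomparable numbers, which is Markus's global-consistency worry.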
Fuzzy logic has its main use in fuzzy control (the famous "washing machine" example), which is completely different and largely unrelated to fuzzy knowledge representation. In knowledge representation, fuzzy approaches are also studied, but their application is usually in a closed system (e.g., if you have one system that extracts data from a text and assigns "certainties" to all extracted facts in the same way). It's still unclear how to choose the right logic, but at least it will give you a uniform treatment of your data according to some fixed principles (whether they make sense or not).
The situation is much clearer in probabilistic logics, where you define your assumptions first (e.g., you assume that events are independent or that dependencies are captured in some specific way). This makes it more rigorous, but also harder to apply, since in practice these assumptions rarely hold. This is somewhat tolerable if you have a rather uniform data set (e.g., a lot of sensor measurements that give you some probability for actual states of the underlying system). But if you have a huge, open, cross-domain system like Wikidata, it would be almost impossible to force it into a particular probability framework where "0.3" really means "in 30% of all cases".
Also note that scientific probability is always a limit of observed frequencies. It says: if you do something again and again, this is the rate you will get. Often-heard statements like "We have an 80% chance to succeed!" or "Chances are almost zero that the Earth will blow up tomorrow!" are scientifically pointless, since you cannot repeat the experiments that they claim to make statements about. Many things we have in Wikidata are much more on the level of such general statements than on the level that you normally use probability for (good example of a proper use of probability: "based on the tests that we did so far, this patient has a 35% chance of having cancer" -- these are not the things we normally have in Wikidata).
Markus
2014-05-29 13:43 GMT+02:00 Markus Krötzsch <markus@semantic-mediawiki.org>:
On 29/05/14 12:41, Thomas Douillard wrote:

@David: I think you should have a look at fuzzy logic <https://www.wikidata.org/wiki/Q224821> :)

Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-)

(The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that still might work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.)

Markus
David,
I need to answer your first assertion separately:
On 29/05/14 01:48, David Cuenca wrote:
Well, our goal is to gather the whole of human knowledge, not to use it.
No, that is really not the case. Our goal is to gather carefully selected parts of the human knowledge. Our community defines what these parts are. Just like in Wikipedia.
Even if you wanted to gather "all human knowledge" this goal would not be a useful principle for deciding what to do first. For example, we know that every natural number is an element of the natural numbers. It is obviously not our goal to gather these infinitely many statements (if you disagree, you could try to propose a bot that starts to import this data ;-). Therefore, it is clear that gathering *all* knowledge is not even an abstract ideal of our community. Quite the contrary: we explicitly don't want it.
The natural numbers are just an extreme example. Many other cases exist (for instance, we do not import all free databases into Wikidata, although they are finite). The question then is: How do we know what data we want and what data we don't want? What principles do we base our decision on? For me, there are two main principles:
* practical utility (does it serve a purpose that we care about?)
* simplicity and clarity (is it natural to express and easy to understand?)
You said that we cannot foresee *all* applications, but that does not mean that we should start to create data for which we cannot foresee *any*. There is just too much data of the latter kind, and we need to make a choice.
Don't get me wrong: I consider myself an "inclusionist". Better to have some useless data than to miss some important content. But there is no neutral ground here -- we all must draw a line somewhere (or start writing the natural number import bot ;-). My position is: if we have data that is very hard to capture and at the same time has no conceivable use, then we should not spend our energy on it while there is so much clearly defined, important data that we are still missing.
Markus
The other answers, under the original subject:
On 29/05/14 01:48, David Cuenca wrote:
Settled :) Let's leave it at <defined as a trait of>
I don't think it is very clear what the intention of this property is. What are the limits of its use? What is it meant to do? Can behaviour really be a "trait" of a species? If we allow it here, it seems to apply to all kinds of connections: density/car? eternity/time? time/reality? evil/devil? rigour/science? -- this is opening a can of worms. It will be hard to maintain this.
Wikiuser13 recently added "consists of: Neptune" to Q1. It was fixed. But it is a good example of the kind of confusion that comes from such general ontological (in the philosophical sense) properties. And "consists of" is still very simple compared to "defined as a trait of". Can't we focus on more obvious things like "has social network account" for a while? ;-)
...
Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a "planet", and this can change even if the actual lumps in space are pretty much the same.
Agreed. Better labels could be <defined as instance of>/<defined as subclass of>
I don't think this is better. The short names are fine. As I explained in my email, Wikidata statements are mainly about what the external references say. The distinction between "defined" and "observed" is not on the surface of this. The main question is "Did the reference say that pianos are instruments?" but not "Did the reference say pianos are instruments because of the definition of 'piano'?" Therefore, we don't need to put this information in our labels.
Now inferences are slightly different. If we know that X implies Y, then if "A says X" we can infer that (implicitly) "A says Y". That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that "X implies Y", which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of" in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.
Nope, nope, nope. I was not referring to "hard" implications, but to heuristic ones.
Consider that these properties in the item namespace:
<defined as a trait of>
<defined as having>
<defined as instance of>
Would translate as these constraints in the property namespace:
<likely to be a trait of>
<likely to have>
<likely to be an instance of>
I think you might have misunderstood my email. I was arguing *in favour* of soft constraints, but in the paragraph before the one about inferences that you reply to here. Inferences are hard ways for obtaining new knowledge from our own definitions. Example:
If X is the father of Y according to reference A Then Y is the child of X according to reference A
This is as hard as it can get. We are absolutely sure of this since this rule just explains the relationship between two different ways we have for encoding family relationships.
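The father/child rule is purely structural: it re-encodes the same claim, so the reference carries over unchanged. A minimal sketch of such a hard inference (the property names and the tuple encoding are illustrative placeholders, not actual Wikidata property IDs):

```python
# Sketch of a "hard" inference: the rule only rewrites a statement into
# its inverse encoding, so the provenance (reference) is preserved as-is.
# Property names and statement tuples are invented for illustration.

def infer_child_statements(statements):
    """For every (x, 'father of', y, ref), add (y, 'child of', x, ref)."""
    inferred = []
    for subject, prop, value, reference in statements:
        if prop == "father of":
            inferred.append((value, "child of", subject, reference))
    return inferred

statements = [("Philip", "father of", "Charles", "some biography")]
print(infer_child_statements(statements))
# [('Charles', 'child of', 'Philip', 'some biography')]
```

There is no score or softness anywhere in this rule: accepting the premise and the rule forces the conclusion, which is exactly the contrast with the soft "expectations" discussed below.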
Below, you said "expectations inferred from definitions should not be treated as hard constraints" -- maybe this mixture of terms indicates that I have not been clear enough about the distinction between "inference" and "constraint". They are really completely different ways of looking at things. Inferences are something that adds (inevitable) conclusions to your knowledge, while constraints just tell you what to check for. If you accept the premises of an inference and the inference rule, then you must also accept the conclusion -- there is no "soft" way of reading this. To make it soft, you can start to formalise "softness" in your knowledge, using fuzzy logic or whatnot (see my other email with Thomas).
I don't think we can use "soft inferences" (in the sense of fuzzy logic et al.) but I am in favour of "soft constraints" (in the sense of your "expectations"). I guess we agree on all of this, but have a bit of trouble in making ourselves clear :-) But it is rather subtle material after all.
In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the "universal class hierarchy of the world" but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-)
I think it is good to think about it and to consider options to deal with it. Like for instance: <defined as instance of> "corresponds with item" <Wikimedia community concept> We already have items that refer to concepts that only make sense for us, so no change in that regard.
If you say this, then you are taking the position that instance of is defined by the community rather than being taken from external sources. My point was that such a position is not justified, given that there are so many instance of relations with references (and even qualifiers). I am unsure about the status of "subclass of" -- it could be considered a community concept or a world concept. Maybe it's best to leave this to applications that use the data.
(Btw. "corresponds with item" would be another unclear property that we should better avoid.)
At the moment, hard constraints (from definitions) and soft constraints (expectations) are simply mixed, and maybe this is fine since we handle them in a similar fashion (humans need to look how to fix the situation). Most constraints, even those that refer to definitions, are rather soft anyway since we apply them to statements, not to hard facts. Hard constraints can only occur in cases where the *encoding* of a statement in Wikidata is wrong (not the intended statement as such, but how it was translated to data).
As explained above, expectations inferred from definitions should not be treated as hard constraints, but as soft ones.
As I said, hard and soft constraints would probably be treated in the same way anyway (which is the soft way), so I guess we agree here. My distinction of "hard" and "soft" constraints was about where we are getting the constraints from: some constraints are merely "usually satisfied in practice" (soft) while others are "requirements we have defined for the use of our properties" (hard). We can treat them similarly, but it may still be good to understand where our "expectations" come from in each case.
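The "treat them the same, but remember where they came from" idea could be sketched like this (the constraints, origin tags, and example data are all invented for illustration):

```python
# Sketch: every constraint check yields a warning for a human to review
# (the "soft" treatment), but each warning records whether the constraint
# stems from a definition ("hard" origin) or a mere expectation ("soft"
# origin). All constraints and data here are invented for illustration.

def check(statements, constraints):
    """constraints: list of (origin, description, predicate) triples."""
    warnings = []
    for origin, description, predicate in constraints:
        for stmt in statements:
            if not predicate(stmt):
                warnings.append((origin, description, stmt))
    return warnings

statements = [("piano", "instance of", "fruit")]
constraints = [
    ("hard", "value of 'instance of' must be a class",
     lambda s: s[1] != "instance of" or s[2] in {"instrument", "class"}),
    ("soft", "pianos are expected to be instruments",
     lambda s: s[0] != "piano" or s[1] != "instance of" or s[2] == "instrument"),
]
for origin, desc, stmt in check(statements, constraints):
    print(origin, "-", desc)  # both constraints flag the bad statement
```

Both kinds of violation surface the same way; only the recorded origin differs, which preserves the information about where the expectation came from.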
Markus