Max's comment is closely related to Wikidata.  The sex property [1] is a model system for exploring important questions for the project at large.

For example, how rigorous do we want to be with automatic classification?  Let's say a property can have one of three values: A, B or C.  Roughly 90% of the valid subjects for that property are known to be either A or B, and 10% are known to be C.  Our automatic classifier assigns every valid subject to either A or B; it cannot distinguish C from A or B.  Every C subject is therefore misclassified, so our error rate is at least 10%.  Would it be acceptable for Wikidata to have a known error rate of 10% in certain properties?  At what error rate does automatic classification become unacceptable?
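
To make the arithmetic in that example concrete, here is a minimal Python sketch.  Only the 90%/10% shares come from the example above; the function name error_floor, the total of 1000 subjects and the even split between A and B are illustrative assumptions, not anything taken from Wikidata.

```python
def error_floor(true_counts, emitted_labels):
    """Lowest possible error rate for a classifier that can only
    ever output the labels in `emitted_labels`.

    Every subject whose true value the classifier cannot emit is
    guaranteed to be misclassified, whatever else the classifier does.
    """
    total = sum(true_counts.values())
    unreachable = sum(
        count for label, count in true_counts.items()
        if label not in emitted_labels
    )
    return unreachable / total


# Hypothetical population: 1000 subjects, 90% A or B, 10% C.
# The 450/450 split between A and B is assumed; it does not change the result.
true_counts = {"A": 450, "B": 450, "C": 100}

# The classifier in the example only ever assigns A or B.
floor = error_floor(true_counts, {"A", "B"})
print(f"error rate is at least {floor:.0%}")
```

This is only a lower bound: it counts the C subjects that can never be labelled correctly and says nothing about how often the classifier confuses A with B.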

Another question this topic raises: do we want to adopt formal domain and range constraints on properties?  If we do, then how do we handle rare values?  How about exceedingly rare values?  (Note that the Wikidata sex property includes intersex in its range constraints [2].)  There is an ongoing discussion in Wikidata's Project chat [3] about whether we want to adopt range and domain constraints (among other property metadata).

Eric
https://www.wikidata.org/wiki/User:Emw

1.  https://www.wikidata.org/wiki/Property:P21
2.  https://www.wikidata.org/wiki/Property_talk:P21
3.  https://www.wikidata.org/wiki/Wikidata:Project_chat#What_type_of_data_should_be_stored (permalink: https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=78406798#What_type_of_data_should_be_stored)


On Tue, Oct 15, 2013 at 2:33 PM, Tom Morris <tfmorris@gmail.com> wrote:
So you've got an agenda that's unrelated to Wikidata or analysis thereof.  Got it.  Perhaps a non-Wikidata list would be a more appropriate forum.

On Tue, Oct 15, 2013 at 2:08 PM, Klein,Max <kleinm@oclc.org> wrote:

Sorry to rant. 

Accepted.

Tom
