Why do you want categories in the first place? Why not extract whatever semantic meaning you need (e.g., about genderbread) by parsing the sentences in each article?
On Mon, Feb 17, 2014 at 3:08 PM, Stuart A. Yeates syeates@gmail.com wrote:
On Tue, Feb 18, 2014 at 8:40 AM, Samuel Klein meta.sj@gmail.com wrote:
Why do you want categories rather than structured data about gender, religion, and race?
Because that structured data would embody even stronger assumptions than the current categorisation system? Gender, religion and race are self-defined on en.wiki; you'd have to get the data first and then prove that your structure didn't contradict any of the self-definitions.
Self-definition is fine, and compatible with what I think of as structured data: large numbers of high-granularity data points associated with each [article]. Category data is a narrow subset of structured data that happens to support a tree structure, in the MediaWiki implementation.
Coming from a Western, English-language point of view it's very easy to create structures that declare groups of people such as fa'afafine incapable of existing.
... so many assumptions you just made there :-)
If the concept of fa'afafine exists in our knowledge-set (and it does: anything that passes some low bar of verifiability can be included in what our projects consider knowledge), then it can be noted as a data point applying to some other topic [article]. There is little that is Western or English about our verifiability standards; though if you are talking about the English-language Wikipedia, having an English-language source increases the verifiability of a data-claim.
We can create a category for every data-attribute -- in this case, [[category:fa'afafine]] (which does not yet exist) or [[category:kathoey]] (which does). If we didn't have Wikidata, that would be the clear solution. But now it is enough to capture the attribute of "self-identifies as fa'afafine", whether or not there is an associated category. In particular, arguments about "how many category-intersections of the fa'afafine gender and other traits deserve their own category" are red herrings. All that matters is identifying these (self-defined) attributes in a way that is easy to process in bulk.
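A rough sketch of why the intersection argument goes away, assuming attributes are stored as simple (article, attribute, value) claims. The article titles, attribute names, and helper functions here are all made up for illustration; this is not how Wikidata or MediaWiki actually stores anything.

```python
from collections import defaultdict

# Hypothetical self-defined claims: (article, attribute, value).
# Titles and attribute names are placeholders, not real data.
CLAIMS = [
    ("Article A", "gender", "fa'afafine"),
    ("Article A", "occupation", "artist"),
    ("Article B", "gender", "kathoey"),
    ("Article B", "occupation", "artist"),
    ("Article C", "gender", "fa'afafine"),
    ("Article C", "occupation", "athlete"),
]

def build_index(claims):
    """Map each (attribute, value) pair to the set of articles carrying it."""
    index = defaultdict(set)
    for title, attr, value in claims:
        index[(attr, value)].add(title)
    return index

def intersect(index, *pairs):
    """Articles matching every given (attribute, value) pair."""
    sets = [index.get(pair, set()) for pair in pairs]
    return set.intersection(*sets) if sets else set()

INDEX = build_index(CLAIMS)
# Any intersection is computed on demand; no category has to exist for it.
print(intersect(INDEX, ("gender", "fa'afafine"), ("occupation", "artist")))
```

With data in this shape, nobody needs to decide in advance which intersections "deserve" a category; whichever combination someone wants is just a query.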
A great example of this is the perennial proposal to import biographical details from some library system (usually the Deutsche Nationalbibliothek one), when they have a different definition of gender to en.wikipedia.
Why is this a problem? The attribute "gender according to DNB" is a) useful historical data, b) verifiable, and c) easy to add to Wikidata. I believe you can have "DNB-gender" as one of the variations on the global "gender" attribute. Most articles (unless they are talking about the DNB specifically) would likely refer to the global attribute. But this way you can have both datasets globally accessible. Then after the import is done, people can write bulk data-cleaning scripts to help humans review those articles where the two differ. And in cases where there is a years-long edit war about what the global attribute should be, you can keep track of what each source's input data says.
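The kind of bulk data-cleaning script I have in mind could be as small as the sketch below. It assumes each article record carries a global "gender" claim plus an optional source-specific "gender_dnb" claim; those field names, the sample records, and the needs_review helper are hypothetical and do not correspond to real Wikidata properties or any existing tool.

```python
# Hypothetical article records. "gender" is the global attribute;
# "gender_dnb" is the value imported from the DNB, when there is one.
ARTICLES = [
    {"title": "Article A", "gender": "female", "gender_dnb": "female"},
    {"title": "Article B", "gender": "fa'afafine", "gender_dnb": "male"},
    {"title": "Article C", "gender": "male"},  # no DNB claim imported
]

def needs_review(articles, global_key="gender", source_key="gender_dnb"):
    """Return titles where both claims are present and disagree."""
    flagged = []
    for article in articles:
        global_value = article.get(global_key)
        source_value = article.get(source_key)
        # A missing claim is not a conflict; only flag real disagreement.
        if global_value is None or source_value is None:
            continue
        if global_value != source_value:
            flagged.append(article["title"])
    return flagged

for title in needs_review(ARTICLES):
    print("review:", title)
```

Nothing here overwrites either value: the script only produces a list for humans to look at, and both the global claim and the DNB claim stay on record.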
Sam.