Why do you want categories in the first place? Why not extract
whatever semantic meaning you need (e.g., about genderbread) by
parsing the sentences in each article?
On Mon, Feb 17, 2014 at 3:08 PM, Stuart A. Yeates <syeates(a)gmail.com> wrote:
On Tue, Feb 18, 2014 at 8:40 AM, Samuel Klein wrote:
Why do you want categories rather than structured data about gender,
religion, and race?
Because that structured data would embody even stronger assumptions than the
current categorisation system? Gender, religion and race are self-defined on
en.wiki; you'd have to get the data first and then prove that your structure
didn't contradict any of the self-definitions.
Self-definition is fine, and compatible with what I think of as
structured data: large numbers of high-granularity data points
associated with each [article]. Category data is a narrow subset of
structured data that happens to support a tree structure, in the
Coming from a Western, English-language point of view, it's very easy to
create structures that declare groups of people such as fa'afafine incapable
... so many assumptions you just made there :-)
If the concept of fa'afafine exists in our knowledge-set (and it does:
anything that passes some low bar of verifiability can be included in
what our projects consider knowledge), then it can be noted as a data
point applying to some other topic [article]. There is little that is
Western or English about our verifiability standards; though if you
are talking about the English-language Wikipedia, having an
English-language source increases the verifiability of a data-claim.
We can create a category for every data-attribute -- in this case,
[[category:fa'afafine]] (which does not yet exist) or
[[category:kathoey]] (which does). If we didn't have wikidata, that
would be the clear solution.
But now it is enough to capture the attribute of "self-identifies as
fa'afafine", whether or not there is an associated category. In
particular, arguments about "how many category-intersections of the
fa'afafine gender and other traits deserve their own category" are red
herrings.
All that matters is identifying these (self-defined) attributes in a
way that is easy to process in bulk.
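
To make "easy to process in bulk" concrete, here is a rough sketch of what
I have in mind. It is only an illustration: the Claim type, the attribute
names, and the sample rows are all invented placeholders, not the Wikidata
data model or any real article data. The point is that an intersection of
attributes can be computed on demand, without a category for each
combination.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    """One data point about one article (hypothetical structure)."""
    article: str    # article title
    attribute: str  # e.g. "self-identified gender"
    value: str
    source: str     # where the claim can be verified


# Placeholder rows, invented purely for illustration.
claims = [
    Claim("Article A", "self-identified gender", "fa'afafine", "source-1"),
    Claim("Article B", "self-identified gender", "kathoey", "source-2"),
    Claim("Article C", "self-identified gender", "fa'afafine", "source-3"),
    Claim("Article A", "occupation", "artist", "source-4"),
]


def index_by_attribute(claims):
    """Group article titles by (attribute, value) for bulk lookup."""
    index = defaultdict(set)
    for claim in claims:
        index[(claim.attribute, claim.value)].add(claim.article)
    return index


def articles_matching(index, *pairs):
    """Return articles carrying every (attribute, value) pair given."""
    sets = [index.get(pair, set()) for pair in pairs]
    return set.intersection(*sets) if sets else set()


index = index_by_attribute(claims)
print(sorted(articles_matching(index, ("self-identified gender", "fa'afafine"))))
print(sorted(articles_matching(
    index,
    ("self-identified gender", "fa'afafine"),
    ("occupation", "artist"),
)))
```

With the placeholder rows above, the first query returns Articles A and C
and the second returns only Article A; neither needs a matching category
to exist.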
A great example of this is the perennial proposal to import gender
details from some library system (usually the Deutsche Nationalbibliothek
one), when they have a different definition of gender to en.wikipedia.
Why is this a problem?
The attribute "gender according to DNB" is a) useful historical data,
b) verifiable, and c) easy to add to wikidata. I believe you can have
"DNB-gender" as one of the variations on the global "gender"
attribute. Most articles (unless they are talking about the DNB
specifically) would likely refer to the global attribute. But this
way you can have both datasets globally accessible. Then after the
import is done, people can write bulk data-cleaning scripts to help
humans review those articles where the two differ. And in cases where
there is a years-long edit war about what the global attribute should
be, you can keep track of what the input source-data is from various