New subject: [Wikidata-l] Question about wikipedia categories.

7 May 2013


Statistical methods can deal with black swans, but you've got to get
away from normal distributions and also model the risk that your model is
wrong.
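
To make that concrete, here is a rough sketch of how much a heavy tail and a small allowance for model error change the estimated chance of a "6 sigma" event. The choice of a Student-t distribution with 2 degrees of freedom as the heavy-tailed alternative, the 5% chance of the normal model being wrong, and all function names are illustrative assumptions, not anything prescribed in this thread.

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def student_t2_tail(t):
    """P(T > t) for a Student-t variable with 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(2 + t^2)).
    """
    return 0.5 - t / (2 * math.sqrt(2 + t * t))

def tail_with_model_risk(z, p_model_wrong):
    """Blend the two tails, weighted by the assumed chance that the
    normal model is the wrong one."""
    return ((1 - p_model_wrong) * normal_tail(z)
            + p_model_wrong * student_t2_tail(z))

if __name__ == "__main__":
    z = 6.0
    print("normal tail:       ", normal_tail(z))
    print("heavy (t, 2 df):   ", student_t2_tail(z))
    print("5% model risk mix: ", tail_with_model_risk(z, 0.05))
```

Even a small weight on the heavy-tailed alternative dominates the estimate, because the normal tail is many orders of magnitude smaller.
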
Since training sets come from the same place sausage comes from, 
training sets in machine learning rarely teach the algorithm the correct 
prior distribution of the class.  Punch a new prior into the system and it 
will perform much better.
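
One common way to "punch in" a new prior, sketched here under assumptions: if the classifier outputs posterior probabilities and you know the class prior implied by the training set, you can reweight each posterior by the ratio of new prior to training prior and renormalize. The function name, the dictionary layout and the numbers below are made up for illustration.

```python
def adjust_for_prior(posteriors, train_prior, new_prior):
    """Rescale class posteriors from the training prior to a new prior.

    posteriors, train_prior and new_prior map class label -> probability.
    Each posterior is multiplied by new_prior / train_prior, then the
    results are renormalized so they sum to 1.
    """
    weighted = {
        label: p * new_prior[label] / train_prior[label]
        for label, p in posteriors.items()
    }
    total = sum(weighted.values())
    return {label: w / total for label, w in weighted.items()}

if __name__ == "__main__":
    # Hypothetical case: a balanced training set, but the class is
    # really only about 1% of what the system will see in use.
    scores = {"member": 0.5, "non_member": 0.5}
    train = {"member": 0.5, "non_member": 0.5}
    real = {"member": 0.01, "non_member": 0.99}
    print(adjust_for_prior(scores, train, real))
```

A score that looked like a coin flip under the balanced training set becomes a fairly confident "non_member" once the realistic prior is applied.
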
Some kinds of sampling bias can be partly overcome.  Involving 
multiple people smooths out individual bias.  (Kurzweil's project of 
stealing a human soul with a neural network is already being scooped by 
projects that are stealing statistical models of many souls.)
Language zone Wikipedias are obviously biased towards the viewpoint of 
people in that language zone.  Mostly that's a good thing,  because a 
Chinese knowledge base that reflected an Anglophone bias would seem 
unnatural to Chinese speakers.
And that's the point.  Useful systems don't "eliminate bias" but they 
are given the bias that they need in order to do their job.
I agree categories are most useful when they are the categories you 
need.  The toolbox above can help you estimate these with precision so high 
that it's difficult to measure.
Arnold S isn't the best case for categories because humans, 
bodybuilders, places, chemicals and such are well ontologized.  Look at 
the collection that comes up for the word "Intersection":
http://en.wikipedia.org/wiki/Intersection
Most of these are connected to the larger mass through just a few 
categories that would be hard to express as restriction types.  It is 
reasonable for Wikipedia to require concepts to have a category because, 
really, if you want to assert that something exists and can't find some 
category that it is a member of, I wouldn't be so sure that it exists.
I'm not sure there is anything I can't do with the current situation, 
but bear in mind that I'm also going to look at DBpedia, Wikidata and 
Freebase facts, and I'm willing to do data-cleaning processing and to 
hand-clean results that I cannot accept.  It's a tricky and somewhat 
expensive process (though cheaper than conventional ontology 
construction), so cleaner data makes it cheaper, quicker, and available 
to more end users, who can then define the categories they need for their 
own purposes.
