[WikiEN-l] The composition of Wikipedia (maybe, sort of)

Sun Oct 21 06:19:25 UTC 2007

Following up on a request from Elonka, I did an analysis using the category
web structure to crudely estimate how Wikipedia content is distributed.

The result I got was that Wikipedia "is":

9.6% - People
28.0% - Science
10.5% - Culture
16.0% - Geography
6.3% - History
0.8% - Religion
5.5% - Philosophy
1.8% - Mathematics
14.3% - Nature
6.0% - Technology
1.4% - Fiction

The basic principle I used was that each of the categories listed
above corresponds to pure X-ness, and that each child category inherits
its flavor as the average of its parent categories' flavors.[1]  Each
article then has a flavor equal to the average of the categories it is in,
and the totals come from averaging over all articles.  This approach has
some interesting consequences: Category:Scientists becomes a mix of
People-ness and Science-ness, and Category:American scientists would then
blend People-ness, Science-ness and Geography-ness, etc.  So even though
20-30% of Wikipedia articles are biographies, most are blends of People-ness
and whatever the person is known for, with the end result that this crude
measure only associates about 10% of Wikipedia content with People-ness.
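
The averaging scheme described above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions: the toy
category graph, the two sample articles, and all function names are
hypothetical, and the repeated sweeps are a simple stand-in for the actual
solver (the footnote mentions Matlab), not a reconstruction of it.

```python
# Minimal sketch of "flavor" propagation through a category graph.
# All names and data here are hypothetical, chosen only for illustration.

BASIS = ["People", "Science", "Geography"]

# Toy graph: category -> list of its parent categories (assumed data).
PARENTS = {
    "Scientists": ["People", "Science"],
    "Americans": ["People", "Geography"],
    "American scientists": ["Scientists", "Americans"],
}

# Toy articles: title -> categories the article is in (assumed data).
ARTICLES = {
    "Some American scientist": ["American scientists"],
    "Some physics topic": ["Science"],
}


def category_flavors(basis, parents, sweeps=100):
    """Give each basis category a pure flavor, then repeatedly set every
    other category to the average of its parents' flavors."""
    flavor = {b: {k: 1.0 if k == b else 0.0 for k in basis} for b in basis}
    for cat in parents:
        flavor.setdefault(cat, {k: 0.0 for k in basis})
    for _ in range(sweeps):
        for cat, ps in parents.items():
            if cat in basis:
                continue  # basis categories stay pure
            flavor[cat] = {
                k: sum(flavor[p][k] for p in ps) / len(ps) for k in basis
            }
    return flavor


def article_flavor(categories, flavor, basis):
    """An article's flavor is the average over the categories it is in."""
    return {
        k: sum(flavor[c][k] for c in categories) / len(categories)
        for k in basis
    }


def composition(articles, flavor, basis):
    """Overall composition is the average flavor over all articles."""
    per_article = [article_flavor(c, flavor, basis) for c in articles.values()]
    return {k: sum(a[k] for a in per_article) / len(per_article) for k in basis}


flavors = category_flavors(BASIS, PARENTS)
totals = composition(ARTICLES, flavors, BASIS)
```

In this toy graph, "American scientists" comes out as half People-ness and a
quarter each of Science-ness and Geography-ness, which is the kind of
dilution of biographies described above.  The fixed number of sweeps is
enough for a small acyclic example; a real category web with loops would
need a convergence check or a direct linear solve.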

Whether this is desirable or not is of course subjective.  Consider: how
would you count Scientists if asked what fraction of the encyclopedia is
about Science vs. Biographies?  In some sense the question isn't even
sensible, since it is not really an either/or proposition and some articles
are about both Science AND People.  The same problem exists for essentially
any set of categories.  The approach I used counts these problem cases a
little towards each relevant category, but other solutions are of course
possible.

On top of this, there is the problem that the category tree... er, web...
sucks.  There are many places where the categories meander sideways, like
Water sports->Sailing->Winds->Wind power, so the descendant grandchildren
have very little to do with the grandparents.  If you want a challenge, find
the path that leads Science to Religion and back to Science (yes, such a path
exists purely through category children).  In fact, each of Wikipedia's
"Category:Main topic classifications" categories shares nearly all the same
children, just at different depths of organization.

Oh, and then there is the problem that the category structure doesn't
necessarily make sense.  For example, Natural science and Applied Science
are both "Main topics" but their obvious parent, Science, is not.

So anyway, the category web sucks and the idea of breaking Wikipedia content
into discrete categories is somewhat nonsensical, but if one wants to try,
it might look something like the list I gave above.  Lots of Science and
Nature (which in practice means all those stubs of living things and
astronomical objects).  Many places, cities, states and Rambot fodder.
Substantial amounts of people, culture, and history (which overlap in a
variety of ways), and modest amounts of other things.  The fiction number
was lower than I expected, but that may be because it was diluted against
the entertainment side of Culture.  I also suspect that the Science number
is a bit stacked because virtually everything is somebody's science (be it
social science, political science, military science, etc.).

Whether these results are useful (or even interesting) is a fair question,
and I don't know.  Aside from subjectively deciding on some starting set of
categories, it is an "objective" measure.  However, one might well get more
meaningful results by subjectively sorting a few thousand random articles.
I could also repeat this experiment with a different set of basis categories
if people have suggestions.

Anyway, I hope this helps Elonka's curiosity and is interesting to at least
someone (if only because it makes you think about the logical problems
associated with categorization).

-Robert Rohde

[1] The formal description of this form of flow modeling involves solving a
set of 200,000+ simultaneous linear vector equations, one for each
category.  Glory be unto Matlab.
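
One way to write that system down, as a sketch rather than the exact
formulation used: the symbols below are all assumptions introduced for
illustration.  Here f_c is the flavor vector of category c, e_b is the unit
vector for basis category b, P(c) is the set of parents of c, C(a) is the set
of categories containing article a, N is the number of articles, and T is the
final composition.

```latex
\begin{align}
f_b &= e_b && \text{for each basis category } b, \\
f_c &= \frac{1}{|P(c)|} \sum_{p \in P(c)} f_p && \text{for every other category } c, \\
g_a &= \frac{1}{|C(a)|} \sum_{c \in C(a)} f_c && \text{for each article } a, \\
T &= \frac{1}{N} \sum_{a} g_a .
\end{align}
```

The second line gives one vector equation per non-basis category, which
matches the count of one equation per category mentioned in the footnote.
Because the category web contains loops, these equations generally have to
be solved simultaneously rather than evaluated from the top down.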