[WikiEN-l] If anyone ever says Wikipedia is too deletionist

Andrew Gray andrew.gray at dunelm.org.uk
Mon Aug 10 14:01:23 UTC 2009


2009/8/9 Carcharoth <carcharothwp at googlemail.com>:

> If anyone could hazard a guess at how many of the 725,635 biographies
> we have where there might be a dispute over gender, that would be good
> (note that for some reason that figure, from the "WikiProject
> Biography" statistics, includes music groups, and also some other
> "group biographies", rather than "single biographies"). But really, if
> it is only a couple of hundred where the gender is disputed or not
> known, then there should be no objection to classifying the others by
> gender.

Okay, estimate time!

When LibraryThing began their "common knowledge" cataloging program -
essentially an attempt to gather structured information on books and
authors via a user-editable database - they tangled briefly with the
problem of gender for authors.

On the one hand, it's a very important detail to record, if only from
a pragmatic perspective - hang around a bookshop or a library and see
how long until someone starts looking for "a female crime novelist",
etc. For practical reasons, they wanted it to be a restricted "this
field has value X" record rather than free text, which was used for
almost everything else.

On the other hand, it's even more complex for books than for our
biographies, as many books are authored by someone about whom even the
most basic biographical information is unknown, or who isn't a real
person at all, before we even worry about people who don't fit the
normal classifications.

In the end, they went with a fourfold structure:

* male
* female
* other/contested/unknown
* n/a

The third was for people who don't fit neatly into the first two, for
whatever reason; the fourth was for corporate bodies, and so also
served as a way to differentiate real people from not-real people.

This is quite handy, because the ratio of the third to the first two
gives us some idea of what we're likely to encounter in Wikipedia - it
won't be the same, but it'll be the right order of magnitude. There
are currently 8,736 "n/a", 57,047 "female", 118,069 "male"... and 431
"other". Roughly speaking, about 0.25% of catalogued people (431 out
of some 175,500, leaving aside the n/a records) aren't defined neatly
as male or female. Scaling that up to Wikipedia would
mean we'd be looking at, at most, 1,500 to 2,000 biographies where we
shouldn't simply do male/female.
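
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python. It uses only the counts quoted in this thread; the function
names are made up for illustration, and leaving the "n/a" records out
of the denominator is my assumption about what "catalogued people"
should mean.

```python
# Counts quoted in this message (LibraryThing "common knowledge" data).
MALE = 118_069
FEMALE = 57_047
OTHER = 431  # other/contested/unknown

# WikiProject Biography figure from the quoted message. It includes
# music groups and other group biographies, so it overstates the
# number of single-person articles.
WIKIPEDIA_BIOGRAPHIES = 725_635


def other_share(male: int, female: int, other: int) -> float:
    """Fraction of catalogued people recorded as 'other'.

    Assumption: 'n/a' records (corporate bodies etc.) are excluded.
    """
    return other / (male + female + other)


def scaled_estimate(share: float, biographies: int) -> float:
    """Apply the same fraction to a different number of biographies."""
    return share * biographies


share = other_share(MALE, FEMALE, OTHER)
estimate = scaled_estimate(share, WIKIPEDIA_BIOGRAPHIES)

print(f"share of 'other': {share:.2%}")
print(f"scaled to Wikipedia: roughly {estimate:.0f} biographies")
```

The result lands inside the 1,500 to 2,000 range; since the 725,635
figure includes groups, it is best read as an upper bound.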

Given that not all the "other" cases are people who fall outside the
binary - the data is a bit choppy and includes some who should be n/a,
plus oddities like joint pseudonyms - our proportion would probably be
lower. The chronological weighting of the two datasets complicates
matters; a set of authors will skew towards modernity, but then again
more than half our biographies are BLPs, so we ourselves also skew
towards modernity. I can't say which of those is the stronger pull!

So I think, all told, we're going to be looking at a few more than a
couple of hundred, but perhaps not more than a thousand cases. If
we're consciously trying to get good coverage of people who fall
outside the usual classification, and addressing those articles
rigorously - itself not a bad idea - we might end up pushing a couple
of thousand.

-- 
- Andrew Gray
  andrew.gray at dunelm.org.uk


