2009/8/9 Carcharoth carcharothwp@googlemail.com:
If anyone could hazard a guess at how many of the 725,635 biographies we have where there might be a dispute over gender, that would be good (note that for some reason that figure, from the "WikiProject Biography" statistics, includes music groups, and also some other "group biographies", rather than "single biographies"). But really, if it is only a couple of hundred where the gender is disputed or not known, then there should be no objection to classifying the others by gender.
Okay, estimate time!
When LibraryThing began their "common knowledge" cataloging program - essentially an attempt to gather structured information on books and authors via a user-editable database - they tangled briefly with the problem of gender for authors.
On the one hand, it's a very important detail to record, if only from a pragmatic perspective - hang around a bookshop or a library and see how long until someone starts looking for "a female crime novelist", etc. For practical reasons, they wanted it to be a restricted "this field has value X" record rather than the free-text entry used for almost everything else.
On the other hand, it's even more complex for books than for our biographies, as many books are authored by someone about whom even the most basic biographical information is unknown, or who isn't a real person at all, before we even worry about people who don't fit the normal classifications.
In the end, they went with a fourfold structure:
* male
* female
* other/contested/unknown
* n/a
The third was for people who don't fit neatly into the first two, for whatever reason; the fourth was for corporate bodies, and so also served as a way to differentiate real people from not-real people.
This is quite handy, because the ratio of the third to the first two gives us some idea of what we're likely to encounter in Wikipedia - it won't be the same, but it'll be the right order of magnitude. There are currently 8,736 "n/a", 57,047 "female", 118,069 "male"... and 431 "other". Roughly speaking, about 0.25% of catalogued people (431 of 175,547, leaving out the "n/a" entries) aren't defined neatly as male or female. Scaling that up to the 725,635 Wikipedia biographies would mean we'd be looking at, at most, 1,500 to 2,000 biographies where we shouldn't simply do male/female.
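For anyone who wants to check the arithmetic, here is a quick sketch in Python. It uses only the counts quoted above; the variable names are mine, and leaving the "n/a" entries out of the denominator is my assumption about what "catalogued people" should mean here.

```python
# Counts quoted above from LibraryThing's "common knowledge" gender field.
counts = {"n/a": 8_736, "female": 57_047, "male": 118_069, "other": 431}

# WikiProject Biography figure quoted at the top of the message.
wikipedia_biographies = 725_635

# Assumption: "n/a" entries are corporate bodies, not people,
# so they are excluded from the denominator.
people = counts["female"] + counts["male"] + counts["other"]
other_share = counts["other"] / people

# Naive scaling: apply the same share to the Wikipedia biography count.
estimate = other_share * wikipedia_biographies

print(f"share of 'other' among catalogued people: {other_share:.2%}")
print(f"scaled to Wikipedia: roughly {estimate:.0f} biographies")
```

The scaled figure lands inside the 1,500 to 2,000 range; it is an upper-end guess, since it treats every "other" record as a genuine case.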
Given that not all the "other" cases are people who fall outside the binary - the data is a bit choppy and includes some who should be n/a, plus oddities like joint pseudonyms - our proportion would probably be lower. The chronological weighting of the two datasets complicates matters; a set of authors will skew towards modernity, but then again more than half our biographies are BLPs, so we ourselves also skew towards modernity. I can't say which of those is the stronger pull!
So I think, all told, we're going to be looking at a few more than a couple of hundred, but perhaps not more than a thousand cases. If we're consciously trying to get good coverage of people who fall outside the usual classification, and addressing those articles rigorously - itself not a bad idea - we might end up pushing a couple of thousand.