On 4 April 2012 17:28, Carcharoth carcharothwp@googlemail.com wrote:
In principle that shouldn't be too hard to do, with Catscan 2.0 to intersect categories for you. In practice the toolserver can't be taken for granted, and it seems that the naive way of doing this produces a list that is just too big (I took sub-categories to depth 5 there). To get an idea, if you intersect 1950 births with people stubs you get something over 2,000, which suggests the magnitude of the problem might be around 100,000.
This presumes around 2,000 for every birth year from 1950 to 2000? It might not be exactly that, but something of that order of magnitude. Thanks. I wish the toolserver and tools like that wouldn't trip up or time out over large jobs like this; the inability to get a true sense of the bigger picture is itself a potential point of failure.
Catscan has always been quite slow - it's fair enough, I suppose, when you consider it's having to match item-by-item between two very large and dynamically generated lists! I wonder if it's possible to tell it to just return a count of matching articles, rather than the full list, when you expect the result to be unusually large?
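(For what it's worth, just counting the overlap isn't hard if you pull the member lists yourself from the API - here's a minimal sketch in Python using the standard categorymembers query and the 'requests' library. It only looks at direct members of the two categories named above, whereas CatScan was walking sub-categories to depth 5, so it won't reproduce those numbers; it's only meant to illustrate the "return a count, not a list" idea.)

    # Minimal sketch: count how many pages sit in both categories by
    # intersecting their member lists from the MediaWiki API, rather than
    # building a huge combined listing.
    # Note: this fetches only *direct* members of each category; CatScan was
    # also walking sub-categories (to depth 5 above), which this does not do.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def category_page_ids(category):
        """Return the set of page IDs that are direct members of a category."""
        ids = set()
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmlimit": "500",
            "format": "json",
        }
        while True:
            data = requests.get(API, params=params).json()
            ids.update(m["pageid"] for m in data["query"]["categorymembers"])
            if "continue" not in data:
                return ids
            params.update(data["continue"])  # carry cmcontinue into the next request

    births = category_page_ids("Category:1950 births")
    stubs = category_page_ids("Category:People stubs")
    print(len(births & stubs))  # just the count, not the full list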
That aside, approximately two thirds of rated biography articles are stubs, judging by talkpage assessments. If this generalises to BLPs, we're talking a little under 400,000. *However*, this has two major caveats.
Firstly, I suspect our BLPs are less likely to be stubs than other articles; they skew strongly towards topics from the past twenty years, which tend to be better documented, so it's easier for a casual editor to bring them up to a decent size.
Secondly, talkpage ratings (and stub templates on articles, for that matter) are notoriously laggy. A sizable proportion of articles nominally rated as stubs are not stubs by any reasonable definition: they were rated a long time ago and have since been expanded and improved dramatically, but the ratings rarely get updated by the editors who work on them. This is the same phenomenon that leaves maintenance templates at the top of articles years after the problems are resolved, and it has the same effect of making things seem worse than they are. Depending on the topic, anything from 10% to 25% of articles marked as stubs probably aren't, in the sense that they have nontrivial content and serve as more than a placeholder.
Putting these together, I would hazard a wild guess that no more than half of our BLPs - about a quarter of a million entries - are stubs. I'm not sure I'd go as low as 100,000, but it's interesting how divergent the estimates from different sources are...
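(Spelling out the arithmetic behind those figures, purely as a back-of-envelope check - every input is one of the rough numbers quoted above, so treat the outputs as orders of magnitude, nothing more:)

    # Rough arithmetic behind the estimates in this thread; all inputs are
    # the approximate figures quoted above, nothing more precise than that.
    catscan_estimate = 2_000 * 51      # ~2,000 stubs per birth year, 1950-2000 -> ~100,000

    ratings_estimate = 400_000         # ~2/3 of rated biographies are stubs, applied to BLPs

    # Rating lag: 10-25% of nominally stub-rated articles aren't really stubs.
    corrected_low  = ratings_estimate * (1 - 0.25)   # ~300,000
    corrected_high = ratings_estimate * (1 - 0.10)   # ~360,000

    final_guess = 250_000              # "no more than half our BLPs", per the paragraph above

    print(catscan_estimate, corrected_low, corrected_high, final_guess)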