[WikiEN-l] "Corporate Representatives for Ethical Wikipedia Engagement"

Andrew Gray andrew.gray at dunelm.org.uk
Wed Apr 4 19:16:34 UTC 2012


On 4 April 2012 17:28, Carcharoth <carcharothwp at googlemail.com> wrote:

>> In principle that shouldn't be too hard to do, with Catscan 2.0 to
>> intersect categories for you. In practice the toolserver can't be taken for
>> granted. And it seems that the naive way of doing this produces a list that
>> is just too big (I took sub-categories to depth 5 there). To get an idea,
>> if you do 1950 births intersect people stubs you get something over 2000.
>> Which suggests the magnitude of the problem might be around 100,000.
>
> This presumes 2000 every year from 1950 to 2000? Might not be that,
> but something of that order of magnitude. Thanks. I wish the
> toolserver and tools like that wouldn't trip up or time out over large
> stuff like that. The inability to get a true sense of the bigger
> picture can lead to potential failure points.
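For what it's worth, the extrapolation in the quoted figures is just (stubs per birth-year cohort) times (number of cohorts). A minimal sketch, assuming the ~2,000 figure observed for the 1950 cohort holds roughly constant across the ~50 birth years covering most living people (both numbers come from the thread, not from a fresh measurement):

```python
# Back-of-envelope for the quoted ~100,000 estimate.
# ASSUMED: ~2,000 stub intersections per birth-year category
# (observed for "1950 births" x people stubs) and ~50 cohorts
# (roughly 1950-2000), per the discussion above.
stubs_per_birth_year = 2000
birth_year_cohorts = 50

estimated_blp_stubs = stubs_per_birth_year * birth_year_cohorts
print(estimated_blp_stubs)  # 100000
```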

Catscan has always been quite slow - it's fair enough, I suppose, when
you consider it's having to match item-by-item in two very large and
dynamically generated lists! I wonder if it's possible to tell it to
just return a figure for matching articles, rather than a list, when
you expect it to be unusually large?

That aside, approximately two thirds of rated biography articles are
stubs, judging by talkpage assessments. If this generalises to BLPs,
we're talking a little under 400,000. *However*, this has two major
caveats.

Firstly, I suspect our BLPs are less likely to be stubs than other
biographies; they skew strongly towards topics from the past twenty
years, which tend to be better documented, so it's easier for a
casual editor to bring them up to a decent size.

Secondly, talkpage ratings (and stub templates on articles, for that
matter) are notoriously laggy. A sizable proportion of articles
nominally rated stubs are not stubs by any reasonable definition; they
were rated a long time ago, and have since expanded and improved
dramatically. However, the ratings often don't get updated by the
editors who expand the articles; this is the same phenomenon that
leaves maintenance templates at the top of articles years after the
problems are resolved, and it has the same effect of making things
seem worse than they are. Depending on the topic, anything from 10%
to 25% of articles marked as stubs probably aren't, in the sense that
they have nontrivial content and serve as more than a placeholder.

Putting these together, I would take a wild stab and say that it is
unlikely that more than half our BLPs - about a quarter of a million
entries - are stubs. I'm not sure I'd go as low as 100,000, but it's
interesting how divergent the estimates from different sources are...
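Putting rough numbers on the reasoning above. The total BLP count here is an assumption back-derived from "half our BLPs - about a quarter of a million entries"; the two-thirds stub rate and the 10-25% lag correction are the figures given earlier in this email:

```python
# Rough reconciliation of the estimates in this email.
# ASSUMED: ~500,000 total BLPs (implied by "half our BLPs -
# about a quarter of a million entries").
total_blps = 500_000
raw_stub_rate = 2 / 3          # share of rated biographies marked as stubs
lag_correction = (0.10, 0.25)  # fraction of nominal "stubs" that aren't

raw_estimate = total_blps * raw_stub_rate          # ~333,000 nominal stubs
corrected = [raw_estimate * (1 - f) for f in lag_correction]
print(round(raw_estimate), [round(c) for c in corrected])
# Lag-corrected range lands around 250,000-300,000 - consistent with
# "unlikely that more than half our BLPs ... are stubs".
```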

-- 
- Andrew Gray
  andrew.gray at dunelm.org.uk
