On Tue, Sep 06, 2005 at 11:52:21PM +0200, Lars Aronsson wrote:
Paweł Dembowski wrote:
It seems to me that Swedish Wikipedia is quite the opposite - they have over 100,000 articles mostly because of the huge amount of substubs...
I agree that this is embarrassing and should be addressed. I think that the Danish Wikipedia, with 30,000 articles, has an even higher percentage of (sub-)stubs than the Swedish one, but this is just a feeling and I have no numbers to prove it.
We need a statistic for the number of (sub-)stubs, so we can talk about verifiable numbers (and set goals) instead of guesstimates. How do we define that? Is the ">200 ch" count ("alternative" article count, [1]) in Erik Zachte's Wikistats a good metric? Or the percentage of articles longer than 0.5 kilobytes [2]? I think 200 characters is an OK stub, but perhaps a substub is less than 70 characters?
This leaves us with the Special:Shortpages page. That page has the advantage of being instantly updated, which Wikistats is not.
The Swedish Wikipedia has 421 articles (0.4% of 102K) shorter than 70 bytes and the Danish has 351 (1.1% of 31K). As a comparison, the Dutch Wikipedia has 79 (0.08% of 89K) and the Polish has 387 (0.4% of 93K). This makes the Polish look just as bad as the Swedish, since both have 0.4% of articles shorter than 70 bytes. But perhaps a substub should be defined at 50 bytes instead? Or 100 bytes or 150?
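For anyone who wants to redo the arithmetic with a different threshold, here is a minimal sketch of the percentage calculation. It assumes the rounded article totals quoted above (102K, 31K, 89K, 93K) rather than exact counts, and the function name and the language-code keys are only illustrative.

```python
def short_page_share(short_count, total_articles):
    """Return the percentage of articles that fall below the length threshold."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return 100.0 * short_count / total_articles


# (pages shorter than 70 bytes, approximate total articles), as quoted above.
# The totals are rounded, so the resulting percentages are approximate too.
wikis = {
    "sv": (421, 102_000),
    "da": (351, 31_000),
    "nl": (79, 89_000),
    "pl": (387, 93_000),
}

shares = {code: short_page_share(short, total)
          for code, (short, total) in wikis.items()}

# Print the wikis from the highest to the lowest share of very short pages.
for code, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{code}: {share:.2f}%")
```

Changing the threshold to 50, 100 or 150 bytes only changes the counts fed in, so the same function can be reused once those counts are read off Special:Shortpages.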
Numbers like 0.4% of articles say more about the effectiveness of the wiki cleanup process than about the typical article. (By the way, Special:Shortpages is not updated live on Wikimedia servers.)
Just take a look at the list of shortest pages on Polish Wikipedia - they're almost all:
* Redirects (what are they doing on the list?)
* Disambiguation pages without descriptions for the links. Sometimes articles have titles so obvious that {{disambig}} + a list of the links is enough.
* A few cases of things that look like leftovers of past technical problems
* A few cases of things that should be deleted immediately, but have been missed or are simply too recent and will be deleted soon
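A rough sketch of how such pages could be filtered out before counting substubs. This is only an illustration under stated assumptions: the page titles and texts are made up, the redirect check only knows the English `#REDIRECT` keyword (localized aliases are not handled), and disambiguation pages are detected only by a literal `{{disambig}}` tag.

```python
import re

# Assumption: only the English redirect keyword is recognized here.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT", re.IGNORECASE)


def is_substub(wikitext, threshold_bytes=70):
    """Return True if the page is an ordinary article shorter than the threshold."""
    if REDIRECT_RE.match(wikitext):
        return False  # redirects are not articles
    if "{{disambig}}" in wikitext.lower():
        return False  # disambiguation pages may legitimately be short
    return len(wikitext.encode("utf-8")) < threshold_bytes


# Hypothetical sample pages, for illustration only.
pages = {
    "Example redirect": "#REDIRECT [[Example target]]",
    "Example disambiguation": "{{disambig}}\n* [[Example (a)]]\n* [[Example (b)]]",
    "Example substub": "A village.",
    "Example stub": "x" * 200,
}

substubs = [title for title, text in pages.items() if is_substub(text)]
print(substubs)
```

With a filter like this, the share of pages below 70 bytes would measure actual substubs rather than redirects and bare disambiguation pages.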