[Foundation-l] Tragical dynamics: that run for the number of articles

Lars Aronsson lars at aronsson.se
Sun Jun 29 14:44:56 UTC 2008


Stephen Bain wrote:

> Swedish Wikipedia is distributed in almost exactly the same way 
> as English Wikipedia, with the difference being that its average 
> size is less than half that of En's, at around 1900 bytes.

In Sweden we have the problem that people in our capital are so 
arrogant towards the rest of the country.  This is of course 
not unique to Sweden; it happens in every country.  
Americans just cannot stand New Yorkers (the big city) and policy 
makers in Washington DC.  People in Russia probably hate people in 
Moscow.  Europe is united in its dislike of the EU bureaucracy 
headquarters in Brussels. Talking to other Swedes of my age, it is 
easier to discuss individual streets in San Francisco than streets 
in a neighboring Swedish city where none of us have been.  We know 
that the U.N. headquarters is on the east side of Manhattan and 
that Wall Street is within walking distance of Battery Park.  We 
all tend to look upwards.  Swedes look to Stockholm, Paris, London 
and New York, but less often to Latvia or Bangladesh or Malawi.

All Swedish wikipedians know about the Swedish and the English 
Wikipedia.  Of course the English one is much larger.  The fact 
that Ingmar Bergman's English article is almost three times as 
long (44K) as the Swedish one (16K) is taken for granted, not 
treated as an urgent crisis that needs to be addressed.  The 
Swedish Wikipedia has 
articles about small Swedish places and people that don't have 
articles in the English Wikipedia, and so their sizes cannot be 
compared.  It is also well known that the Swedish Wikipedia is 
slightly larger than our immediate neighboring languages Finnish, 
Norwegian and Danish.

The idea of comparing the Swedish Wikipedia to the one in Czech, 
with the same number of speakers (10M), just doesn't occur to 
anyone, because nobody in Sweden speaks Czech and most would 
believe you need to know the language in order to understand 
anything.  I guess it is mutual: the Czechs speak their own 
language plus English and German and a bit of Polish, but nobody 
would care about Swedish.

It was only recently that the stubbiness of the Swedish Wikipedia, 
relative to other languages of Wikipedia, started to come into 
public awareness.  Most regular wikipedians are now aware of this.  
But when we start to talk about improving quality, average Swedes 
who are occasional contributors still don't understand what this 
is about.  They think we want to compete with traditional printed 
encyclopedias, and that we are losing our soul as a Net project.  
Why should any article be deleted, when storage is so cheap?  Why 
require source citations and complete sentences and birth dates?  
Surely someone else is going to add that later.

It's easy to measure the average size of articles, but much harder 
to change it. Even if you add 50 bytes to each article, that would 
only move the average by 50 bytes and not from 2K to 4K. Just by 
translating from English, you can add 30K to the Swedish article 
on Ingmar Bergman, but that doesn't move the average size.
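
A minimal sketch of this arithmetic, using the rounded figures 
from this post (about 270,000 real articles, an average of about 
2,000 bytes).  The function name is made up for illustration.

```python
def new_average(total_bytes, n_articles, added_bytes):
    """Average article size after adding `added_bytes` anywhere in the wiki."""
    return (total_bytes + added_bytes) / n_articles

N = 270_000   # rounded number of real articles (assumption from the post)
AVG = 2_000   # rounded average article size in bytes (assumption)
total = N * AVG

# Adding 50 bytes to every article moves the average by exactly 50 bytes.
print(new_average(total, N, 50 * N) - AVG)            # 50.0

# Adding 30K to one single article moves it by 30000 / 270000 bytes.
print(round(new_average(total, N, 30_000) - AVG, 3))  # 0.111
```

One long article, however good, shifts the average by about a 
tenth of a byte.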

What I did was instead to look at the very smallest articles.  
There's a page [[Special:Shortpages]] for this, but unfortunately 
it contains both short articles and disambiguation pages.  From 
the database dumps, I could filter out the disambiguation pages, 
of which there are roughly 10,000 on the Swedish Wikipedia, 
leaving roughly 270,000 real articles.
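
A minimal sketch of that filtering step, assuming the dump has 
already been parsed into (title, wikitext) pairs.  The marker 
"{{disambig}}" is a placeholder and not necessarily the template 
any particular wiki uses; the function name is hypothetical too.

```python
def real_article_sizes(pages, disambig_markers=("{{disambig}}",)):
    """Yield the UTF-8 byte size of every page that is not a disambiguation.

    `pages` is any iterable of (title, wikitext) pairs.
    `disambig_markers` is an assumption: replace it with the
    templates the wiki in question really uses.
    """
    markers = tuple(m.lower() for m in disambig_markers)
    for title, text in pages:
        lowered = text.lower()
        if any(m in lowered for m in markers):
            continue  # disambiguation page, not a real article
        yield len(text.encode("utf-8"))

sample = [
    ("Ö", "Ö är en ö."),
    ("Mercury", "Mercury kan avse:\n* planeten\n{{Disambig}}"),
]
sizes = list(real_article_sizes(sample))
print(sizes)  # [13]
```

Sizes are counted in bytes rather than characters, so letters 
like å, ä and ö count as two bytes each.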

I found that 0.1 percent (270 articles) were shorter than 90 
bytes.  This was better than the Arabic Wikipedia, where 0.1 
percent are shorter than 62 bytes, Latvian with 78 bytes and 
Estonian with 81 bytes.  But it was far worse than Danish at 130 
bytes, Polish at 160, Czech at 213, German at 280 and Russian at 
359 bytes.  This was in April.  
By merging and improving the very shortest articles, the Swedish 
Wikipedia's 0.1 percent shortest articles now reach 145 bytes.
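
A sketch of how such a threshold can be read off a list of 
article sizes.  This is one possible convention (the size of the 
k-th shortest article, with k rounded to the nearest whole 
article), not necessarily the exact one behind the numbers above; 
the function name is hypothetical.

```python
def size_at_fraction(sizes, fraction):
    """Size of the k-th shortest article, where k is `fraction` of all articles."""
    ordered = sorted(sizes)
    if not ordered:
        raise ValueError("no article sizes given")
    # round() absorbs tiny floating-point errors in fraction * n
    k = max(1, round(fraction * len(ordered)))
    return ordered[k - 1]

# Toy data: 1000 articles of 1, 2, ..., 1000 bytes.
toy = list(range(1, 1001))
print(size_at_fraction(toy, 0.001))  # 1
print(size_at_fraction(toy, 0.01))   # 10
print(size_at_fraction(toy, 0.10))   # 100
```

The same function works for the 0.1, 1 and 10 percent levels, so 
one pass over the sizes is enough to track all three.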

In the next step, I found that 1.0 percent of articles (2700 
articles) were shorter than 126 bytes, which was far worse than 
any other language I looked at.  But during April and May this 
improved to 171 bytes, which is better than Arabic and Estonian.  
During all of April, the Swedish Wikipedia didn't grow at all, 
because stubs were being merged and removed faster than new 
articles were written.  This is when the Russian Wikipedia, on 
May 19, took the 10th position at 283K articles.

By addressing the very shortest articles, we remove the easiest 
excuse for people to create new very short stubs.  They can no 
longer point to other articles that are only 100 bytes long.  
After the lower limit is pushed from 120 to 140 bytes, it will 
continue to push upwards to 160 and 180 bytes.

Some might ask why we don't just remove everything shorter than 
200 or 300 bytes.  You are welcome to try to propose this, but so 
far every such proposal has met solid resistance.  Instead I 
found 20 articles saying "X is an island in Y archipelago" and 
merged them into a nice article on "Y archipelago" with a map, so 
that everybody can agree that the new style is a real improvement.  
With that sort of aggregation, it becomes more obvious that island 
Z is missing or that island W doesn't really belong. Fact checking 
and quality assurance are improved by merging stubs. It's all 
about setting an example for the future.  The average article size 
will move slowly over the next few years.
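
Finding such merge candidates can be partly automated.  A rough 
sketch, assuming the stubs follow one fixed English wording; real 
stubs would need a looser, language-specific pattern, and both 
the pattern and the function name are hypothetical.  The merging 
itself, and the map, remain manual work.

```python
import re
from collections import defaultdict

# Hypothetical stub wording, for illustration only.
STUB = re.compile(r"^(?P<island>.+?) is an island in (?P<group>.+?) archipelago\.?$")

def merge_candidates(stub_texts, min_group=2):
    """Group one-line island stubs by archipelago; keep groups worth merging."""
    groups = defaultdict(list)
    for text in stub_texts:
        match = STUB.match(text.strip())
        if match:
            groups[match["group"]].append(match["island"])
    return {g: sorted(names) for g, names in groups.items() if len(names) >= min_group}

stubs = [
    "Z is an island in Y archipelago.",
    "X is an island in Y archipelago.",
    "W is an island in Q archipelago.",
    "Something unrelated.",
]
print(merge_candidates(stubs))  # {'Y': ['X', 'Z']}
```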

The next medium-term goal is to look at the 10 percent shortest 
articles (27K of them), the "lower decile".  For the Swedish 
Wikipedia, this has moved from 312 bytes in April to 349 bytes in 
June.  For the Lithuanian, Estonian, Arabic and Danish Wikipedias it 
is around 400 bytes.  Norwegian and Icelandic are around 500, 
Polish at 638.  The French, Finnish and Hungarian are around 700. 
Latvian has 808, Czech 917, German 1081 and the Russian Wikipedia 
has 10 percent of its articles shorter than 1305 bytes.

If you believe the Polish 638 bytes is really bad, it's just 
because you haven't looked closely enough at Danish and Norwegian. 
In fact, it's not so much worse than the French 682 bytes. You are 
welcome to make fun of the Swedish stubs, because it is an area 
where we can show constant improvement.

This hunt for "substubs" is just one of several quality 
improvement initiatives currently running on the Swedish 
Wikipedia.




-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se


