[Foundation-l] Tragical dynamics: that run for the number of articles
Lars Aronsson
lars at aronsson.se
Sun Jun 29 14:44:56 UTC 2008
Stephen Bain wrote:
> Swedish Wikipedia is distributed in almost exactly the same way
> as English Wikipedia, with the difference being that its average
> size is less than half that of En's, at around 1900 bytes.
In Sweden we have the problem that people in our capital are so
arrogant towards the rest of the country. This is of course
nothing specific to Sweden; it happens in every country.
Americans just cannot stand New Yorkers (the big city) and policy
makers in Washington DC. People in Russia probably hate people in
Moscow. Europe is united in its dislike of the EU bureaucracy
headquarters in Brussels. Talking to other Swedes of my age, it is
easier to discuss individual streets in San Francisco than streets
in a neighboring Swedish city where none of us have been. We know
that the U.N. headquarters is on the east side of Manhattan and
that Wall Street is within walking distance of Battery Park. We
all tend to look upwards. Swedes look to Stockholm, Paris, London
and New York, but more rarely to Latvia or Bangladesh or Malawi.
All Swedish wikipedians know about the Swedish and the English
Wikipedia. Of course the English one is much larger. The fact that
Ingmar Bergman's English article is three times longer (44K) than
the Swedish one (16K) is taken for granted and not seen as an
urgent crisis that needs to be addressed. The Swedish Wikipedia has
articles about small Swedish places and people that don't have
articles in the English Wikipedia, and so their sizes cannot be
compared. It is also well known that the Swedish Wikipedia is
slightly larger than our immediate neighboring languages Finnish,
Norwegian and Danish.
The idea of comparing the Swedish Wikipedia to the one in Czech,
which has the same number of speakers (10M), just doesn't occur to
anyone, because nobody in Sweden speaks Czech and most would believe
you need to know the language in order to understand anything. I
guess it is mutual: Czechs speak their own language plus English and
German and a bit of Polish, but nobody would care about Swedish.
It was only recently that the stubbiness of the Swedish Wikipedia,
relative to other languages of Wikipedia, started to come into
public awareness. Most regular wikipedians are now aware of this.
But when we start to talk about improving quality, average Swedes
who are occasional contributors still don't understand what this
is about. They think we want to compete with traditional printed
encyclopedias, and that we are losing our soul as a Net project.
Why should any article be deleted, when storage is so cheap? Why
require source citations and complete sentences and birth dates?
Surely someone else is going to add that later.
It's easy to measure the average size of articles, but much harder
to change it. Even if you add 50 bytes to each article, that would
only move the average by 50 bytes and not from 2K to 4K. Just by
translating from English, you can add 30K to the Swedish article
on Ingmar Bergman, but that barely moves the average size.
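The arithmetic behind this can be sketched in a few lines of Python.
The article count (270,000) and the starting average (2,000 bytes) are
round figures taken from this message, not exact measurements, and
new_mean is just an illustrative helper name.

```python
# Round figures from the message: about 270,000 real articles
# averaging about 2,000 bytes each (assumed, not measured here).
ARTICLES = 270_000
MEAN_BYTES = 2_000


def new_mean(added_bytes_total: int) -> float:
    """Mean article size after adding the given number of bytes in total."""
    return (ARTICLES * MEAN_BYTES + added_bytes_total) / ARTICLES


# Adding 50 bytes to every single article shifts the mean by 50 bytes.
shift_all = new_mean(50 * ARTICLES) - MEAN_BYTES

# Adding 30,000 bytes to one article shifts the mean by 30,000 / 270,000 bytes.
shift_one = new_mean(30_000) - MEAN_BYTES

print(f"50 bytes on every article: +{shift_all:.2f} bytes")
print(f"30K on one article:        +{shift_one:.2f} bytes")
```

A large improvement to one article is spread over all 270,000 of them,
so the average hardly notices it.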
What I did was instead to look at the very smallest articles.
There's a page [[Special:Shortpages]] for this, but unfortunately
it contains both short articles and disambiguation pages. From
the database dumps, I could filter out the disambiguation pages
(about 10,000 on the Swedish Wikipedia), leaving 270,000 real
articles.
I found that 0.1 percent (270 articles) were shorter than 90
bytes. This was better than the Arabic Wikipedia where 0.1 percent
are shorter than 62 bytes, Latvian with 78 bytes and Estonian with
81 bytes. But it is far worse than Danish 130 bytes, Polish 160,
Czech 213, German 280 and Russian 359 bytes. This was in April.
By merging and improving the very shortest articles, the Swedish
Wikipedia's 0.1 percent shortest articles now reach 145 bytes.
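A minimal sketch of this kind of measurement, assuming the dump has
already been reduced to (title, size in bytes, is-disambiguation)
records. That record layout, the sample data and the function name
cutoff_size are hypothetical and only for illustration; this is not
the script I actually ran.

```python
def cutoff_size(sizes, fraction):
    """Size of the largest article among the shortest `fraction` of articles."""
    if not sizes:
        raise ValueError("no article sizes given")
    ordered = sorted(sizes)
    # Number of articles in the shortest group, at least one.
    k = max(1, round(len(ordered) * fraction))
    return ordered[k - 1]


# Hypothetical sample: 1000 real articles of 51..1050 bytes
# plus one disambiguation page that must not be counted.
pages = [(f"Article {n}", 50 + n, False) for n in range(1, 1001)]
pages.append(("Example (disambiguation)", 40, True))

# Drop disambiguation pages before measuring.
article_sizes = [size for _title, size, is_disambig in pages if not is_disambig]

for fraction in (0.001, 0.01, 0.10):
    limit = cutoff_size(article_sizes, fraction)
    print(f"shortest {fraction:.1%} of articles: up to {limit} bytes")
```

The same function gives the 0.1 percent, 1 percent and 10 percent
figures; only the fraction changes.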
In the next step, I found 1.0 percent of articles (2700 articles)
were shorter than 126 bytes, which was far worse than any other
language I looked at. But during April and May this has improved
to 171 bytes, which is better than Arabic and Estonian. During
all of April, the Swedish Wikipedia didn't grow at all, because
stubs were being merged and removed faster than new articles were
written. This is when the Russian Wikipedia, on May 19, took the
10th position with 283K articles.
By addressing the very shortest articles, we remove the easiest
excuse for people to create new very short stubs. They can no
longer point to other articles that are only 100 bytes long.
After the lower limit is pushed from 120 to 140 bytes, it will
continue to push upwards to 160 and 180 bytes.
Some might ask why we don't just remove everything shorter than
200 or 300 bytes. You are welcome to try to propose this, but so
far every such proposal has met solid resistance. Instead I
found 20 articles saying "X is an island in Y archipelago" and
merged them into a nice article on "Y archipelago" with a map, so
that everybody can agree that the new style is a real improvement.
With that sort of aggregation, it becomes more obvious that island
Z is missing or that island W doesn't really belong. Fact checking
and quality assurance are improved by merging stubs. It's all
about setting an example for the future. The average article size
will move slowly over the next few years.
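Finding candidates for such a merge can be sketched roughly as below.
The stub texts are placeholders, and the pattern and the function name
group_island_stubs are assumptions for illustration; the actual merging
was done by hand, article by article.

```python
import re
from collections import defaultdict

# Matches one-line stubs of the form "X is an island in Y archipelago".
STUB_PATTERN = re.compile(
    r"^(?P<island>.+?) is an island in (?:the )?(?P<group>.+?) archipelago\.?$"
)


def group_island_stubs(stubs):
    """Map each archipelago name to the sorted islands named in matching stubs."""
    groups = defaultdict(list)
    for text in stubs:
        match = STUB_PATTERN.match(text.strip())
        if match:
            groups[match.group("group")].append(match.group("island"))
    return {group: sorted(islands) for group, islands in groups.items()}


# Placeholder stub texts, not real articles.
stubs = [
    "Island X is an island in Y archipelago.",
    "Island W is an island in Y archipelago.",
    "Island Q is an island in the R archipelago.",
    "A stub about something else entirely.",
]

merged = group_island_stubs(stubs)
for group, islands in merged.items():
    print(f"{group} archipelago: {', '.join(islands)}")
```

Once the islands of one archipelago are listed side by side, a missing
or misplaced one is much easier to spot.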
The next medium-term goal is to look at the 10 percent shortest
articles (27K of them), the "lower decile". For the Swedish
Wikipedia, this has moved from 312 bytes in April to 349 bytes in
June. For the Lithuanian, Estonian, Arabic, Danish Wikipedias it
is around 400 bytes. Norwegian and Icelandic are around 500,
Polish at 638. The French, Finnish and Hungarian are around 700.
Latvian has 808, Czech 917, German 1081 and the Russian Wikipedia
has 10 percent of its articles shorter than 1305 bytes.
If you believe the Polish 638 bytes is really bad, it's just
because you haven't looked closely enough at Danish and Norwegian.
In fact, it's not so much worse than the French 682 bytes. You are
welcome to make fun of the Swedish stubs, because it is an area
where we can show constant improvement.
This hunt for "substubs" is just one of several quality
improvement initiatives currently running on the Swedish
Wikipedia.
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se