[Foundation-l] Tragical dynamics: that run for the number of articles
Lars Aronsson
lars at aronsson.se
Sun Jun 29 00:03:38 UTC 2008
Tomasz Ganicz wrote:
> And if there is no clear definition of what is "real" article
> and what is not,
Apparently it was the 500K article event that caused Ziko to bring
the topic up this time. He's frustrated (and so am I) that 500K
articles is reported as an achievement when the quality of these
articles is doubtful. Still, I think he exaggerates the problem.
Earlier this year, when the topic came up on meta, the question
was which languages should be featured as the top 10 on
www.wikipedia.org,
http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias
Since then, the Russian Wikipedia has gained the 10th position and
Swedish ("the one with all the stubs") is down to 11th, so there
is one problem less to care about. During that discussion, I
proposed to use the size of the compressed database dump
(pages-articles.xml.bz2) as the official metric, since it both
counts the total database size (one long article counts the same
as two short ones) and completely removes the impact of
bot-generated articles. The compressed size of the Volapük
Wikipedia is very small, because the same patterns appear in many
of its numerous articles.
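
To illustrate why templated text compresses so well, here is a
rough sketch in Python. Everything in it is made up for the
demonstration: the stub template, the village and region names,
and the random-letter string, which is only a crude stand-in for
text that does not repeat itself. It says nothing about the real
dumps, it only shows the effect.

```python
import bz2
import random


def compressed_ratio(text: str) -> float:
    """Return compressed size divided by raw size (smaller = more repetitive)."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


# Hypothetical bot-style stubs: the same sentence pattern, only the
# names and numbers change from one "article" to the next.
TEMPLATE = "{name} is a village in {region}. It has {pop} inhabitants.\n"
stubs = "".join(
    TEMPLATE.format(name=f"Village {i}", region=f"Region {i % 20}", pop=100 + i)
    for i in range(2000)
)

# Crude stand-in for non-repetitive text: random letters of the same length.
rng = random.Random(0)
varied = "".join(
    rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(len(stubs))
)

stub_ratio = compressed_ratio(stubs)
varied_ratio = compressed_ratio(varied)
print(f"templated stubs: {stub_ratio:.3f}")
print(f"varied text:     {varied_ratio:.3f}")
```

Both strings have the same raw length, but the templated one
shrinks far more, so 2000 such stubs add much less to the
compressed total than the same amount of varied text.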
On the talk page, there is a table where this is shown, and you
can sort by column by clicking the little boxes,
http://meta.wikimedia.org/wiki/Talk:Top_Ten_Wikipedias#What_problem_do_we_want_to_solve
I'd like to propose a quality metric: the rank by article count
minus the rank by compressed database size.
The English Wikipedia is the biggest (rank 1), whether you count
articles or compressed database size. So its quality is 0.
The Polish Wikipedia was the 4th by article count, but the 7th by
compressed database size, for a quality of 4 - 7 = -3.
The Swedish Wikipedia was (when this table was compiled) the 10th
biggest by article count, but the 12th biggest by compressed
database size, so its quality is 10 - 12 = -2.
The Russian Wikipedia was the 11th by article count, but 9th by
compressed database size, so its quality is +2. This doesn't mean
the Russian Wikipedia is better than the English one, only that it
is better than (two of) its peers of similar size.
The Volapük Wikipedia was the 15th by article count, but worse
than the 30th by compressed database size (the table is
incomplete), so its quality is worse than -15.
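
The arithmetic above can be written as a small Python sketch. The
function name and the dictionary are only my illustration; the
ranks are the ones quoted in this message. Volapük is left out
because only a bound on its size rank is known.

```python
def rank_quality(rank_by_articles: int, rank_by_size: int) -> int:
    """Rank by article count minus rank by compressed dump size.

    Positive: the wiki ranks higher by size than by article count.
    Negative: it ranks higher by article count than by size.
    """
    return rank_by_articles - rank_by_size


# (rank by article count, rank by compressed database size)
ranks = {
    "English": (1, 1),
    "Polish": (4, 7),
    "Swedish": (10, 12),
    "Russian": (11, 9),
}

quality = {lang: rank_quality(a, s) for lang, (a, s) in ranks.items()}

for lang, q in quality.items():
    print(f"{lang}: {q:+d}")
```

Since the metric is a difference of ranks, it only compares a
wiki with its neighbours of similar size, as noted for Russian
above.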
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se