Tomasz Ganicz wrote:
And if there is no clear definition of what is "real" article and what is not,
Apparently it was the 500K article event that prompted Ziko to bring the topic up this time. He's frustrated (and so am I) that 500K articles is reported as an achievement when the quality of those articles is doubtful. Still, I think he exaggerates the problem.
Earlier this year, when the topic came up on meta, the question was which languages should be featured as the top 10 on www.wikipedia.org, http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias
Since then, the Russian Wikipedia has gained the 10th position and Swedish ("the one with all the stubs") is down to 11th, so there is one problem less to care about. During that discussion, I proposed to use the size of the compressed database dump (pages-articles.xml.bz2) as the official metric, since it both counts the total database size (one long article counts the same as two short ones) and it completely removes the impact of bot-generated articles. The compressed size of the Volapük Wikipedia is very small, because the same patterns appear in many of its numerous articles.
On the talk page, there is a table where this is shown, and you can sort by column by clicking the little boxes, http://meta.wikimedia.org/wiki/Talk:Top_Ten_Wikipedias#What_problem_do_we_wa...
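To make the compression argument concrete, here is a rough sketch using Python's bz2 module. The stub template, the town and region names, and the random-letter text are all made up for illustration; random letters are only a crude stand-in for hand-written prose (real prose compresses better than random letters, but it does not repeat one template thousands of times).

```python
import bz2
import random


def compressed_size(text: str) -> int:
    """Size in bytes of the bz2-compressed UTF-8 encoding of text."""
    return len(bz2.compress(text.encode("utf-8")))


# Hypothetical bot-style stubs: one template, only the names and numbers vary.
stubs = [
    f"Town{i} is a municipality in Region{i % 7} with {1000 + i} inhabitants.\n"
    for i in range(2000)
]
templated = "".join(stubs)

# Crude stand-in for varied text of the same raw length: seeded random letters.
rng = random.Random(0)
varied = "".join(rng.choices("abcdefghijklmnopqrstuvwxyz ", k=len(templated)))

templated_size = compressed_size(templated)
varied_size = compressed_size(varied)

print("raw length (both):   ", len(templated))
print("compressed, stubs:   ", templated_size)
print("compressed, varied:  ", varied_size)
```

Both inputs have the same raw length, so a metric based on raw size would treat them as equal, while the compressed size of the templated stubs comes out much smaller.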
I'd like to propose a quality metric: The difference in rank between the article count and the compressed database size.
The English Wikipedia is the biggest (rank 1), whether you count articles or compressed database size. So its quality is 0.
The Polish Wikipedia was the 4th by article count, but the 7th by compressed database size, for a quality of 4 - 7 = -3.
The Swedish Wikipedia was (when this table was compiled) the 10th biggest by article count, but the 12th biggest by compressed database size, so its quality is 10 - 12 = -2.
The Russian Wikipedia was the 11th by article count, but 9th by compressed database size, so its quality is +2. This doesn't mean the Russian Wikipedia is better than the English one, only that it is better than (two of) its peers of similar size.
The Volapük Wikipedia was the 15th by article count, but worse than the 30th by compressed database size (the table is incomplete), so its quality is worse than -15.
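The arithmetic above can be written as a short sketch. The ranks are the ones quoted in this message; the function and variable names are my own, and the Volapük figure is only a bound, since its exact rank by compressed size is not in the table.

```python
def quality(article_rank: int, size_rank: int) -> int:
    """Proposed metric: rank by article count minus rank by compressed dump size."""
    return article_rank - size_rank


# (rank by article count, rank by compressed database size), as quoted above.
ranks = {
    "English": (1, 1),
    "Polish": (4, 7),
    "Swedish": (10, 12),
    "Russian": (11, 9),
}

scores = {name: quality(a, s) for name, (a, s) in ranks.items()}

# Volapük: 15th by article count, worse than 30th by compressed size,
# so the true value is strictly below this bound.
volapuk_upper_bound = quality(15, 30)

for name, score in scores.items():
    print(f"{name}: {score:+d}")
print(f"Volapük: worse than {volapuk_upper_bound}")
```

A negative value means the wiki ranks lower by compressed size than by article count, a positive value the opposite.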