Hi Danny,
I hope it's OK to forward this message to wikitech-l and
wiki-research-l, because these are general research questions. You wrote:
I think this is all an invaluable discussion, but it
is being blurred
by statements that have no basis in actual fact. "Most of our
articles," "the vast majority of our articles," "only a very small
percentage of our articles" all mean nothing.
I would very much like to see a comprehensive report by the Research
Committee about the current state of Wikipedia.
It probably won't help a lot, but at least I can tell you which measures
*can* be determined, and how, from my point of view. That means if I
had the time, I could do it (sorry, this is frustrating for me too).
1. How many articles are being created currently?
Can be measured from the recentchanges table. There is no dump of this
table yet, and you cannot use the XML dump alone because it does not
contain deletions. I also think the table does not contain all edits
from the beginning.
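As a rough sketch of the counting itself (the field names here are
illustrative, not the exact MediaWiki schema - think of each row as a
timestamp, a namespace, and a new-page flag pulled from recentchanges):

```python
from collections import Counter
from datetime import date

# Hypothetical rows as they might come out of the recentchanges table:
# (day, namespace, is_new_page). Namespace 0 is the article namespace.
rows = [
    (date(2005, 12, 1), 0, True),
    (date(2005, 12, 1), 0, False),  # an edit, not a creation
    (date(2005, 12, 1), 1, True),   # a talk page, not an article
    (date(2005, 12, 2), 0, True),
]

def creations_per_day(rows):
    """Count newly created pages in the article namespace, per day."""
    counts = Counter()
    for day, namespace, is_new in rows:
        if namespace == 0 and is_new:
            counts[day] += 1
    return counts

print(creations_per_day(rows))  # one article creation on each day
```

The real work is in getting clean input rows (deletions included), not
in this counting step.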
2. What is the actual proportion of anon v. registered
users.
Can be measured from the recentchanges table too. If you ignore the
deleted pages, it can also be measured from the full XML dump, as I did
for the German Wikipedia and Erik Zachte did for all Wikipedias:
http://en.wikipedia.org/wikistats/EN/TablesWikipediaEN.htm
There you can see that 28% of edits on the English Wikipedia are
anonymous (but you may be interested in how this value changes over
time, so recentchanges is needed).
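In the XML export an anonymous contributor has an <ip> element instead
of a <username>, so counting is simple. A minimal sketch (using a tiny
inline stand-in for a dump; a real dump is much larger, should be
stream-parsed, and carries an XML namespace that this toy input omits):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a MediaWiki XML dump: anonymous edits carry
# an <ip> in <contributor>, registered edits a <username>.
dump = """<mediawiki>
  <page>
    <revision><contributor><ip>127.0.0.1</ip></contributor></revision>
    <revision><contributor><username>Alice</username></contributor></revision>
  </page>
  <page>
    <revision><contributor><username>Bob</username></contributor></revision>
  </page>
</mediawiki>"""

def anon_share(xml_text):
    """Fraction of revisions whose contributor is an IP address."""
    anon = total = 0
    for contrib in ET.fromstring(xml_text).iter("contributor"):
        total += 1
        if contrib.find("ip") is not None:
            anon += 1
    return anon / total

print(round(anon_share(dump), 2))  # 0.33 (1 anonymous edit of 3)
```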
3. How has the experiment on English impacted
Wikipedia.
In the press or by content? ;-)
4. What percentage of our articles are classified as
stubs?
There is a dump of categorylinks where you can easily count them. I'll
try to do this tomorrow.
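The idea, sketched in Python: stub templates put articles into stub
categories, so you count the distinct pages that appear in any category
whose name ends in "stubs". The sample data below is made up, and
matching by name suffix is a heuristic, not an official marker:

```python
# Illustrative (page_id, category_name) pairs as in the categorylinks
# dump; category names are given without the "Category:" prefix.
links = [
    (1, "Stubs"),
    (1, "Physics"),
    (2, "History_stubs"),
    (3, "Physics"),
]

def stub_pages(links):
    """Pages that are in at least one stub category (by name suffix)."""
    return {page for page, cat in links if cat.lower().endswith("stubs")}

total_pages = len({page for page, _ in links})
share = len(stub_pages(links)) / total_pages
print(share)  # 2 of the 3 pages are stubs
```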
5. What percentage of our articles contain incorrect
information?
I guess 100%. Please specify "incorrect".
6. What percentage of our articles contain
pornographic images in
their histories?
You need to track the deletion log (again, there is no dump), but I
don't know how to determine whether a deleted image was pornographic or
not.
7. What percentage of our articles are copied verbatim
from other sources?
This is also difficult; I have to think about it.
8. What percentage of our images are copyrighted?
Tagged or untagged? This also depends a lot on which law you consider.
9. What percentage of edits are vandalism?
This could be measured with the full dump or the recentchanges log. At
least you can count the number of probable reverts by looking at the
edit comments, but then you only get reverted vandalism (better than
nothing).
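A sketch of that comment heuristic, assuming common English revert
conventions ("revert", "reverted", "rv", "rvv") - the pattern is a
guess at community habits, not an official MediaWiki marker, so it
misses reverts with other comments and manual cleanups:

```python
import re

# Heuristic: an edit comment that mentions reverting probably undoes
# vandalism. Tune the pattern per language edition.
REVERT_RE = re.compile(r"\b(revert(ed|ing)?|rvv?)\b", re.IGNORECASE)

comments = [
    "rv vandalism",
    "Reverted edits by 127.0.0.1",
    "fix typo",
    "add references",
]

reverts = [c for c in comments if REVERT_RE.search(c)]
print(len(reverts), "of", len(comments))  # 2 of 4 look like reverts
```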
10. What percentage of our articles are of Featured
Article quality?
http://en.wikipedia.org/wiki/Wikipedia:Featured_article_statistics
Etc., etc., etc.
I would also like to see practical suggestions that can be
implemented in the immediate future to make the necessary changes.
Ideally, this would include a timeline.
Research is being done, but very slowly, because there are so many
possibilities and most wiki researchers do it in their own time (like
Wikipedians), so there are no deadlines (except your master's thesis
deadline and things like that) and no specific goals. It's freedom of
research. I play around a lot with Wikipedia statistics and research
ideas, but it's more for fun - writing it all down and analysing it in
detail is work.
Greetings,
Jakob
P.S.: By the way, the best paper about the quality of Wikipedia has
already been published, since Wikimania. Andreas Brändle has now
finished his thesis, but it's in German and not published yet.
http://en.wikibooks.org/wiki/Wikimania05/Paper-AB1