Hi Danny,
I hope it's OK to forward this message to wikitech-l and wiki-research-l, because these are general research questions. You wrote:
I think this is all an invaluable discussion, but it is being blurred by statements that have no basis in actual fact. "Most of our articles," "the vast majority of our articles," "only a very small percentage of our articles" all mean nothing.
I would very much like to see a comprehensive report by the Research Committee about the current state of Wikipedia.
It probably won't help a lot, but at least I can tell you which measures *can* be determined, and how, from my point of view. That means that if I had the time, I could do it (sorry, this is frustrating for me too).
- How many articles are being created currently.
This can be measured from the recentchanges table. Up to now there is no dump of it, and you cannot rely on the XML dump alone because it does not contain deletions. I also think that this table does not contain all edits from the beginning.
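If you have direct database access (e.g. on the toolserver), a rough sketch of the query in Python with MySQLdb; the connection parameters are of course just placeholders:

import MySQLdb

# Placeholders for whatever database replica you can reach.
conn = MySQLdb.connect(host="localhost", user="research", db="enwiki")
cur = conn.cursor()

# rc_new = 1 marks page creations, rc_namespace = 0 restricts to articles.
cur.execute("""
    SELECT LEFT(rc_timestamp, 8) AS day, COUNT(*)
    FROM recentchanges
    WHERE rc_new = 1 AND rc_namespace = 0
    GROUP BY day ORDER BY day
""")
for day, count in cur.fetchall():
    print("%s %d" % (day, count))

Remember that recentchanges only covers a limited time window, so you get a rate of creation, not a complete history.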
- What is the actual proportion of anon v. registered users.
This can be measured from the recentchanges table too. If you ignore the deleted pages, it can also be measured from the full XML dump, as I did for the German Wikipedia and Erik Zachte did for all Wikipedias:
http://en.wikipedia.org/wikistats/EN/TablesWikipediaEN.htm
There you can see that 28% of edits to the English Wikipedia are anonymous (but you may be interested in how this value changes over time, so recentchanges is needed).
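If you want to count this yourself from the full XML dump, here is a minimal Python sketch (the file name is only an example, and the schema namespace may differ depending on the dump version):

import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # adjust to your dump's schema version
anon = registered = 0

# Stream through every <revision> and look at its <contributor>:
# anonymous edits have an <ip> child, registered ones a <username>.
for event, elem in ET.iterparse("enwiki-pages-meta-history.xml"):
    if elem.tag == NS + "revision":
        contrib = elem.find(NS + "contributor")
        if contrib is not None:
            if contrib.find(NS + "ip") is not None:
                anon += 1
            else:
                registered += 1
        elem.clear()  # throw away the revision text, the dump is huge

total = anon + registered
print("%d of %d edits (%.1f%%) are anonymous" % (anon, total, 100.0 * anon / total))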
- How has the experiment on English impacted Wikipedia.
In the press, or in the content? ;-)
- What percentage of our articles are classified as stubs?
There is a dump of the categorylinks table where you can easily count them. I'll try to do this tomorrow.
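Roughly, I would do something like the following sketch over the categorylinks SQL dump (the file name and the assumption that stub categories are called "Stubs" or end in "_stubs" are mine, so take it with a grain of salt); dividing the result by the total article count gives the percentage:

import gzip
import re

# The categorylinks dump is one long series of INSERT statements;
# in each value tuple, cl_from is the page id and cl_to the category name.
row = re.compile(r"\((\d+),'([^']*)'")
stub_pages = set()

for line in gzip.open("enwiki-categorylinks.sql.gz", "rt", encoding="utf-8", errors="replace"):
    if not line.startswith("INSERT INTO"):
        continue
    for page_id, category in row.findall(line):
        if category == "Stubs" or category.endswith("_stubs"):
            stub_pages.add(page_id)

print("%d pages in stub categories" % len(stub_pages))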
- What percentage of our articles contain incorrect information?
I guess 100%. Please specify "incorrect".
- What percentage of our articles contain pornographic images in
their histories?
You would need to track the deletion log (again, there is no dump of it), but I don't know how to determine whether a deleted image was pornographic or not.
- What percentage of our articles are copied verbatim from other sources?
This is also difficult; I have to think about it.
- What percentage of our images are copyrighted.
Tagged or untagged? This also depends a lot on which law you consider.
- What percentage of edits are vandalism?
This could be measured with the full dump or the recent changes log. At least you can count the number of probable reverts by looking at the edit comments, but then you only catch reverted vandalism (better than nothing).
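For the comment-based count, a sketch along the same lines as the dump-parsing example above (the regular expression is only my guess at typical revert summaries, so it will both miss some reverts and overcount others):

import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # adjust to your dump's schema version
revert = re.compile(r"\b(revert(ed)?|rv)\b", re.IGNORECASE)
reverts = edits = 0

# Count revisions whose edit comment looks like a revert summary.
for event, elem in ET.iterparse("enwiki-pages-meta-history.xml"):
    if elem.tag == NS + "revision":
        edits += 1
        comment = elem.findtext(NS + "comment") or ""
        if revert.search(comment):
            reverts += 1
        elem.clear()

print("%d of %d edits (%.2f%%) look like reverts" % (reverts, edits, 100.0 * reverts / edits))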
- What percentage of our articles are of Featured Article quality?
http://en.wikipedia.org/wiki/Wikipedia:Featured_article_statistics
Etc., etc., etc.
I would also like to see practical suggestions that can be implemented in the immediate future to make the necessary changes. Ideally, this would include a timeline.
Research is being done, but very slowly, because there are so many possibilities and most wiki researchers do it in their own time (like Wikipedians), so there are no deadlines (except your master's thesis deadline and things like that) and no specific goals. It's freedom of research. I play around a lot with Wikipedia statistics and research ideas, but it's more for fun; writing it down and analysing it in all its details is work.
Greetings, Jakob
P.S.: By the way, the best paper about the quality of Wikipedia has already been available since Wikimania. Andreas Brändle has now finished his thesis, but it's in German and not published yet. http://en.wikibooks.org/wiki/Wikimania05/Paper-AB1