Shilad,
Very cool! Thanks for sharing. I do have a couple of questions...
On Sun, Apr 22, 2012 at 7:15 PM, Shilad Sen <ssen(a)macalester.edu> wrote:
Greetings!
I'm a CS Professor at Macalester College in St. Paul and I'm on research
sabbatical at GroupLens this year. I've been working with Heather Ford and
Dave Musicant to explore several research questions related to citation use
on Wikipedia.
We're still in the middle of analyzing data, and working through parsing
lots of messy forms of citation references. However, I'll summarize our
findings as they stand.
As of Jan 1, 2011 there are 6384425 total citations in the main namespace
for English Wikipedia.
Does this count both templated and non-templated citations? Do you
count citations appearing in any area of the article (e.g. inline
footnotes, "references" section, "further reading" or "bibliography"
section, and "external links")? Or is anything left out?
Our top-line research questions focus on citations containing URLs, so we
broke down our results into citations with a URL (78%) and those without
(22%).
The top 5 domains in citations with a URL are:
1. books.google.com (73777 - 1.48%)
2. news.bbc.co.uk (52347 - 1.05%)
3. www.stat.gov.pl (51598 - 1.03%)
4. www.nytimes.com (39454 - 0.79%)
5. www.imdb.com (24993 - 0.50%)
This will probably be part of your published results, but it would be
very interesting to see a long-tail list of these domains, and maybe
try and break them out into types -- that would start to get at
questions like how many paywalled journals are cited, etc.
The top 5 types of citations without a URL are:
1. cite book (190090 - 13.65%)
2. citation needed (148339 - 10.65%)
3. cite journal (63722 - 4.58%)
4. cite news (25052 - 1.80%)
5. citation (22773 - 1.64%)
"Citation needed" is really the absence of a citation, not an actual
citation, right? :) The others look like the standard reference
templates, so my question above about templates applies.
We have also looked at the *inequality* in citation domains. In other words,
what share of citations do the most popular domains receive? Citation
inequality has been steadily growing; the Gini coefficient grew from 0.63 in
Jan 2007 to 0.81 in Nov 2011.
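For readers unfamiliar with the measure, here is a minimal sketch of how a Gini coefficient over per-domain citation counts could be computed. It is not necessarily the method used in the study: the function name `gini` and the toy counts are hypothetical, and the formula is the standard one for sorted, non-negative values (0 means every domain receives the same number of citations; values near 1 mean a few domains receive almost all of them).

```python
def gini(counts):
    """Gini coefficient of non-negative counts (hypothetical helper)."""
    values = sorted(counts)          # ascending order is required by the formula
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0                   # no data or no citations: treat as perfectly equal
    # G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n, with ranks starting at 1
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Toy data, not taken from the study: citations per domain.
equal_counts = [5, 5, 5, 5]          # every domain cited equally often
skewed_counts = [1, 1, 1, 97]        # one domain dominates

print(gini(equal_counts))
print(gini(skewed_counts))
```

The skewed list yields a much higher coefficient than the equal one, which is the sense in which a rise from 0.63 to 0.81 indicates citations concentrating on fewer domains.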
Interesting! Thanks so much for sharing!
-- phoebe