Sure! At a high level, our process for matching citations is:

1. Look for any template whose name begins with "cite" or "citation". We search through at most one nested level of templates.

2. For each "<ref>" tag:
    * For each wiki link and URL inside the ref tag:
         * Add any links that haven't already been added as part of a citation template.
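
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the CitationParser.java code linked below: the regular expressions, the function names (citation_templates, links_in, find_citations), and the choice to return template bodies and bare links in one list are all assumptions made for the example.

```python
import re

# One template, allowing templates nested one level deep inside it.
TEMPLATE = re.compile(r"\{\{((?:[^{}]|\{\{[^{}]*\}\})*)\}\}")
# A template with no further nesting (used to look inside an outer template).
INNER = re.compile(r"\{\{([^{}]*)\}\}")
# <ref ...>...</ref>, skipping self-closing <ref ... /> tags.
REF = re.compile(r"<ref\b[^>]*(?<!/)>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
URL = re.compile(r"https?://[^\s|\]}<]+")
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def is_citation(body):
    """True if the template name (text before the first '|') starts with cite/citation."""
    name = body.split("|", 1)[0].strip().lower()
    return name.startswith(("cite", "citation"))


def citation_templates(wikitext):
    """Step 1: citation templates at the outer level or one nested level down."""
    found = []
    for outer in TEMPLATE.finditer(wikitext):
        body = outer.group(1)
        candidates = [body] + [m.group(1) for m in INNER.finditer(body)]
        found.extend(c for c in candidates if is_citation(c))
    return found


def links_in(text):
    """URLs first, then wiki link targets, found in a piece of wikitext."""
    return URL.findall(text) + [t.strip() for t in WIKILINK.findall(text)]


def find_citations(wikitext):
    """Steps 1 and 2: citation templates, plus links in <ref> tags not already seen."""
    citations = citation_templates(wikitext)
    seen = {link for c in citations for link in links_in(c)}
    for ref in REF.finditer(wikitext):
        for link in links_in(ref.group(1)):
            if link not in seen:
                seen.add(link)
                citations.append(link)
    return citations


sample = (
    "Text.<ref>{{cite web |url=http://example.org/a |title=A}}</ref> "
    "More.<ref>See [[Some Page]] and http://example.org/b</ref> "
    "{{citation needed}}"
)
for citation in find_citations(sample):
    print(citation)
```

On the sample, the cite web template and the citation needed template are each counted once, the URL inside the cite web template is not counted a second time, and the two bare links in the second ref tag are added as separate citations.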

We spot-checked the procedure on a sample of randomly selected pages, and it does miss some things that a human would recognize as citations, for example a wiki list of URLs under the "References" section that is not inside <ref> tags. But citations are messy, and we decided to draw the line at these few simple steps for now. If you have any suggestions for extra rules to add to the above procedure, I'd love to include them!

FYI, you can browse the current code at: http://code.google.com/p/wikipedia-map-reduce/source/browse/trunk/wikipedia-map-reduce/src/wmr/citations/CitationParser.java

On Sun, Apr 22, 2012 at 10:04 PM, Ward Cunningham <ward@c2.com> wrote:
Shilad -- I was on the verge of parsing a recent dump using my Exploratory Parsing tools. I'd be interested in duplicating your work to see if I could get the same numbers. Are you willing to share the criteria you have for citations, even if messy? Thanks and best regards. -- Ward


On Apr 22, 2012, at 7:15 PM, Shilad Sen wrote:

Greetings! 

I'm a CS Professor at Macalester College in St. Paul and I'm on research sabbatical at GroupLens this year. I've been working with Heather Ford and Dave Musicant to explore several research questions related to citation use on Wikipedia.

We're still in the middle of analyzing data, and working through parsing lots of messy forms of citation references. However, I'll summarize our findings as they stand.

As of Jan 1, 2011 there are 6384425 total citations in the main namespace for English Wikipedia.

Our top-line research questions focus on citations containing URLs, so we broke down our results into citations with a URL (78%) and those without (22%).

The top 5 domains in citations with a URL are:
1. books.google.com (73777 - 1.48%)
2. news.bbc.co.uk (52347 - 1.05%)
3. www.stat.gov.pl (51598 - 1.03%)
4. www.nytimes.com (39454 - 0.79%)
5. www.imdb.com (24993 - 0.50%)

The top 5 types of citations without a URL are:
1. cite book (190090 - 13.65%)
2. citation needed (148339 - 10.65%)
3. cite journal (63722 - 4.58%)
4. cite news (25052 - 1.80%)
5. citation (22773 - 1.64%)

We have also looked at the *inequality* in citation domains. In other words, what share of citations do the most popular domains receive? Citation inequality has been steadily growing; the Gini coefficient grew from 0.63 in Jan 2007 to 0.81 in Nov 2011.
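
For readers who want to reproduce an inequality figure like this, here is a minimal sketch of a Gini computation. It assumes the input is one citation count per domain and uses the standard rank-weighted formula; the message above does not say which estimator was used, so treat the function (gini) and the toy inputs as illustrative only.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means every domain gets the same share; values near 1 mean a few
    domains receive almost all citations.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form with ranks 1..n over ascending values:
    #   G = 2 * sum(rank * x) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Toy inputs (made up for illustration, not taken from the dump).
equal_shares = [5, 5, 5, 5]      # every domain cited equally often
one_dominant = [0, 0, 0, 10]     # a single domain receives every citation
print(gini(equal_shares))
print(gini(one_dominant))
```

With this estimator the maximum for n domains is (n - 1) / n, so the one-dominant toy case with four domains comes out at 0.75 rather than 1.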

We hope to write up our results to share them formally in the not-too-distant future. Until then, I'm happy to answer questions!

-Shilad




On Sun, Apr 22, 2012 at 3:30 PM, phoebe ayers <phoebe.wiki@gmail.com> wrote:
Joe: that's the same question the alt-metrics people were getting at
in the paper I posted earlier... does being cited in WP give you a
measurable citations boost? Does the same boost carry over even if the
work is only in print or behind a paywall vs open access? Or, does it
have an effect at all on citations (versus *viewings*, where the
additional exposure in WP must make a difference) since most published
scholarship depends on larger lit reviews than are typically done in
WP -- and there is a larger filter effect at work among the body of
already published literature that may have a stronger effect on what
gets cited and what doesn't?

From the "building an encyclopedia" perspective, it's in Wikipedia's
interest to cite general works, famous works, and review papers/works,
which already are more likely to be more highly cited than average
research papers. So one possibility is any citation boost from
citations in WP just reinforces existing citation trends.

-- phoebe


On Sat, Apr 21, 2012 at 11:01 AM, Joe Corneli <holtzermann17@gmail.com> wrote:
> Another interesting question (that would take a broader scope) would
> look at the frequency of *downstream* citations to works that are
> cited in Wikipedia versus all other citations.
>
> Per usual, correlation would definitely not be causation, BUT my usual
> practice is
>
> 1. Google
> 2. Wikipedia
> 3. Read some of the papers cited
> 4. (A miracle occurs)
> 5. Write my own paper
>
> Interesting to wonder how many of the papers read at step 3 survive
> the semantic leap to step 5.
>
> On Sat, Apr 21, 2012 at 5:52 PM, phoebe ayers <phoebe.wiki@gmail.com> wrote:
>> Thank you everyone! Grouplens folks, if you could send a link to your
>> work too, that would be awesome.
>>
>> What I'm curious about: if we can give (within an order of magnitude,
>> say) an approximation of how many sources are cited within Wikipedia
>> -- then maybe broken out into references to printed works and
>> references to online-only, etc. What does our project look like viewed
>> as an ad-hoc catalog of scholarship? How does that compare to the
>> major databases? (It's going to be a tiny, tiny percentage of the
>> total scholarship in the world -- Pubmed has 21M records, Worldcat
>> around 246M -- but how tiny?) This may only be answerable if someone
>> creates a wikicite project :)
>>
>> thanks,
>> phoebe
>>
>> On Sat, Apr 21, 2012 at 7:52 AM, Paolo Massa <paolo@gnuband.org> wrote:
>>> I know of this paper
>>> "Scientific citations in Wikipedia" by Finn Årup Nielsen
>>> First Monday, volume 12, number 8 (August 2007),
>>> URL: http://firstmonday.org/issues/issue12_8/nielsen/index.html
>>>
>>> but, as the title says, it took into account only citations to
>>> scientific journals.
>>>
>>> On Fri, Apr 20, 2012 at 7:31 PM, phoebe ayers <phoebe.wiki@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> Has there been any research done into: the number of citations (e.g.
>>>> to books, journal articles, online sources, everything together) on
>>>> Wikipedia (any language, or all)? The distribution of citations over
>>>> different kinds or qualities of articles? # of uses of citation
>>>> templates? Anything like this?
>>>>
>>>> I realize this is hard to count, averages are meaningless in this
>>>> context, and any number will no doubt be imprecise! But anything would
>>>> be helpful. I have vague memories of seeing some citation studies like
>>>> this but don't remember the details.
>>>>
>>>> Thanks,
>>>> -- phoebe
>>>>
>>>> --
>>>> * I use this address for lists; send personal messages to phoebe.ayers
>>>> <at> gmail.com *
>>>>
>>>> _______________________________________________
>>>> Wiki-research-l mailing list
>>>> Wiki-research-l@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>
>>>
>>>
>>
>







--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen@macalester.edu
651-696-6273



