Hey folks,
Dario and I just updated the scholarly citations dataset to include Digital Object Identifiers. We found 742k citations (524k unique DOIs) in 172k articles. Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
http://dx.doi.org/10.6084/m9.figshare.1299540
Like the dataset we released for PubMed Identifiers, this dataset includes the first known occurrence of each DOI citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia.
Feel free to share this with anyone interested via: https://twitter.com/WikiResearch/status/564908585008627712
We'll be organizing our own work and analysis of these citations here: https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wiki...
-Aaron
On 9 February 2015 at 22:59, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
All on en.Wikipedia?
Do DOIs not include check digits? We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
Hey Andy,
On Feb 9, 2015, at 4:24 PM, Andy Mabbett andy@pigsonthewing.org.uk wrote:
On 9 February 2015 at 22:59, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
All on en.Wikipedia?
Correct, we haven't looked at other projects for this release.
Do DOIs not include check digits?
They don't. Validation can be done via the CrossRef API or the DOI resolver, but neither method is 100% reliable, especially when DOIs include special characters. CrossRef advises using a 200 HTTP response code from the resolver with a noredirect flag (e.g. http://dx.doi.org/{doi}?noredirect=true) as an indication that the DOI is valid and resolves.
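Something like this Python sketch implements that check (the requests library, the timeout, and the URL-quoting step are illustrative choices, not part of CrossRef's advice; the quoting is there for the special-character caveat):

    import requests
    from urllib.parse import quote

    def doi_resolves(doi):
        # Percent-encode the DOI (keeping its slashes) so special
        # characters don't break the request, then ask the resolver
        # not to redirect: a 200 means the handle is registered.
        url = "http://dx.doi.org/" + quote(doi, safe="/")
        resp = requests.get(url, params={"noredirect": "true"}, timeout=10)
        return resp.status_code == 200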
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors. We'll share the list of broken DOIs so they can be reviewed and fixed.
Dario
On 10 February 2015 at 00:40, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Do DOIs not include check digits?
They don't. Validation can be done via the CrossRef API or the DOI resolver, but neither method is 100% reliable, especially when DOIs include special characters. CrossRef advises using a 200 HTTP response code from the resolver with a noredirect flag (e.g. http://dx.doi.org/{doi}?noredirect=true) as an indication that the DOI is valid and resolves.
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors.
Thank you. I was thinking more of ensuring that check digits were verified by the template code, but that's clearly not possible. It's also not possible for a template to check a DOI by an HTTP request, though a bot could do so; see the sketch below.
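For instance, a bot pass could pull the doi= parameters out of citation templates with mwparserfromhell and run each one through the resolver check above; a rough sketch (the per-template bookkeeping is illustrative, not existing bot code):

    import mwparserfromhell

    def dois_by_template(wikitext):
        # Yield (template name, DOI) pairs for every template on the
        # page that carries a doi= parameter, so broken DOIs can be
        # traced back to the template they came from.
        code = mwparserfromhell.parse(wikitext)
        for tpl in code.filter_templates():
            if tpl.has("doi"):
                yield str(tpl.name).strip(), str(tpl.get("doi").value).strip()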
We’ll share the list of broken DOIs so they can be reviewed and fixed.
Thank you.
On 10 February 2015 at 00:40, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors. We'll share the list of broken DOIs so they can be reviewed and fixed.
FWIW, it looks like it might be possible to automatically fix a good proportion of them. Glancing through the list, I saw quite a few entries like:
10.1046/j.1095-8339.2003.00157.x/abs/
10.1111/1532-7795.1301001/enhancedabs/
Both represent a valid DOI plus extra text. These were probably copy-pasted from the publisher's URL, which often uses the DOI plus a suffix like pdf, abstract, etc., and it's an easy mistake to make. About a fifth of the errors match this pattern: 4000 of the entries have /abs* in them, and 1200 have /pdf.
This suggests we could try automatically trimming broken DOIs that match this pattern and checking whether the trimmed form resolves. If so, the odds are good that it's the intended DOI...
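A sketch of that repair pass in Python (the suffix pattern is a guess at the common cases, and doi_resolves is the resolver check sketched earlier in the thread):

    import re

    # Trailing fragments like /abs/, /pdf, /enhancedabs/ that publisher
    # URLs append after the DOI.
    SUFFIX = re.compile(r"/(abs|pdf|full|enhanced)[^/]*/?$", re.I)

    def suggest_fix(broken_doi):
        trimmed = SUFFIX.sub("", broken_doi)
        if trimmed != broken_doi and doi_resolves(trimmed):
            return trimmed  # good odds this is the intended DOI
        return None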
One other thing that might be worth checking for is invisible characters. I couldn't spot any in this file, but I don't know if it's been sanitised in any way. I've had recurrent problems with user-provided DOIs in repository data turning out to have zero-width spaces buried somewhere in them, possibly as a result of someone copy-pasting from a PDF.
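A quick Python check for those (the character list is a non-exhaustive guess at the usual suspects):

    # Invisible characters that tend to sneak into copy-pasted DOIs:
    # zero width space/joiners, BOM, soft hyphen.
    INVISIBLES = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}

    def hidden_chars(doi):
        # Return (position, codepoint) for each invisible character found.
        return [(i, hex(ord(c))) for i, c in enumerate(doi) if c in INVISIBLES]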
Andrew.
PS: really pleased that the one September 2002 DOI is still working fine :-)
Sweet! Can I ask that we make the 2% explicitly available to wiki gnomes? :)
The remaining 2% were extracted correctly but appear to contain typos.
If you're trying to clean the data, including fixing misspellings, it sounds like http://openrefine.org/ might help. I'm happy to give it a shot.