Hey folks,
Dario and I just updated the scholarly citations dataset to include Digital Object Identifiers. We found 742k citations (524k unique DOIs) in 172k articles. Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
http://dx.doi.org/10.6084/m9.figshare.1299540
Like the dataset we released for PubMed Identifiers, this dataset includes the first known occurrence of each DOI citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia.
Feel free to share this with anyone interested via: https://twitter.com/WikiResearch/status/564908585008627712
We'll be organizing our own work and analysis of these citations here: https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wiki...
-Aaron
On 9 February 2015 at 22:59, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
All on en.Wikipedia?
Do DOIs not include check digits? We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
Hey Andy,
On Feb 9, 2015, at 4:24 PM, Andy Mabbett andy@pigsonthewing.org.uk wrote:
On 9 February 2015 at 22:59, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Our spot-checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly but appear to contain typos.
All on en.Wikipedia?
Correct, we haven't looked at other projects for this release.
Do DOIs not include check digits?
They don't. Validation can be done via the CrossRef API or the DOI resolver, but neither method is 100% reliable, especially when DOIs include special characters. CrossRef advises using a 200 HTTP response code from the resolver with a noredirect flag (e.g. http://dx.doi.org/{doi}?noredirect=true) as an indication that the DOI is valid and resolves.
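Something like this Python sketch implements that check (the requests library, the timeout, and the URL-quoting step are illustrative choices, not part of CrossRef's advice; the quoting is there for the special-character caveat):

    import requests
    from urllib.parse import quote

    def doi_resolves(doi):
        # Percent-encode the DOI (keeping its slashes) so special
        # characters don't break the request, then ask the resolver
        # not to redirect: a 200 means the handle is registered.
        url = "http://dx.doi.org/" + quote(doi, safe="/")
        resp = requests.get(url, params={"noredirect": "true"}, timeout=10)
        return resp.status_code == 200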
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors. We'll share the list of broken DOIs so they can be reviewed and fixed.
Dario
On 10 February 2015 at 00:40, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Do DOIs not include check digits?
They don't. Validation can be done via the CrossRef API or the DOI resolver, but neither method is 100% reliable, especially when DOIs include special characters. CrossRef advises using a 200 HTTP response code from the resolver with a noredirect flag (e.g. http://dx.doi.org/{doi}?noredirect=true) as an indication that the DOI is valid and resolves.
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors.
Thank you. I was thinking more of ensuring that check digits were verified by the template code, but that's clearly not possible. It's also not possible for a template to check a DOI by an HTTP request, though a bot could do so; see the sketch below.
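For instance, a bot pass could pull the doi= parameters out of citation templates with mwparserfromhell and run each one through the resolver check above; a rough sketch (the per-template bookkeeping is illustrative, not existing bot code):

    import mwparserfromhell

    def dois_by_template(wikitext):
        # Yield (template name, DOI) pairs for every template on the
        # page that carries a doi= parameter, so broken DOIs can be
        # traced back to the template they came from.
        code = mwparserfromhell.parse(wikitext)
        for tpl in code.filter_templates():
            if tpl.has("doi"):
                yield str(tpl.name).strip(), str(tpl.get("doi").value).strip()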
We’ll share the list of broken DOIs so they can be reviewed and fixed.
Thank you.
On 10 February 2015 at 00:40, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven't checked whether these errors occur systematically within specific templates, but we know the code extracted them correctly, with no parsing errors. We'll share the list of broken DOIs so they can be reviewed and fixed.
FWIW, it looks like it might be possible to automatically fix a good proportion of them. Glancing through the list, I saw quite a few entries like:
10.1046/j.1095-8339.2003.00157.x/abs/
10.1111/1532-7795.1301001/enhancedabs/
Both represent a valid DOI plus extra text. These were probably copy-pasted from the publisher's URL, which often uses the DOI plus a suffix like pdf, abstract, etc., and it's an easy mistake to make. About a fifth of the errors match this pattern: 4000 of the entries have /abs* in them, and 1200 have /pdf.
This suggests we could try automatically trimming broken DOIs that match this pattern and checking whether the trimmed form resolves. If so, the odds are good that it's the intended DOI...
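A sketch of that repair pass in Python (the suffix pattern is a guess at the common cases, and doi_resolves is the resolver check sketched earlier in the thread):

    import re

    # Trailing fragments like /abs/, /pdf, /enhancedabs/ that publisher
    # URLs append after the DOI.
    SUFFIX = re.compile(r"/(abs|pdf|full|enhanced)[^/]*/?$", re.I)

    def suggest_fix(broken_doi):
        trimmed = SUFFIX.sub("", broken_doi)
        if trimmed != broken_doi and doi_resolves(trimmed):
            return trimmed  # good odds this is the intended DOI
        return None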
One other thing that might be worth checking for is invisible characters. I couldn't spot any in this file, but I don't know if it's been sanitised in any way. I've had recurrent problems with user-provided DOIs in repository data turning out to have zero-width spaces buried somewhere in them, possibly as a result of someone copy-pasting from a PDF.
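A quick Python check for those (the character list is a non-exhaustive guess at the usual suspects):

    # Invisible characters that tend to sneak into copy-pasted DOIs:
    # zero width space/joiners, BOM, soft hyphen.
    INVISIBLES = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}

    def hidden_chars(doi):
        # Return (position, codepoint) for each invisible character found.
        return [(i, hex(ord(c))) for i, c in enumerate(doi) if c in INVISIBLES]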
Andrew.
PS: really pleased that the one September 2002 DOI is still working fine :-)
Sweet! Can I ask that we make the 2% explicitly available to wiki gnomes? :)
The remaining 2% were extracted correctly but appear to contain typos.
If you're trying to clean the data, including fixing misspellings, it sounds like http://openrefine.org/ might help. I'm happy to give it a shot.