On 10 February 2015 at 00:40, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
We should test for these in citation templates. Does your data show which templates (if any) the broken DOIs were in?
We haven’t checked whether these errors occur systematically within specific templates, but we know that the code extracted them correctly, with no parsing errors. We’ll share the list of broken DOIs so they can be reviewed and fixed.
FWIW, it looks like it might be possible to automatically fix a good proportion of them. Glancing through the list, I saw quite a few entries like:
10.1046/j.1095-8339.2003.00157.x/abs/
10.1111/1532-7795.1301001/enhancedabs/
Both represent a valid DOI + extra text. These will probably have been copy-pasted from the website URL, which often uses the DOI plus a suffix like pdf, abstract, etc., and it's an easy mistake to make. About a fifth of the errors match this pattern - 4000 of the entries have /abs* in them, and 1200 have /pdf.
It suggests we could try automatically trimming broken ones that match this pattern and seeing if the new one resolves. If so, the odds are good it's the intended DOI...
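As a rough sketch of that trim-and-retest idea in Python: the suffix list here is an assumption based only on the examples above (abs*, enhancedabs, pdf), and `resolves` is a hypothetical callback standing in for whatever check is used to see whether a DOI resolves. Nothing here is the code that produced the list.

```python
import re

# Trailing URL-style residue seen in the examples above. The exact list of
# suffixes is an assumption and would need tuning against the real data.
_URL_SUFFIX = re.compile(r"/(abs|enhancedabs|pdf)[^/]*/?$", re.IGNORECASE)


def trim_candidate(doi):
    """Return the DOI with a trailing URL-style suffix removed, or None."""
    cleaned = doi.strip()
    trimmed = _URL_SUFFIX.sub("", cleaned)
    # Only offer a candidate if something was removed and what is left
    # still has the prefix/suffix shape of a DOI.
    if trimmed != cleaned and "/" in trimmed:
        return trimmed
    return None


def repair(doi, resolves):
    """Return the trimmed DOI if it resolves, otherwise None.

    `resolves` is a hypothetical callable (str -> bool) supplied by the
    caller, e.g. a wrapper around a lookup against the DOI resolver.
    """
    candidate = trim_candidate(doi)
    if candidate is not None and resolves(candidate):
        return candidate
    return None
```

Keeping the resolution check as a callback means the trimming can be tried offline against the whole list first, and only the candidates that changed need to be sent to a resolver.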
One other thing that might be worth checking for is invisible characters - I couldn't spot any in this file, but I don't know if it's been sanitised in any way. I've had recurrent problems with user-provided DOIs in repository data turning out to have zero width spaces buried somewhere in them, possibly as a result of someone copy-pasting from a PDF.
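For the invisible-character check, something along these lines might do as a first pass. It is a sketch in Python, and it assumes that "invisible" means Unicode format characters (category Cf, which includes the zero width space), control characters (Cc), and any whitespace; the function names are made up for illustration.

```python
import unicodedata


def _is_invisible(ch):
    # Cf covers zero width space/joiners, BOM and soft hyphen; Cc covers
    # control characters. Whitespace never belongs inside a DOI either.
    return unicodedata.category(ch) in ("Cf", "Cc") or ch.isspace()


def invisible_chars(doi):
    """Return (index, character name) pairs for suspicious characters."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(doi)
        if _is_invisible(ch)
    ]


def strip_invisible(doi):
    """Return the DOI with the suspicious characters removed."""
    return "".join(ch for ch in doi if not _is_invisible(ch))
```

Running `invisible_chars` over the raw (unsanitised) list would show whether this is a problem here at all before anything gets stripped.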
Andrew.
PS: really pleased that the one September 2002 DOI is still working fine :-)