Hey all,
we just released a dataset of scholarly citations in the English Wikipedia by PubMed / PubMed Central ID.
http://dx.doi.org/10.6084/m9.figshare.1299540
The dataset currently includes the first known occurrence of a PMID or PMCID citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia. We’re planning on expanding this dataset to include other types of scholarly identifier soon.
Feel free to share this with anyone interested or spread the word via: https://twitter.com/WikiResearch/status/562422538613956608
Dario and Aaron
Dario Taraborelli, 03/02/2015 03:06:
The dataset currently includes the first known occurrence of a PMID or PMCID citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia.
Do you accept patches for inclusion of other wikis? The easiest way to include all Wikimedia projects is probably to use the externallinks table; then we can see how big a difference there is.
Nemo
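As a rough illustration of the externallinks idea above, here is a minimal Python sketch that pulls PMIDs out of external link URLs. It assumes rows arrive as (page_id, url) pairs, roughly the el_from and el_to columns of that table, and that PubMed citations are stored as www.ncbi.nlm.nih.gov/pubmed/<id> URLs. Both the URL pattern and the function names are assumptions for illustration, not part of the released script.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed URL shape for PubMed links; protocol-relative URLs are accepted too.
PUBMED_URL_RE = re.compile(
    r"^(?:https?:)?//www\.ncbi\.nlm\.nih\.gov/pubmed/(\d+)/?(?:[?#].*)?$"
)


def pmid_from_url(url: str) -> Optional[str]:
    """Return the PMID embedded in a PubMed URL, or None if the URL does not match."""
    match = PUBMED_URL_RE.match(url.strip())
    return match.group(1) if match else None


def pmids_by_page(rows: Iterable[Tuple[int, str]]) -> Dict[int, Set[str]]:
    """Group PMIDs by page id, given (page_id, url) pairs (hypothetical input shape)."""
    result: Dict[int, Set[str]] = defaultdict(set)
    for page_id, url in rows:
        pmid = pmid_from_url(url)
        if pmid is not None:
            result[page_id].add(pmid)
    return dict(result)
```

Note that a link table of this kind only describes the links a page currently has, so this sketch cannot say when a citation was first added.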
Hi Nemo
We definitely welcome patches and pull requests [1]. This is our current priority list (subject to other priorities unrelated to this project):
1. add other identifiers (DOIs are next)
2. include other languages / projects
3. generate recurring reports (e.g. once a month)
Aaron, does that sound about right? Also note that other people on this list (Max, Daniel) are working on real-time reporting of DOI citations in collaboration with CrossRef.
D
[1] https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia
+1. Right now, we can incorporate other projects by simply running the same script on other XML dumps. We'll likely want to set up a job that tracks the creation of new historical dumps so that we can produce new, updated ID dumps ASAP.
If we drop the requirement of knowing when a citation was first added to an article, we could use the externallinks tables. That would allow us to generate these datasets much faster. I'd like to only pursue this option if we find that processing the dumps becomes difficult to do on a monthly basis. Right now, it doesn't look like that will be the case.
The realtime reporting project sounds interesting. Is there a project page or some code we could review?
-Aaron
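To make the "first known occurrence" idea discussed above concrete, here is a minimal Python sketch. It assumes the revisions of one article have already been read out of an XML dump into simple records; the Revision type, the function name and both regular expressions are assumptions for illustration and are not taken from the released script.

```python
import re
from typing import Dict, Iterable, NamedTuple, Tuple


class Revision(NamedTuple):
    rev_id: int
    timestamp: str  # ISO 8601 UTC, so string order matches chronological order
    text: str


# Assumed patterns for illustration; the released script may match differently.
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d+)", re.IGNORECASE)
PMCID_RE = re.compile(r"\bpmc\s*[=:]?\s*(?:pmc)?(\d+)", re.IGNORECASE)


def first_occurrences(
    revisions: Iterable[Revision],
) -> Dict[Tuple[str, str], Revision]:
    """Map each (id_type, id) to the earliest revision whose text contains it."""
    first: Dict[Tuple[str, str], Revision] = {}
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        found = {("pmid", i) for i in PMID_RE.findall(rev.text)}
        found |= {("pmc", i) for i in PMCID_RE.findall(rev.text)}
        for key in found:
            # setdefault keeps the earliest revision and ignores later ones
            first.setdefault(key, rev)
    return first
```

Because every revision has to be scanned, this approach needs the full history dump, which is the cost that the externallinks shortcut would avoid.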
On the subject of DOIs, I just spotted this (related) announcement from Altmetrics:
http://www.altmetric.com/blog/new-source-alert-wikipedia/
"We capture [Wikipedia use] as a mention and add it to the details page for that output. Users can then click on the heading to be taken to the original Wikipedia article, on the username of the person who wrote the mention to see their profile on Wikipedia, and on the date stamp to view the edit record for that article. ... Because we’re looking for lots of different identifiers and have some constraints around auditability – every mention we collect has to have an author and a timestamp – we wrote our own system to get the data out in the format we need but we’re hoping to share tips, approaches and data with others in the community."
So it looks like they've independently created a similar method, which - reading between the lines - is picking up DOIs as well as PMIDs, etc.
Andrew.
Andrew,
Most altmetrics services (ImpactStory, Plum Analytics, PLOS ALM, altmetric.com) track DOI mentions in Wikipedia. They are often limited to the English Wikipedia only and rely on the MediaWiki search API as a data source. Their approach is also different from ours: they match an exact occurrence of a given DOI (which is relatively straightforward) as opposed to extracting all DOI mentions (more challenging). I created a draft project page on enwiki [1] to aggregate all information about altmetrics data, research and services related to English Wikipedia: contributions are welcome.
Finally, allow me to be pedantic and point out that altmetric.com is not “Altmetrics”: altmetric.com is a proprietary service by Macmillan that applies specific metrics, primarily focused on social media, to scholarly articles. “altmetrics” is the name of the broader concept, the movement and the scholarship behind it.
Dario
[1] https://en.wikipedia.org/wiki/Draft:WP:Altmetrics
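The difference mentioned above between matching one known DOI and extracting all DOI mentions can be sketched in Python as follows. The regular expression is an assumption for illustration (a "10." prefix, a numeric registrant code, a slash, then a suffix that ends at whitespace or common wikitext delimiters); it is not the pattern used by any of the services or scripts discussed in this thread.

```python
import re
from typing import Set

# Assumed DOI shape; real suffixes can contain almost any character, which is
# why deciding where a DOI ends is the hard part of extraction.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|{}\[\]<>\"]+")


def mentions_doi(wikitext: str, doi: str) -> bool:
    """Exact-match approach: is this one known DOI present in the text?"""
    return doi.lower() in wikitext.lower()


def extract_dois(wikitext: str) -> Set[str]:
    """Extraction approach: collect every DOI-like string found in the text."""
    found = set()
    for candidate in DOI_RE.findall(wikitext):
        # Strip trailing punctuation that is more likely prose than part of the DOI
        found.add(candidate.rstrip(".,;)").lower())
    return found
```

The exact-match function needs the DOI up front, so it suits a service looking up its own catalogue; the extraction function needs no prior list but depends on heuristics about where each identifier ends.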
FYI, repository now added to listing. https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia
I'll be generalizing the script to extract new types of identifiers soon. Pull requests welcome.