+1. Right now, we can incorporate other projects by simply running the same script on other XML dumps. We'll likely want to set up a job that tracks the creation of new historical dumps so that we can produce new, updated ID dumps ASAP.
If we drop the requirement of knowing when a citation was first added to an article, we could use the externallinks tables. That would allow us to generate these datasets much faster. I'd only like to pursue that option if processing the dumps on a monthly basis turns out to be difficult. Right now, it doesn't look like that will be the case.
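For reference, the core of the dump-based approach can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual script: it assumes an article's revisions arrive in chronological order as (rev_id, timestamp, text) tuples, and the regex and function names are made up for the example.

```python
import re

# Illustrative pattern: match PMID citations like "pmid=12345" or "PMID 12345".
# The real script's matching rules may differ.
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d{1,8})\b", re.IGNORECASE)

def first_pmid_occurrences(revisions):
    """Given (rev_id, timestamp, text) tuples in chronological order,
    return {pmid: (rev_id, timestamp)} for the first revision in which
    each PMID appears."""
    first_seen = {}
    for rev_id, timestamp, text in revisions:
        for match in PMID_RE.finditer(text or ""):
            pmid = match.group(1)
            if pmid not in first_seen:
                first_seen[pmid] = (rev_id, timestamp)
    return first_seen

# Toy revision history for one article (made-up data).
revs = [
    (100, "2010-01-01", "Some article text."),
    (101, "2010-02-01", "Added a citation {{cite journal |pmid=12345}}."),
    (102, "2010-03-01", "Still cites pmid=12345 and adds PMID 67890."),
]
print(first_pmid_occurrences(revs))
# → {'12345': (101, '2010-02-01'), '67890': (102, '2010-03-01')}
```

The externallinks-table shortcut would skip this per-revision scan entirely, which is why it's faster but loses the first-added timestamp.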
The realtime reporting project sounds interesting. Is there a project page or some code we could review?
-Aaron
On Tue, Feb 3, 2015 at 9:28 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
Hi Nemo
The dataset currently includes the first known occurrence of a PMID or PMCID citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia.
Do you accept patches for inclusion of other wikis? The easiest way to include all Wikimedia projects is probably to use the external links table; we can see how big a difference there is.
We definitely welcome patches and pull requests [1]. This is our current priority list (subject to other priorities unrelated to this project):
- add other identifiers (DOIs are next)
- include other languages / projects
- generate recurring reports (e.g. once a month)
Aaron, does that sound about right? Also note that other people on this list (Max, Daniel) are working on real-time reporting of DOI citations in collaboration with CrossRef.
D
[1] https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia