+1. Right now, we can incorporate other projects by simply running the same script on other XML dumps. We'll likely want to set up a job that tracks the creation of new historical dumps so that we can produce new, updated ID dumps ASAP.
If we drop the requirement of knowing when a citation was first added to an article, we could use the externallinks tables. That would allow us to generate these datasets much faster. I'd like to pursue this option only if we find that processing the dumps becomes difficult to do on a monthly basis. Right now, it doesn't look like that will be the case.
The realtime reporting project sounds interesting. Is there a project page or some code we could review?
-Aaron