Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when it could be used for this inquiry--either to examine all of the citations or a sensible subset.
If I see correctly, you still haven't received an answer about the data available.
It's true that the Figshare item for https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
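If you just want a feel for what mwcites extracts before running it on a full dump, the core idea is easy to approximate. Here is a minimal, hypothetical stand-in that pulls DOIs out of a chunk of wikitext with a crude regex; mwcites itself uses more careful per-identifier extractors (DOI, PubMed, ISBN, arXiv) and parses whole XML dumps, so treat this only as an illustrative sketch:

```python
import re

# Crude DOI pattern: "10.", a registrant prefix, "/", then everything up to
# whitespace or wikitext delimiters. Real extractors are more robust.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}>\]"]+')

def extract_dois(wikitext):
    """Return the list of DOI-like strings found in a chunk of wikitext."""
    return DOI_RE.findall(wikitext)

sample = '{{cite journal |doi=10.1371/journal.pone.0038869 |title=...}}'
print(extract_dois(sample))  # ['10.1371/journal.pone.0038869']
```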
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
Once you extract the list of works, the fun begins. You'll need to intersect it with other data sources (Wikidata, ORCID, others?) and account for a number of factors until you manage to find a subset of the data with a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by:
* year of publication (recent enough to have good data, but old enough for the work to be cited elsewhere and to be archived after embargoes);
* country or institution (some probably have better ORCID coverage);
* field/discipline and language;
* open access status (per Unpaywall);
* number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article, which is not the same as a link buried among a thousand others in a big article);
* time or duration of the addition (with one of the various diff extraction libraries, content persistence data, or possibly a historical eventstream if such a thing is available).
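As a concrete example for the pageviews factor, the AQS REST endpoint linked above can be queried per article. A sketch of building the request URL (the article title and date range are placeholders, not part of any real analysis):

```python
from urllib.parse import quote

def pageviews_url(project, title, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build an AQS per-article pageviews URL (dates are YYYYMMDD).

    AQS expects underscores instead of spaces in titles, and the title
    itself must be percent-encoded.
    """
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/{access}/{agent}/{quote(title.replace(' ', '_'), safe='')}/"
        f"{granularity}/{start}/{end}"
    )

# Hypothetical example: views of the enwiki article "Open access" in Jan 2019.
url = pageviews_url("en.wikipedia", "Open access", "20190101", "20190131")
print(url)
```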
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is simply too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (they'll probably be on the order of 10^5 rather than 10^6), remove mass events or outliers, sample 500-1000 of them randomly and verify the required data manually.
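The filter-and-sample step at the end is mechanical. A sketch, assuming you already have the removal events as a list of records with a per-edit removal count (the field names, outlier threshold and sample size below are all hypothetical and should be shaped to your diff extraction output):

```python
import random

def sample_removals(events, max_removed_per_edit=50, k=500, seed=42):
    """Drop mass-removal edits, then draw a random sample for manual review.

    `events` is assumed to be a list of dicts like
    {"rev_id": ..., "citations_removed": ...}; both field names are
    hypothetical. The threshold crudely excludes bot runs and other
    outliers that would dominate a manual sample.
    """
    filtered = [e for e in events
                if e["citations_removed"] <= max_removed_per_edit]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(filtered, min(k, len(filtered)))

# Toy data: 1000 fake events, one in every hundred a mass removal.
events = [{"rev_id": i, "citations_removed": 1 if i % 100 else 500}
          for i in range(1000)]
sample = sample_removals(events, k=200)
print(len(sample))  # 200
```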
As usual it will be impossible to have an objective assessment of whether a given citation was really (in)appropriate in its context according to the (English or whatever) Wikipedia guidelines. To test that too, you would need to replicate one of the various studies of the gender imbalance of peer review, perhaps one of those which tried to assess the impact of a double-blind peer review system on the gender imbalance. However, because the sources are already published, you'd need to de-gender the information yourself and make sure the participants perform their assessment in some controlled environment where they have no access to any gendered information (i.e. where you cut them off from the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.