Greg, 22/08/19 06:19:
I do not know the current status of wikicite or
if/when this
could be used for this inquiry--either to examine all, or a sensible subset
of the citations.
If I understand correctly, you still haven't received an answer about the
available data.
It's true that the Figshare item for
<https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia>
was deleted (I've asked about it on the talk page), but it's trivial to
run mwcites (https://pypi.org/project/mwcites/) and extract the data
yourself, at least for citations which use an identifier.
Some example datasets produced this way:
https://zenodo.org/record/15871
https://zenodo.org/record/55004
https://zenodo.org/record/54799
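If you just want to see what that extraction looks like, here's a rough
Python sketch of the kind of pattern matching mwcites automates. The
regexes below are my own approximations for DOI/PMID/arXiv, not the
library's patterns, and the real tool also walks the full XML dumps:

import re

# Rough, illustrative patterns for the identifier types mwcites targets.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+')
PMID_RE = re.compile(r'\bpmid\s*=\s*(\d+)', re.IGNORECASE)
ARXIV_RE = re.compile(r'\barxiv\s*=\s*([\w.\-/]+)', re.IGNORECASE)

def extract_identifiers(wikitext):
    """Yield (type, id) pairs found in one page's wikitext."""
    for match in DOI_RE.finditer(wikitext):
        # Trim trailing punctuation that belongs to the wiki markup, not the DOI.
        yield ("doi", match.group(0).rstrip(".,;"))
    for match in PMID_RE.finditer(wikitext):
        yield ("pmid", match.group(1))
    for match in ARXIV_RE.finditer(wikitext):
        yield ("arxiv", match.group(1))

if __name__ == "__main__":
    sample = (
        "{{cite journal |title=Example |doi=10.1371/journal.pone.0115253 "
        "|pmid=25626164 |journal=PLOS ONE}}"
    )
    for id_type, value in extract_identifiers(sample):
        print(id_type, value)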
Once you extract the list of works, the fun begins. You'll need to
intersect it with other data sources (Wikidata, ORCID, others?) and account
for a number of factors until you manage to find a subset of the data
which has a sufficiently high signal-to-noise ratio. For instance, you might
need to filter or normalise by:
* year of publication (a year recent enough to have good data, but old
enough for the work to be cited elsewhere and archived after any embargo);
* country or institution (some probably have better ORCID coverage than others);
* field/discipline and language;
* open access status (per Unpaywall);
* number of expected pageviews and clicks (for instance using
<https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews> and
<https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases>,
as in the sketch after this list; a link from 10k articles on asteroids or
proteins is not the same as being the lone link from a popular article,
which is not the same as a link buried among a thousand others in a big
article);
* time or duration of the addition (with one of the various diff
extraction libraries, content persistence data, or possibly a historical
EventStreams feed if such a thing is available).
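For the open access and pageviews points, the lookups are simple REST
calls. Here's a rough Python sketch against the public AQS per-article
pageviews endpoint and the Unpaywall API; the article, DOI, dates and
contact email are placeholders to replace with your own:

import requests

# Placeholders: swap in your own contact email and date range.
CONTACT_EMAIL = "you@example.org"
HEADERS = {"User-Agent": "citation-study/0.1 (%s)" % CONTACT_EMAIL}

def monthly_pageviews(article, start="2019010100", end="2019070100",
                      project="en.wikipedia"):
    """Monthly pageview counts for one article, via the AQS REST API."""
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "%s/all-access/user/%s/monthly/%s/%s"
           % (project, article.replace(" ", "_"), start, end))
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return [(item["timestamp"], item["views"]) for item in response.json()["items"]]

def oa_status(doi):
    """Open access status of a work according to Unpaywall."""
    url = "https://api.unpaywall.org/v2/%s" % doi
    response = requests.get(url, params={"email": CONTACT_EMAIL}, headers=HEADERS)
    response.raise_for_status()
    data = response.json()
    return data.get("is_oa"), (data.get("best_oa_location") or {}).get("host_type")

if __name__ == "__main__":
    print(monthly_pageviews("Open access")[:3])
    print(oa_status("10.1371/journal.pone.0115253"))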
To avoid having to invent everything yourself, maybe you can reuse the
method of some similar study, for instance the one on the open-access
citation advantage, or one of the many that studied the gender imbalance
of citations and peer review in journals.
However, it's very possible that the noise is just too much for a
general computational method. You might consider a more manual approach
on a sample of relevant events, for instance the *removal* of citations,
which in my opinion is more significant than additions.* You could
extract all the diffs which removed a citation from an article in the
last N years (probably on the order of 10^5 rather than 10^6), remove
mass events or outliers, sample 500-1000 of them at random and verify
the required data manually.
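The sampling step itself is trivial; something like this, assuming you've
already dumped the removal events into a TSV (the file name and columns
here are made up):

import csv
import random

random.seed(42)  # reproducible sample

# Hypothetical input: one row per removed citation, with a header line like
# rev_id <tab> page_title <tab> removed_identifier <tab> timestamp
with open("citation_removals.tsv", newline="", encoding="utf-8") as f:
    events = list(csv.DictReader(f, delimiter="\t"))

# Drop mass removals: keep one row per revision, and only revisions that
# removed a handful of citations at most.
per_revision = {}
for row in events:
    per_revision.setdefault(row["rev_id"], []).append(row)
singletons = [rows[0] for rows in per_revision.values() if len(rows) <= 3]

# Draw the sample to be checked by hand.
sample = random.sample(singletons, min(750, len(singletons)))
with open("sample_to_review.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=events[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(sample)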
As usual, it will be impossible to have an objective assessment of
whether a given citation was really (in)appropriate in its context
according to the (English or whatever) Wikipedia guidelines. To test
that too, you should replicate one of the various studies of the gender
imbalance of peer review, perhaps one of those which tried to assess the
impact of a double-blind peer review system on that imbalance.
However, because the sources are already published, you'd need to
de-gender the information yourself and make sure the
participants perform their assessment in some controlled environment
where they don't have access to any gendered information (i.e. where you
cut them off from the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular
search engine gives me, after glancing at the abstract and maybe the
journal home page; but if I remove an existing citation, hopefully I've
at least assessed its content and made a judgement about it, apart from
cases of mass removals for specific problems with certain articles or
publication venues.