Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when it could be used for this inquiry--either to examine all of the citations or a sensible subset.
If I see correctly, you still haven't received an answer about the data available.
It's true that the Figshare item for https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
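If you just want a feel for what mwcites extracts before running it on a full dump, the core idea is easy to approximate. Here is a minimal, hypothetical stand-in that pulls DOIs out of a chunk of wikitext with a crude regex; mwcites itself uses more careful per-identifier extractors (DOI, PubMed, ISBN, arXiv) and parses whole XML dumps, so treat this only as an illustrative sketch:

```python
import re

# Crude DOI pattern: "10.", a registrant prefix, "/", then everything up to
# whitespace or wikitext delimiters. Real extractors are more robust.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}>\]"]+')

def extract_dois(wikitext):
    """Return the list of DOI-like strings found in a chunk of wikitext."""
    return DOI_RE.findall(wikitext)

sample = '{{cite journal |doi=10.1371/journal.pone.0038869 |title=...}}'
print(extract_dois(sample))  # ['10.1371/journal.pone.0038869']
```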
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
Once you extract the list of works, the fun begins. You'll need to intersect it with other data sources (Wikidata, ORCID, others?) and account for a number of factors until you manage to find a subset of the data with a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by:
* year of publication (recent enough to have good data, but old enough for the work to be cited elsewhere and to be archived after embargoes);
* country or institution (some probably have better ORCID coverage);
* field/discipline and language;
* open access status (per Unpaywall);
* number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article, which is not the same as a link buried among a thousand others in a big article);
* time or duration of the addition (with one of the various diff extraction libraries, content persistence data, or possibly a historical eventstream if such a thing is available).
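As a concrete example for the pageviews factor, the AQS REST endpoint linked above can be queried per article. A sketch of building the request URL (the article title and date range are placeholders, not part of any real analysis):

```python
from urllib.parse import quote

def pageviews_url(project, title, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build an AQS per-article pageviews URL (dates are YYYYMMDD).

    AQS expects underscores instead of spaces in titles, and the title
    itself must be percent-encoded.
    """
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/{access}/{agent}/{quote(title.replace(' ', '_'), safe='')}/"
        f"{granularity}/{start}/{end}"
    )

# Hypothetical example: views of the enwiki article "Open access" in Jan 2019.
url = pageviews_url("en.wikipedia", "Open access", "20190101", "20190131")
print(url)
```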
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is simply too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (they'll probably be on the order of 10^5 rather than 10^6), remove mass events or outliers, sample 500-1000 of them randomly and verify the required data manually.
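The filter-and-sample step at the end is mechanical. A sketch, assuming you already have the removal events as a list of records with a per-edit removal count (the field names, outlier threshold and sample size below are all hypothetical and should be shaped to your diff extraction output):

```python
import random

def sample_removals(events, max_removed_per_edit=50, k=500, seed=42):
    """Drop mass-removal edits, then draw a random sample for manual review.

    `events` is assumed to be a list of dicts like
    {"rev_id": ..., "citations_removed": ...}; both field names are
    hypothetical. The threshold crudely excludes bot runs and other
    outliers that would dominate a manual sample.
    """
    filtered = [e for e in events
                if e["citations_removed"] <= max_removed_per_edit]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(filtered, min(k, len(filtered)))

# Toy data: 1000 fake events, one in every hundred a mass removal.
events = [{"rev_id": i, "citations_removed": 1 if i % 100 else 500}
          for i in range(1000)]
sample = sample_removals(events, k=200)
print(len(sample))  # 200
```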
As usual it will be impossible to have an objective assessment of whether a given citation was really (in)appropriate in its context according to the (English or whatever) Wikipedia guidelines. To test that too, you would need to replicate one of the various studies of the gender imbalance of peer review, perhaps one of those which tried to assess the impact of a double-blind peer review system on the gender imbalance. However, because the sources are already published, you'd need to de-gender the information yourself and make sure the participants perform their assessment in some controlled environment where they have no access to any gendered information (i.e. where you cut them off from the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.