Thanks, Isaac and Federico. These notes and links are very helpful--and will require some time to process. As for how many years I have to work on this, I'm retired! In truth, I keep hoping that someone on this list will express interest in working on these matters. The questions are all very interesting and quite relevant. The idea of studying removed citations is both complex and compelling.
Greg
On Mon, Aug 26, 2019 at 6:49 AM Isaac Johnson isaac@wikimedia.org wrote:
Regarding data, I have not been a part of these projects but I think that I can help a bit with working links:
- The (I believe) original dataset can also be found here:
https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
- A newer version of this dataset was produced that also included
information about whether the source was openly available and its topic:
- Meta page: https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
- Figshare: https://figshare.com/articles/Accessibility_and_topics_of_citations_with_ide...
On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when this could be used for this inquiry--either to examine all, or a sensible subset, of the citations.
If I see correctly, you still have not received an answer about the available data.
It's true that the Figshare item for https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wiki...
was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
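To illustrate the identifier-extraction idea, here is a minimal sketch of pulling DOIs out of wikitext with a regular expression. This is illustration only, not the mwcites method: mwcites handles many more identifier types (PMIDs, ISBNs, arXiv ids) and edge cases, and the pattern below is a deliberately rough approximation.

```python
import re

# Rough DOI pattern for illustration only; mwcites is far more careful.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|<>}\]]+')

def extract_dois(wikitext):
    """Return the DOI-like identifiers found in a wikitext string."""
    return DOI_RE.findall(wikitext)

sample = "{{cite journal |doi=10.1371/journal.pone.0038869 |title=Example}}"
print(extract_dois(sample))
```

Run over a dump of revision texts, this gives a per-revision list of cited identifiers to build on.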
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
Once you extract the list of works, the fun begins. You'll need to intersect with other data sources (Wikidata, ORCID, other?) and account for a number of factors until you manage to find a subset of the data which has a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by
- year of publication (some year recent enough to have good data but old enough to allow the work to be cited elsewhere, be archived after embargos);
- country or institution (some probably have better ORCID coverage);
- field/discipline and language;
- open access status (per Unpaywall);
- number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article, which is not the same as a link buried among a thousand others on a big article);
- time or duration of the addition (with one of the various diff extraction libraries, content persistence data, or possibly a historical eventstream if such a thing is available).
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is just too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (they'll probably be on the order of 10^5 rather than 10^6), remove some mass events or outliers, sample 500-1000 of them at random, and verify the required data manually.
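The extract-then-sample step can be sketched as follows. The revision pairs here are toy placeholders (in practice they would stream from the history dumps), and the DOI regex is a rough stand-in for a proper identifier extractor like mwcites:

```python
import random
import re

# Rough identifier pattern; a real pipeline would use mwcites or similar.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s}|]+')

def removed_identifiers(old_text, new_text):
    """Identifiers present in the parent revision but gone from the child."""
    return set(DOI_RE.findall(old_text)) - set(DOI_RE.findall(new_text))

# Toy (parent, child) wikitext pairs standing in for streamed diffs:
revisions = [
    ("cited {{doi|10.1000/xyz123}} here", "cited nothing here"),
    ("kept 10.5555/abc as-is", "kept 10.5555/abc as-is"),
]
removal_events = [i for i, (old, new) in enumerate(revisions)
                  if removed_identifiers(old, new)]
random.seed(0)  # fixed seed so the sample is reproducible
sampled = random.sample(removal_events, k=min(500, len(removal_events)))
```

The manual verification would then happen only on `sampled`, keeping the hand-checked workload bounded regardless of how many removal events exist.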
As usual, it will be impossible to have an objective assessment of whether a citation was really (in)appropriate in its context according to the (English or whatever) Wikipedia guidelines. To test that too, you could replicate one of the various studies of gender imbalance in peer review, perhaps one of those which tried to assess the impact of a double-blind peer review system on the imbalance. However, because the sources are already published, you'd need to provide the de-gendered information yourself and make sure the participants perform their assessment in some controlled environment where they don't have access to any gendered information (i.e. where you cut them off from the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation