Hi Jane,
Thanks for the link. It's clear that there is a lot of work being done, and even more left to do.
I've been thinking about what you said about second authors and was wondering if instead of fixing it (or in addition to fixing it), it would make sense to put some sort of tag on the page itself (like the ones I see questioning notability or requests for additional citations). Something along the lines of authors missing from a particular citation and how to fix that, or no work by women cited in this article (if this is the case). It strikes me that by fixing it yourself, you are doing great work, but that maybe it also makes sense to spread awareness about these issues to the broader editing community so more people are thinking about it/doing it. At any rate, I thought I'd float the idea. Such a tag/the response (if any), could also be interesting to study, though perhaps something like this already exists and I'm just not aware of it, or perhaps there is good reason not to do it.
All best, Greg
On Tue, Aug 27, 2019 at 5:00 AM wiki-research-l-request@lists.wikimedia.org wrote:
Send Wiki-research-l mailing list submissions to wiki-research-l@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/wiki-research-l or, via email, send a message with subject or body 'help' to wiki-research-l-request@lists.wikimedia.org
You can reach the person managing the list at wiki-research-l-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Wiki-research-l digest..."
Today's Topics:
- Re: gender balance of wikipedia citations (Greg)
- Re: gender balance of Wikipedia citations (Jane Darnell)
Message: 1
Date: Mon, 26 Aug 2019 18:56:12 -0700
From: Greg thenatureprogram@gmail.com
To: Isaac Johnson isaac@wikimedia.org
Cc: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] gender balance of wikipedia citations
Message-ID: <CAOO9DNv92bVR2COT2XmpHDU5kJOvD0yD3bahG+6Fkuma+HYDEg@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Thanks, Isaac and Federico. These notes and links are very helpful--and will require some time to process. As for how many years I have to work on this, I'm retired! In truth, I keep hoping that someone on this list will express interest in working on these matters. The questions are all very interesting and quite relevant. The idea of studying removed citations is both complex and compelling.
Greg
On Mon, Aug 26, 2019 at 6:49 AM Isaac Johnson isaac@wikimedia.org wrote:
Regarding data, I have not been a part of these projects but I think that I can help a bit with working links:
- The (I believe) original dataset can also be found here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
- A newer version of this dataset was produced that also included information about whether the source was openly available and its topic:
** Meta page: https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
** Figshare: https://figshare.com/articles/Accessibility_and_topics_of_citations_with_ide...
On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when this could be used for this inquiry--either to examine all, or a sensible subset of the citations.
If I see correctly, you still have not received an answer about the available data.
It's true that the Figshare item for <https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wiki...> was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
Once you extract the list of works, the fun begins. You'll need to intersect with other data sources (Wikidata, ORCID, other?) and account for a number of factors until you manage to find a subset of the data which has a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by
- year of publication (some year recent enough to have good data but old enough to allow the work to be cited elsewhere, be archived after embargos);
- country or institution (some probably have better ORCID coverage);
- field/discipline and language;
- open access status (per Unpaywall);
- number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and <https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases>; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article which is not the same as a link buried among a thousand others on a big article);
- time or duration of the addition (with one of the various diff extraction libraries, content persistence data or possibly historical eventstream if such a thing is available).
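The filtering step listed above could be sketched roughly as follows. This is only an illustration: the `CitedWork` record and all of its field names are hypothetical, and real values would have to come from joining the extracted identifiers with Wikidata, ORCID, Unpaywall and so on.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class CitedWork:
    # Hypothetical record layout; none of these field names come from
    # mwcites or any other existing dataset.
    doi: str
    year: Optional[int]
    discipline: Optional[str]
    language: Optional[str]
    is_open_access: Optional[bool]


def filter_works(
    works: Iterable[CitedWork],
    min_year: int,
    max_year: int,
    discipline: Optional[str] = None,
    language: Optional[str] = None,
    open_access: Optional[bool] = None,
) -> List[CitedWork]:
    """Keep works inside the year window that match the optional criteria."""
    kept = []
    for work in works:
        # Works with an unknown year cannot be placed in the window, so drop them.
        if work.year is None or not (min_year <= work.year <= max_year):
            continue
        if discipline is not None and work.discipline != discipline:
            continue
        if language is not None and work.language != language:
            continue
        if open_access is not None and work.is_open_access != open_access:
            continue
        kept.append(work)
    return kept
```

Dropping records with missing values is one possible choice among several; keeping them in a separate "unknown" bucket may be preferable if the missingness itself turns out to be informative.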
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is just too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (probably they'll be in the order of 10^5 rather than 10^6), remove some massive events or outliers, sample 500-1000 of them randomly and verify the required data manually.
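A minimal sketch of that sampling step, assuming the removals have already been extracted into `(revision_id, citation_id)` pairs. The pair layout, the function name and the `max_per_revision` cut-off for "massive events" are all assumptions made for illustration, not part of any existing tool.

```python
import random
from collections import Counter
from typing import List, Tuple

Removal = Tuple[int, str]  # (revision_id, citation_id), a hypothetical layout


def sample_removals(
    removals: List[Removal],
    max_per_revision: int = 20,
    sample_size: int = 500,
    seed: int = 0,
) -> List[Removal]:
    """Drop mass-removal revisions, then draw a reproducible random sample."""
    # Count how many citations each revision removed.
    per_revision = Counter(revision_id for revision_id, _ in removals)
    # Treat revisions that removed many citations at once as outliers.
    candidates = [r for r in removals if per_revision[r[0]] <= max_per_revision]
    rng = random.Random(seed)
    # random.sample raises ValueError if asked for more items than exist.
    k = min(sample_size, len(candidates))
    return rng.sample(candidates, k)
```

Fixing the seed keeps the sample reproducible, so that the manual verification can be repeated or audited on exactly the same items.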
As usual it will be impossible to have an objective assessment of whether that citation was really (in)appropriate in that context according to the (English or whatever) Wikipedia guidelines. To test that too, you should replicate one of the various studies of the gender imbalance of peer review, perhaps one of those which tried to assess the impact of a double blind peer review system on the gender imbalance. However, because the sources are already published, you'd need to provide the agendered information yourself and make sure the participants perform their assessment in some controlled environment where they don't have access to any gendered information (i.e. where you cut them off the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation
Message: 2
Date: Tue, 27 Aug 2019 08:00:45 +0200
From: Jane Darnell jane023@gmail.com
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations
Message-ID: <CAFVcA-HqVicR0k65J4iox0PD=oc3HBPMZLfXVO5zqkFD+EnSxQ@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Greg,
Yes, that's what I meant. On Wikipedia you get what you measure, so many Wikipedians are page-creators and page-hit junkies because we can measure that. The trick to motivating editors is giving them other measurements for progress. Here is the link to the Women writers Wikiproject and as you scroll down you can see what is measured.
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_writers
Jane
On Tue, Aug 27, 2019 at 3:39 AM Greg thenatureprogram@gmail.com wrote:
Thanks for sharing your experience and thoughts, Jane. I did not know this was happening--I'm hardly an expert, so that's not surprising, and yet it's still very troubling to hear. I'm not sure what you mean by setting up a Wikiproject. Do you mean a project on ways to study this gap--i.e., the ideas that have been floated in this thread to this point? Or are you thinking of something else?
Greg
On Mon, Aug 26, 2019 at 5:00 AM <wiki-research-l-request@lists.wikimedia.org> wrote:
Today's Topics:
- Re: gender balance of Wikipedia citations (WereSpielChequers)
- Re: gender balance of Wikipedia citations (Greg)
- Re: sockpuppets and how to find them sooner (Federico Leva (Nemo))
- Re: gender balance of Wikipedia citations (Jane Darnell)
- Re: gender balance of wikipedia citations (Federico Leva (Nemo))
Message: 1
Date: Sun, 25 Aug 2019 14:28:25 +0100
From: WereSpielChequers werespielchequers@gmail.com
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations
Message-ID: <CAAanWP3qJnMpLB4tr9Eqt4EJLg2kCihkb50UY-d8=ShNONhSAA@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Hi Greg,
One of the major step changes in the early growth of the English Wikipedia was when a bot called RamBot created stub articles on US places. I think they were cited to the census. Others have created articles on rivers in countries and various other topics by similar programmatic means. Nowadays such article creation is unlikely to get consensus on the English Wikipedia, but there are some languages which are very open to such creations and have them by the million.
I'm not sure if the fastest updating of existing articles is automated or just semiautomated. But looking at the bot requests page, it certainly looks like some people are running such maintenance bots: "updating GDP by country" is a current bot request. https://en.wikipedia.org/wiki/Wikipedia:Bot_requests
I'm not sure how "the ease of a source for purposes of converting into a table and generating a separate article for each row" relates to gender. But I suspect "number of times cited in wikipedia" deserves less kudos than "number of times cited in academia".
WSC
On Sun, 25 Aug 2019 at 05:22, Greg thenatureprogram@gmail.com wrote:
Thanks again, Kerry. I am hoping that someone with access to more resources (knowledge, support, etc) than I have will look into this.
A few more thoughts/questions:
1. The link to the citation dataset from the Medium article ("What are the ten most cited sources on Wikipedia? Let’s ask the data.") is broken.
2. As far as I can tell, every named author in the top ten most cited sources on Wikipedia is male. One piece is by a working group.
3. This line from the Medium piece struck me: "Many of these publications have been cited by Wikipedians across large series of articles using powerful bots and automated tools."
Are citations being added by bots? I'm not sure that I understand that line correctly.
Greg
Message: 2
Date: Sun, 25 Aug 2019 21:16:25 -0700
From: Greg thenatureprogram@gmail.com
To: wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations
Message-ID: <CAOO9DNvGyfvJkzyRq60cSQi-T80mAkUa=vCPkzFbEysfGQqnVg@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Thanks, WSC. All very interesting.
I've been thinking about Wikipedia citations less in terms of kudos and more in terms of a feedback loop. The cited sources get a significant amount of attention (1 click per 200 pageviews is the number I saw recently). When I imagine total Wikipedia traffic, that's huge. How many students are finding sources this way? How many academics? And how many of these citations are finding their way back into academic publications via this mechanism?
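To make the scale concrete, the click-through figure quoted above (1 click per 200 pageviews) can be turned into a rough estimate. The one-million-pageview input in the usage note is a made-up round number chosen for illustration, not a measured value, and the function name is hypothetical.

```python
PAGEVIEWS_PER_REFERENCE_CLICK = 200  # the figure quoted in this message


def expected_reference_clicks(pageviews: int) -> float:
    """Rough estimate of clicks on cited sources for a given number of pageviews."""
    return pageviews / PAGEVIEWS_PER_REFERENCE_CLICK
```

Under that rate, a hypothetical one million pageviews would correspond to about 5,000 clicks through to cited sources, since `expected_reference_clicks(1_000_000)` returns `5000.0`.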
Assuming this is happening to some degree, the gender imbalance of the citations is also reflected. If the Wikipedia imbalance is the same as the one in academia, that's one thing; if it is better on Wikipedia than it is in academia, that's reason to celebrate; if the balance is worse, that's concerning. In fact, if the gender imbalance conforms to my fears instead of my hopes, and is magnified by the massive website traffic, I imagine it could even explain the growth in the citation disparity researchers note in their study of political science texts. (I link to that study in a previous post; it was mentioned in the Washington Post recently)
There is a very real possibility that Wikipedia is making the citation gender gap worse. I think we need to understand what is happening and take immediate action if the news is not good.
Greg
Message: 3
Date: Mon, 26 Aug 2019 10:59:07 +0300
From: "Federico Leva (Nemo)" nemowiki@gmail.com
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org, Aaron Halfaker ahalfaker@wikimedia.org, Kerry Raymond <kerry.raymond@gmail.com>
Subject: Re: [Wiki-research-l] sockpuppets and how to find them sooner
Message-ID: cf2734ff-d2cf-3108-691f-8ecf46125ed7@gmail.com
Content-Type: text/plain; charset=utf-8; format=flowed
Please everyone avoid using jargon specific to the English Wikipedia on this cross-language and cross-wiki mailing list.
Aaron Halfaker, 23/08/19 17:36:
I think embeddings[1] would be a nice way to create a signature.
There is some discussion of acceptable user fingerprinting (presumably to be available to CheckUsers only), other than the usual over-reliance on IP addresses, in particular at <https://meta.wikimedia.org/wiki/Talk:IP_Editing:_Privacy_Enhancement_and_Abu...>.
Federico
Message: 4
Date: Mon, 26 Aug 2019 10:17:46 +0200
From: Jane Darnell jane023@gmail.com
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations
Message-ID: <CAFVcA-G87k26nBMr=-e-+C8o6eG0KQvVihH=f4M40faVNbKkqw@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Greg,
Thanks for worrying. This is a known problem and yes, Wikipedia contributes to the Gendergap in citations and no, it's not an easy fix, since it is the fault of systemic bias in academia. So fewer women are head author on scientific publications, and it is generally only the head author that gets cited on Wikipedia. This is not just a problem with written works in the field of politics. I spend most of my time working on paintings and their documented catalogs, so generally I only notice and fix this problem in art catalogs. Women rarely appear as the lead author mentioned. I will always add them in to descriptions when I add items for their works on Wikidata, but I cannot always find them! Sometimes I can't even create items for them because all I have is a name and a work and nothing else available online anywhere. You see this most often with women who spent entire careers working at a single institution and the institution doesn't bother to promote their work or even list them in exhibition catalogs. With luck there might be a local obituary, but not always. If you have suggestions for how to set up a Wikiproject to tackle this, that would be a good idea. In my onwiki experience the Women-in-Red community can be very positive in their response to gendergap-related issues for women writers.
Jane
Message: 5
Date: Mon, 26 Aug 2019 11:45:09 +0300
From: "Federico Leva (Nemo)" nemowiki@gmail.com
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org, Greg thenatureprogram@gmail.com
Subject: Re: [Wiki-research-l] gender balance of wikipedia citations
Message-ID: 835202af-4653-641e-782e-c619458bdd7f@gmail.com
Content-Type: text/plain; charset=utf-8; format=flowed
Subject: Digest Footer
End of Wiki-research-l Digest, Vol 168, Issue 20
End of Wiki-research-l Digest, Vol 168, Issue 22
FWIW, I think there would be pushback against a quality tag that highlighted little/no citation of women's work (whether we are talking first author or not) in an article. There are two reasons for this. One is the misogyny that really does exist within the English Wikipedia "community" (those who do most of the shouting and hence decision making); they will argue firstly that gender balance of citations doesn't matter, secondly that it is a reflection of the real world, and thirdly that Wikipedia has a policy that it is not there to Right Great Wrongs.
More practically, we know that whole-of-article quality tagging doesn't tend to have a lot of impact in terms of getting people to fix anything, compared to more specific tags like "citation needed", "dubious", "says who" and so on placed on specific pieces of text. People are much more likely to fix a specific problem and then remove the specific tag. Even when a person does respond to a generic tag like "more references needed" and adds in some more references, they rarely remove the generic tag, thinking "well, there's still plenty of scope here to add more references". Who among us is willing to declare "that article is 100% fully referenced by reliable sources"? Nobody, it seems; it's a tag that lingers forever ...
So I think a specific tag to encourage the expansion of "Bloggs et al" citations to full author listings might work. It's a somewhat boring and mechanical task to expand "et al" but we do have people who are happy to contribute in that way. It might even be possible to build a tool to assist them which looks up the paper in WikiCite or Google Scholar etc to extract the full author list as published (just as we have tools to make it easier to fix typos and spelling, disambiguate links and so forth). That would address the problem of women authors not being first cited and lost in the mists of "et al". However, it is unlikely to be obvious to the average contributor whether a paper with the full author list of A.B. Brown, C.D. Jones, E.F. Smith and G.H. Walker has any female authors, so I can't see that it's going to be easy to motivate people to try to find additional citations which do have more female authors.
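As a rough illustration of the first step such a helper tool would need (finding the citations to expand), here is a minimal sketch that scans wikitext for citation templates containing "et al". The regular expressions and the function name are assumptions made for illustration, not an existing tool. The template pattern deliberately ignores nested templates, and the author lookup itself is left out.

```python
import re
from typing import List

# Matches simple, non-nested templates whose name starts with "cite".
CITE_TEMPLATE = re.compile(r"\{\{\s*cite[^{}]*\}\}", re.IGNORECASE)
# Matches "et al", "et al.", "et. al" and "etal" as a standalone token.
ET_AL = re.compile(r"\bet\.?\s*al\b", re.IGNORECASE)


def citations_with_et_al(wikitext: str) -> List[str]:
    """Return the citation templates in `wikitext` that mention "et al"."""
    return [
        match.group(0)
        for match in CITE_TEMPLATE.finditer(wikitext)
        if ET_AL.search(match.group(0))
    ]
```

Each returned template could then be shown to a volunteer, or passed to a lookup step, to fill in the full author list.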
And, as much as gender equity is a wrong I'd like to see righted, I don't want to see campaigns just to "add in more female authored citations" (I call this "citation sprinkling") on Wikipedia. A citation has to be there because it verifies the information in the article and not to meet a gender quota. Remember that for a lot of Wikipedia contributors, academic literature is mostly behind a paywall so they can't actually read more than the title and abstract at best. A "sprinkling" campaign is likely to see citations based only on title and abstract ("well, it sounds like this paper which includes a woman author is talking about this topic") but the paper may not support the specific claim made in the text (indeed, it might say the exact opposite). A sprinkling campaign should only target the Further Reading section whose role is:
"The Further reading section of an article contains a bulleted list of a reasonable number of works which a reader may consult for additional and more detailed coverage of the subject of the article. In articles with numerous footnotes, it probably is not obvious which ones are suitable for further reading. The "Further reading" section can help the readers by listing selected titles without worrying about duplications."
which would avoid the risk of adding a citation that doesn't support the specific claims being made in the article. So maybe it would be possible to add a "skewed gender balance" tag onto the Further reading section and/or External links section whose role is
"Some acceptable links include those that contain further research that is accurate and on-topic, information that could not be added to the article for reasons such as copyright or amount of detail, or other meaningful, relevant content that is not suitable for inclusion in an article for reasons unrelated to its accuracy."
The downside of this idea of adding female authors to the Further Reading and External Links sections is that it's unclear whether anyone ever looks at them. Over 50% of Wikipedia hits are now via mobile device. The mobile render of a Wikipedia article is not the whole article as you see on desktop and laptop; rather, you select the sections you want to read. So for mobile readers we know precisely which sections they are opening. From this we have learned that people in developed countries are generally not reading whole articles but specific sections (suggesting they are seeking answers to a specific need rather than wanting to fully appreciate the topic), and that as a rule they don't open anything after the References, so they aren't looking at Further Reading and External Links anyway. Are desktop/laptop readers looking at them either? We don't really know, as they get the whole article rendered as a single result, and only eye-tracking studies (an expensive type of experiment) would give us this insight with the same accuracy as our mobile data.
As an aside, in less developed countries, readers are more likely to read whole articles on a mobile device. While the reasons for this difference are not proven, I'd be prepared to guess at two interlinked hypotheses. Firstly, such countries have poorer standards of education so people may be using Wikipedia to supplement their limited formal education. Also such countries are more likely to be using rote learning in their education system (valuing the ability to memorise and reproduce) rather than the more problem-solving learning approaches increasingly in use in the education systems of more developed countries. That would also explain whole-of-article viewing rather than selecting specific sub-sections.
In some ways, I think a better solution might be to try to get Google Scholar interested in the issue of gender. What if articles listed on Google Scholar came with a little gender balance score (a bit like hotel ratings)? One blue star (or some other symbol) for one male author, two blue stars for two male authors, one pink and one blue star for a female first author and a male second author, etc. I like the idea because it is a simple-to-understand visual aid that draws attention to gender imbalance more widely but without a specific call to action (which, as I outline above, may backfire if citations get added for gender balance rather than content). It potentially helps address the real-world problem, which would hopefully flow through to Wikipedia. Also, Google Scholar is probably a lot better resourced to build the tools to do the legwork of determining gender (I guess a white star is used when it can't). The risk, though, which Leia has previously mentioned, is that automated tools don't do a great job of getting gender correct, particularly as the tools are often trained on limited data sets (such as mostly white people), making the automated gender guessing of non-white people more likely to be incorrect. However, authors can establish their own Google Scholar profile (if the author's name is underlined, it's a link to their profile). That's a place where they could disclose their gender if they desired, correct Google Scholar's mistaken guess, or demand that Google Scholar not show their gender (whichever is their choice). Hmm, might it lead to catfishing? Authors passing themselves off as a different gender? Hmm ...
Another place where we might explore marking gender in some easily visible way is WikiCite, but frankly I know little about that project so cannot comment on it, nor on the merits of doing it there rather than on Google Scholar. I don't think traditional journal publishers are likely to be keen to show gender balance on their own websites, as I think they would realise it would enable webscraping to reveal their overall gender balance profile, leading to some adverse headlines about "Brandname journals worst for gender equity". Google Scholar has less to fear, unless it was demonstrated that it exhibited stronger gender bias than the journals themselves. I would think that Google Scholar aggregates papers without any regard to the gender of the authors, but I guess it might not aggregate all topic areas equally. For example, if they didn't make much effort to include (say) nursing publications (a more female academic discipline) but went hard on engineering publications (a more male academic discipline), I guess it would skew their author gender balance towards men.
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Greg
Sent: Thursday, 29 August 2019 4:06 AM
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org; jane023@gmail.com
Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations
Hi Jane,
Thanks for the link. It's clear that there is a lot of work being done, and even more left to do.
I've been thinking about what you said about second authors and was wondering if instead of fixing it (or in addition to fixing it), it would make sense to put some sort of tag on the page itself (like the ones I see questioning notability or requests for additional citations). Something along the lines of authors missing from a particular citation and how to fix that, or no work by women cited in this article (if this is the case). It strikes me that by fixing it yourself, you are doing great work, but that maybe it also makes sense to spread awareness about these issues to the broader editing community so more people are thinking about it/doing it. At any rate, I thought I'd float the idea. Such a tag/the response (if any), could also be interesting to study, though perhaps something like this already exists and I'm just not aware of it, or perhaps there is good reason not to do it.
All best, Greg
On Tue, Aug 27, 2019 at 5:00 AM wiki-research-l-request@lists.wikimedia.org wrote:
Today's Topics:
- Re: gender balance of wikipedia citations (Greg)
- Re: gender balance of Wikipedia citations (Jane Darnell)
Message: 1 Date: Mon, 26 Aug 2019 18:56:12 -0700 From: Greg thenatureprogram@gmail.com To: Isaac Johnson isaac@wikimedia.org Cc: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] gender balance of wikipedia citations Message-ID: < CAOO9DNv92bVR2COT2XmpHDU5kJOvD0yD3bahG+6Fkuma+HYDEg@mail.gmail.com> Content-Type: text/plain; charset="UTF-8"
Thanks, Isaac and Federico. These notes and links are very helpful--and will require some time to process. As for how many years I have to work on this, I'm retired! In truth, I keep hoping that someone on this list will express interest in working on these matters. The questions are all very interesting and quite relevant. The idea of studying removed citations is both complex and compelling.
Greg
On Mon, Aug 26, 2019 at 6:49 AM Isaac Johnson isaac@wikimedia.org wrote:
Regarding data, I have not been a part of these projects but I think that I can help a bit with working links:
- The (I believe) original dataset can also be found here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
- A newer version of this dataset was produced that also included information about whether the source was openly available and its topic:
** Meta page: https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
** Figshare: https://figshare.com/articles/Accessibility_and_topics_of_citations_with_identifiers_in_Wikipedia/6819710
On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when this could be used for this inquiry--either to examine all, or a sensible subset of the citations.
If I see correctly, you still did not receive an answer on the data available.
It's true that the Figshare item for <https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia> was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
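To give a rough idea of what identifier-based extraction involves, here is a minimal sketch in Python. It is not the mwcites code and does not use its interface. The regular expression, the function name extract_dois and the sample wikitext are all assumptions made for illustration, and the pattern only approximates real DOI syntax.

```python
import re

# Simplified pattern for DOIs given as a "doi=" parameter in a citation template.
# Assumption: the DOI ends at whitespace, "|" or "}". Real DOIs can be messier.
DOI_PARAM = re.compile(r"\|\s*doi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)


def extract_dois(wikitext):
    """Return the DOIs found in doi= template parameters, in order of appearance."""
    return [match.group(1) for match in DOI_PARAM.finditer(wikitext)]


# Hypothetical wikitext, made up for this example.
sample = (
    "{{cite journal |title=Example |doi=10.1000/xyz123 |year=2015}} and "
    "{{cite journal|doi = 10.5555/abc.def}}"
)
print(extract_dois(sample))
```

A real run would apply something like this to every revision in a dump, which is the part a dedicated tool takes care of.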
Once you extract the list of works, the fun begins. You'll need to intersect with other data sources (Wikidata, ORCID, other?) and account for a number of factors until you manage to find a subset of the data which has a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by
- year of publication (some year recent enough to have good data but old enough to allow the work to be cited elsewhere, be archived after embargos);
- country or institution (some probably have better ORCID coverage);
- field/discipline and language;
- open access status (per Unpaywall);
- number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and <https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases>; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article which is not the same as a link buried among a thousand others on a big article);
- time or duration of the addition (with one of the various diff extraction libraries, content persistence data or possibly historical eventstream if such a thing is available).
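The filtering and normalising described in the list above could be sketched roughly as follows. This is only an illustration under assumptions: the CitedWork record, its fields, the helper names and the sample rows are hypothetical, and in practice the values would come from joining the extracted identifiers with sources such as Wikidata, ORCID or Unpaywall.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CitedWork:
    # Hypothetical record layout, not taken from any existing dataset.
    doi: str
    year: int
    field: str
    open_access: bool


def select_subset(works, first_year, last_year, open_access=None):
    """Keep works published in [first_year, last_year].

    If open_access is True or False, also require that open access status.
    """
    return [
        w for w in works
        if first_year <= w.year <= last_year
        and (open_access is None or w.open_access == open_access)
    ]


def count_by_field(works):
    """Count works per field, so later comparisons can be normalised by field."""
    counts = defaultdict(int)
    for w in works:
        counts[w.field] += 1
    return dict(counts)


# Made-up rows, for illustration only.
works = [
    CitedWork("10.1000/a", 2012, "nursing", True),
    CitedWork("10.1000/b", 2014, "engineering", False),
    CitedWork("10.1000/c", 1998, "engineering", True),
    CitedWork("10.1000/d", 2013, "engineering", True),
]

subset = select_subset(works, first_year=2010, last_year=2015, open_access=True)
print(count_by_field(subset))
```

Grouping by field before comparing anything else is one way to keep a discipline with many citations from dominating the overall figures.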
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is just too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (probably they'll be in the order of 10^5 rather than 10^6), remove some massive events or outliers, sample 500-1000 of them randomly and verify the required data manually.
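The sampling step could look something like the sketch below. It assumes the removal events have already been extracted into a mapping from revision id to the citations that revision removed. That mapping, the function name and the threshold for what counts as a mass removal are all hypothetical choices made for illustration.

```python
import random


def sample_removals(removals_by_revision, max_per_revision, sample_size, seed=0):
    """Drop mass-removal revisions, then draw a random sample of removal events.

    removals_by_revision maps a revision id to the list of citations removed
    in that revision. Revisions removing more than max_per_revision citations
    are treated as outliers and skipped.
    """
    events = [
        (revision, citation)
        for revision, citations in removals_by_revision.items()
        if len(citations) <= max_per_revision
        for citation in citations
    ]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(events, min(sample_size, len(events)))


# Made-up input: two ordinary edits and one mass removal.
removals = {
    101: ["10.1000/a"],
    102: ["10.1000/b", "10.1000/c"],
    103: [f"10.1000/x{i}" for i in range(50)],
}
picked = sample_removals(removals, max_per_revision=10, sample_size=2)
print(picked)
```

For the study described here, sample_size would be in the range of 500 to 1000, and each sampled event would then be checked by hand.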
As usual it will be impossible to have an objective assessment of whether that citation was really (in)appropriate in that context according to the (English or whatever) Wikipedia guidelines. To test that too, you should replicate one of the various studies of the gender imbalance of peer review, perhaps one of those which tried to assess the impact of a double blind peer review system on the gender imbalance. However, because the sources are already published, you'd need to provide the agendered information yourself and make sure the participants perform their assessment in some controlled environment where they don't have access to any gendered information (i.e. where you cut them off the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation
Message: 2 Date: Tue, 27 Aug 2019 08:00:45 +0200 From: Jane Darnell jane023@gmail.com To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations Message-ID: <CAFVcA-HqVicR0k65J4iox0PD= oc3HBPMZLfXVO5zqkFD+EnSxQ@mail.gmail.com> Content-Type: text/plain; charset="UTF-8"
Greg, Yes that's what I meant. On Wikipedia you get what you measure, so many Wikipedians are page-creators and page-hit junkies because we can measure that. The trick to motivating editors is giving them other measurements for progress. Here is the link to the Women writers Wikiproject and as you scroll down you can see what is measured. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_writers Jane
On Tue, Aug 27, 2019 at 3:39 AM Greg thenatureprogram@gmail.com wrote:
Thanks for sharing your experience and thoughts, Jane. I did not know this was happening--I'm hardly an expert, so that's not surprising, and yet it's still very troubling to hear. I'm not sure what you mean by setting up a Wikiproject. Do you mean a project on ways to study this gap--i.e., the ideas that have been floated in this thread to this point? Or are you thinking of something else?
Greg
On Mon, Aug 26, 2019 at 5:00 AM < wiki-research-l-request@lists.wikimedia.org> wrote:
Today's Topics:
- Re: gender balance of Wikipedia citations (WereSpielChequers)
- Re: gender balance of Wikipedia citations (Greg)
- Re: sockpuppets and how to find them sooner (Federico Leva (Nemo))
- Re: gender balance of Wikipedia citations (Jane Darnell)
- Re: gender balance of wikipedia citations (Federico Leva (Nemo))
Message: 1 Date: Sun, 25 Aug 2019 14:28:25 +0100 From: WereSpielChequers werespielchequers@gmail.com To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations Message-ID: <CAAanWP3qJnMpLB4tr9Eqt4EJLg2kCihkb50UY-d8= ShNONhSAA@mail.gmail.com> Content-Type: text/plain; charset="UTF-8"
Hi Greg,
One of the major step changes in the early growth of the English Wikipedia was when a bot called RamBot created stub articles on US places. I think they were cited to the census. Others have created articles on rivers in countries and various other topics by similar programmatic means. Nowadays such article creation is unlikely to get consensus on the English Wikipedia, but there are some languages which are very open to such creations and have them by the million.
I'm not sure if the fastest updating of existing articles is automated or just semiautomated. But looking at the bot requests page, it certainly looks like some people are running such maintenance bots; "updating GDP by country" is a current bot request. https://en.wikipedia.org/wiki/Wikipedia:Bot_requests.
I'm not sure how "the ease of a source for purposes of converting into a table and generating a separate article for each row" relates to gender. But I suspect "number of times cited in wikipedia" deserves less kudos than "number of times cited in academia".
WSC
On Sun, 25 Aug 2019 at 05:22, Greg thenatureprogram@gmail.com wrote:
Thanks again, Kerry. I am hoping that someone with access to more resources (knowledge, support, etc) than I have will look into this.
A few more thoughts/questions:
1. The link to the citation dataset from the Medium article ("What are the ten most cited sources on Wikipedia? Let’s ask the data.") is broken.
2. As far as I can tell, every named author in the top ten most cited sources on Wikipedia is male. One piece is by a working group.
3. This line from the Medium piece struck me: "Many of these publications have been cited by Wikipedians across large series of articles using powerful bots and automated tools."
Are citations being added by bots? I'm not sure that I understand that line correctly.
Greg
Message: 2 Date: Sun, 25 Aug 2019 21:16:25 -0700 From: Greg thenatureprogram@gmail.com To: wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations Message-ID: <CAOO9DNvGyfvJkzyRq60cSQi-T80mAkUa= vCPkzFbEysfGQqnVg@mail.gmail.com> Content-Type: text/plain; charset="UTF-8"
Thanks, WSC. All very interesting.
I've been thinking about Wikipedia citations less in terms of kudos and more in terms of a feedback loop. The cited sources get a significant amount of attention (1 click per 200 pageviews is the number I saw recently). When I imagine total Wikipedia traffic, that's huge. How many students are finding sources this way? How many academics? And how many of these citations are finding their way back into academic publications via this mechanism?
Assuming this is happening to some degree, the gender imbalance of the citations is also reflected. If the Wikipedia imbalance is the same as the one in academia, that's one thing; if it is better on Wikipedia than it is in academia, that's reason to celebrate; if the balance is worse, that's concerning. In fact, if the gender imbalance conforms to my fears instead of my hopes, and is magnified by the massive website traffic, I imagine it could even explain the growth in the citation disparity researchers note in their study of political science texts. (I link to that study in a previous post; it was mentioned in the Washington Post recently)
There is a very real possibility that Wikipedia is making the citation gender gap worse. I think we need to understand what is happening and take immediate action if the news is not good.
Greg
Message: 3 Date: Mon, 26 Aug 2019 10:59:07 +0300 From: "Federico Leva (Nemo)" nemowiki@gmail.com To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org, Aaron Halfaker ahalfaker@wikimedia.org, Kerry Raymond <kerry.raymond@gmail.com> Subject: Re: [Wiki-research-l] sockpuppets and how to find them sooner Message-ID: cf2734ff-d2cf-3108-691f-8ecf46125ed7@gmail.com Content-Type: text/plain; charset=utf-8; format=flowed
Please everyone avoid using jargon specific to the English Wikipedia on this cross-language and cross-wiki mailing list.
Aaron Halfaker, 23/08/19 17:36:
I think embeddings[1] would be a nice way to create a signature.
There is some discussion of acceptable user fingerprinting (presumably to be available to CheckUsers only), other than the usual over-reliance on IP addresses, in particular at <https://meta.wikimedia.org/wiki/Talk:IP_Editing:_Privacy_Enhancement_and_Abu...>.
Federico
Message: 4 Date: Mon, 26 Aug 2019 10:17:46 +0200 From: Jane Darnell jane023@gmail.com To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] gender balance of Wikipedia citations Message-ID: <CAFVcA-G87k26nBMr=-e-+C8o6eG0KQvVihH= f4M40faVNbKkqw@mail.gmail.com> Content-Type: text/plain; charset="UTF-8"
Greg, Thanks for worrying. This is a known problem and yes, Wikipedia contributes to the Gendergap in citations and no, it's not an easy fix, since it is the fault of systemic bias in academia. So fewer women are head author on scientific publications, and it is generally only the head author that gets cited on Wikipedia. This is not just a problem with written works in the field of politics. I spend most of my time working on paintings and their documented catalogs, so generally I only notice and fix this problem in art catalogs. Women are rarely the lead author mentioned. I will always add them in to descriptions when I add items for their works on Wikidata, but I cannot always find them! Sometimes I can't even create items for them because all I have is a name and a work and nothing else available online anywhere. You see this most often with women who spent entire careers working at a single institution and the institution doesn't bother to promote their work or even list them in exhibition catalogs. With luck there might be a local obituary, but not always. If you have suggestions how to set up a Wikiproject to tackle this it would be a good idea. In my onwiki experience the Women-in-Red community can be very positive in their response to gendergap-related issues for women writers. Jane
Message: 5 Date: Mon, 26 Aug 2019 11:45:09 +0300 From: "Federico Leva (Nemo)" nemowiki@gmail.com To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org, Greg thenatureprogram@gmail.com Subject: Re: [Wiki-research-l] gender balance of wikipedia citations Message-ID: 835202af-4653-641e-782e-c619458bdd7f@gmail.com Content-Type: text/plain; charset=utf-8; format=flowed
End of Wiki-research-l Digest, Vol 168, Issue 20
End of Wiki-research-l Digest, Vol 168, Issue 22