Dear Wiki Community,
My name is Mackenzie Lemieux. I am a neuroscience researcher at the Salk
Institute for Biological Studies, and I am interested in exploring biases on
Wikipedia.
My research hypothesis is that gender or ethnicity mediates the rate of
flagging and deletion of pages for women in STEM. I hope to
retrospectively analyze Wikipedia's deletion history, harvest the
biographical articles about scientists that have been created over the past
n years and then confirm the gender and ethnicity of a large sample.
It appears that we can identify deleted pages with Wikipedia's deletion log
<https://en.wikipedia.org/wiki/Wikipedia:Deletion_log>, but to actually see
the page that was deleted we need to be members of one of these Wikipedia
user groups: Administrators
Does anyone have advice on how to obtain researcher status or is there
anyone willing to collaborate who has access to the data we need?
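For the first step, the deletion log can also be read programmatically through the public MediaWiki Action API (`list=logevents` with `letype=delete`). Below is a minimal sketch that only builds the request URL and does not fetch anything; the function name and default values are illustrative assumptions. This lists which pages were deleted and when, not their content, so the user-rights question above still stands.

```python
from urllib.parse import urlencode

# English Wikipedia's Action API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def deletion_log_url(limit=50, namespace=0):
    """Build an Action API URL that lists recent deletion log entries.

    `limit` and `namespace` are illustrative defaults (namespace 0 is the
    article namespace); nothing is sent over the network here.
    """
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "delete",          # only deletion log entries
        "lenamespace": namespace,    # restrict to one namespace
        "lelimit": limit,            # number of entries per request
        "leprop": "title|timestamp|comment",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(deletion_log_url(limit=10))
```

The resulting URL can be opened in a browser or passed to any HTTP client; larger result sets would need the API's continuation mechanism, which is omitted here.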
We’re preparing for the September 2020 research newsletter and looking for
contributors. Please take a look at
https://etherpad.wikimedia.org/p/WRN202009 and add your name next to any
paper you are interested in covering. Our target publication time is 27
September 15:59 UTC. If you can't make this deadline but would like to
cover a particular paper in the subsequent issue, leave a note next to the
paper's entry below. As usual, short notes and one-paragraph reviews are
*Highlights from this month:*
- A decade of writing on Wikipedia: A comparative study of three articles
- A Taxonomy of Knowledge Gaps for Wikimedia Projects (First Draft)
- Biased Representation of Politicians in Google and Wikipedia Search?
The Joint Effect of Party Identity, Gender Identity and Elections
- Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19
- VideoCutTool - Online Video Editor Tool for Wikimedia Commons
- Mobile Recognition of Wikipedia Featured Sites using Deep Learning and
- PNEL: Pointer Network based End-To-End Entity Linking over Knowledge
- Using logical constraints to validate information in collaborative
knowledge graphs: a study of COVID-19 on Wikidata
- What if we had no Wikipedia? Domain-independent Term Extraction from a
Large News Corpus
- Wikidata on MARS
*Masssly and Tilman Bayer*
The next Research Showcase will be live-streamed on Wednesday, September
23, at 9:30 AM PDT/16:30 UTC, and will be on the theme of knowledge gaps.
Miriam Redi will give an overview of the first draft of the taxonomy of
knowledge gaps in Wikimedia projects. The taxonomy is a first milestone
towards developing a framework to understand and measure knowledge gaps
with the goal of capturing the multi-dimensional aspect of knowledge gaps
and informing long-term decision making.
YouTube stream: https://www.youtube.com/watch?v=GJDsKPsz64o
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
This month's presentation:
A first draft of the knowledge gaps taxonomy for Wikimedia projects
By the Wikimedia Foundation Research Team <https://research.wikimedia.org/>
In response to Wikimedia Movement’s 2030 strategic direction, the Research
team <https://research.wikimedia.org/team.html> at the Wikimedia Foundation
is developing a framework to understand and measure knowledge gaps. The
goal is to capture the multi-dimensional aspect of knowledge gaps and
inform long-term decision making. The first milestone was to develop a
taxonomy of knowledge gaps which offers a grouping and descriptions of the
different Wikimedia knowledge gaps. The first draft of the taxonomy is now
published <https://arxiv.org/abs/2008.12314> and we seek your feedback to
improve it. In this talk, we will give an overview of the first draft of
the taxonomy of knowledge gaps in Wikimedia projects. Following that, we
will host an extended Q&A in which we would like to get your feedback and
discuss the taxonomy and knowledge gaps more generally with you.
- More information:
Janna Layton (she/her)
Administrative Associate - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Thanks for your questions.
Imagine you are a fan of Mollywood (a Hollywood-inspired nickname for Malayalam cinema) and you want to improve the article about the following movie: https://en.wikipedia.org/wiki/Aarohanam_(1980_film)
You just watched the movie and you want to tell the world about its plot, so you create a new section named "Plot". You have an idea about which Wikipedia articles to link from the section, but to get the whole picture you ask the tool for recommendations. You query the model (the format of a query is (article, section name, type of entities to suggest)) to get the following information.
- Where does the plot of the movie take place?
This translates into the query Query(Aarohanam_(1980_film), "Plot", Location)
Recommendations: (Kerala, Puducherry, India, Malabar Coast)
- Which topics are addressed in the plot of the movie?
Query(Aarohanam_(1980_film), "Plot", TopicalConcept)
Recommendations: (Bipolar disorder, Poverty)
Given these recommendations and your previous knowledge you are ready to start editing the section!
You also know this movie had an impact on Malayalam culture, so you decide to add a new section named "Impact on the Malayalam society". As before, you want recommendations from the model before editing. Now you query the tool with
Query(Aarohanam_(1980_film), "Impact on the Malayalam society", Person)
Query(Aarohanam_(1980_film), "Impact on the Malayalam society", Event)
The model provides some suggestions for these queries. The suggestions will be of type Person and Event, respectively. With these suggestions and your previous knowledge, you are ready to start editing.
To sum up, the tool suggests Wikipedia entities to link from a section you are about to edit for the first time. You can also use the tool when the section already exists and contains some text and links. In this case, you can check whether the section is missing an important entity that the tool recommends.
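The query format and the two use cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the list of entity types is partial, and a lookup table filled with the recommendations from the example above stands in for the model, which has not been built yet.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    """A few of the ~20 entity types; the full list is not shown here."""
    LOCATION = "Location"
    TOPICAL_CONCEPT = "TopicalConcept"
    PERSON = "Person"
    EVENT = "Event"


@dataclass(frozen=True)
class Query:
    """(article, section name, type of entities to suggest)."""
    article: str
    section: str
    entity_type: EntityType


# Stand-in for the model: canned answers copied from the example above.
CANNED_RECOMMENDATIONS = {
    Query("Aarohanam_(1980_film)", "Plot", EntityType.LOCATION):
        ["Kerala", "Puducherry", "India", "Malabar Coast"],
    Query("Aarohanam_(1980_film)", "Plot", EntityType.TOPICAL_CONCEPT):
        ["Bipolar disorder", "Poverty"],
}


def recommend(query):
    """Return recommended entities for a query (empty if none are known)."""
    return CANNED_RECOMMENDATIONS.get(query, [])


def missing_entities(query, existing_links):
    """Return recommended entities that the section does not link to yet."""
    existing = set(existing_links)
    return [entity for entity in recommend(query) if entity not in existing]


if __name__ == "__main__":
    q = Query("Aarohanam_(1980_film)", "Plot", EntityType.LOCATION)
    print(recommend(q))                            # new, empty section
    print(missing_entities(q, ["Kerala", "India"]))  # section with some links
```

The first call corresponds to editing a brand-new section; the second corresponds to checking an existing section for recommended entities it does not yet link to.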
However, one of our concerns relates to the requirement of specifying the type of entity (Person, Event, TopicalConcept) for which the editor wants recommendations. We are wondering if this requirement is limiting or not. Note that as indicated in our original post, the total number of entity types is in a manageable range (~20), and can be presented in a visual manner (using a dropdown list) to the editor.
We are researchers from the dlab at EPFL working with Bob West.
We have plans to build a graph-based ML algorithm, which will further facilitate development of a tool to assist Wikipedia editors by providing recommendations on two novel use-cases. One consists of suggesting hyperlinks (Wikipedia articles) to be inserted within a section of an article. Note that this is different from "classical link prediction".
We feel the tool could be of great value, as it can work with newly created sections that do not have any content yet. What's more, the editor can type *any* section name (either non-existent in that article or even in the whole Wiki project) and the tool would have the power to suggest hyperlinks that are likely to be of interest for that section in the article. We think that stub articles, especially, can benefit from this tool.
However, we have one assumption. In addition to the section name, the editor must provide the "entity type" (Place, People, Date, Organization...) of the Wikipedia articles she would like to insert in the section. The reason is that within a section you can find links to articles of diverse types.
The reason we are reaching out to you is twofold:
(1) To check whether such a tool would be of interest and likely to be used by the editors.
(2) To learn how limiting the assumption is that the editor needs to specify the entity type of the Wikipedia articles for which she needs recommendations from the tool.
On the one hand, some of us think this is not a problem, as the number of entity types is relatively small (between 10 and 20) and they can be easily and visually presented to the editor with a dropdown list. On the other hand, others think this requirement is limiting.
We would like to know your opinion to decide whether we should move forward with this project.
I’m wondering if any large-scale surveys have been done that ask Wikipedia editors about their race, ethnicity, or religion?
Also, have any researchers considered asking these questions in editor surveys, but chosen not to ask them for particular reasons?
The Research team at the Wikimedia Foundation has officially started a
new Formal Collaboration with Djellel Difallah (NYU Abu Dhabi) to work
collaboratively on sockpuppet detection as part of the Improve
Knowledge Integrity program and link recommendation as part of the
Address Knowledge Gaps program. You may recognize Djellel as a former
member of the Research team and we are glad to be able to continue to
collaborate with him as he rejoins academia!
Here are a few pieces of information about this collaboration that we would
like to share with you:
* We aim to keep the research documentation for these projects in the
corresponding research page on meta (sockpuppet detection) and
phabricator ticket (link recommendation).
* We are thankful to Djellel for agreeing to spend his time and expertise
on these projects in the coming year, and to those of you who have worked
with us to improve these models.
* I will act as the point of contact for the sockpuppet detection research
and Martin Gerlach (cc'ed) will act as the point of contact for the link
recommendation research in the Wikimedia Foundation. Please feel free to
reach out to one of us (directly, if it cannot be shared publicly) if you
have comments or questions about a specific project.
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
So far, Wikipedia's full revision history has been available only in wiki
markup, not in HTML -- a big limitation for researchers. We are changing
this by releasing WikiHist.html, Wikipedia's full history (up until March
2019) in HTML:
Caveat emptor: 7 TB!
Wikipedia is written in the wikitext markup language. When serving content,
the MediaWiki software that powers Wikipedia parses wikitext to HTML,
thereby inserting additional content by expanding macros (templates and
modules). Hence, researchers who intend to analyze Wikipedia as seen by its
readers should work with HTML, rather than wikitext. Since Wikipedia’s
revision history is made publicly available by the Wikimedia Foundation
exclusively in wikitext format, researchers have had to produce HTML
themselves, typically by using Wikipedia’s REST API for ad-hoc
wikitext-to-HTML parsing. This approach, however, (1) does not scale to
very large amounts of data and (2) does not correctly expand macros in
historical article revisions.
We have solved these problems by developing a parallelized architecture for
parsing massive amounts of wikitext using local instances of MediaWiki,
enhanced with the capacity to expand macros correctly in historical revisions. By
deploying our system, we produce and hereby release WikiHist.html, English
Wikipedia’s full revision history in HTML format. It comprises the HTML
content of 580M revisions of 5.8M articles generated from the full English
Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019.
Boilerplate content such as page headers, footers, and navigation sidebars
is not included in the HTML.
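To give a feel for the scale before downloading, here is a back-of-the-envelope calculation from the figures quoted above. It is a rough sketch only: the 7 TB figure is taken at face value as decimal terabytes, and this message does not say whether it refers to compressed or uncompressed size.

```python
# Figures quoted in this announcement.
REVISIONS = 580_000_000      # 580M revisions
ARTICLES = 5_800_000         # 5.8M articles
DATASET_BYTES = 7 * 10**12   # 7 TB, assuming decimal terabytes

# Average number of revisions per article.
revisions_per_article = REVISIONS / ARTICLES

# Average stored size per revision.
bytes_per_revision = DATASET_BYTES / REVISIONS

print(f"{revisions_per_article:.0f} revisions per article on average")
print(f"{bytes_per_revision / 1000:.1f} kB per revision on average")
```

These are averages only; the revision counts of individual articles vary widely.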
For more details, please refer to https://zenodo.org/record/3605388
and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English
Wikipedia’s Full Revision History in HTML Format. In *Proceedings of the
14th International AAAI Conference on Web and Social Media,* 2020.