Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input about the proposal,
and would welcome additional help improving it.
Also, please consider participating in WikiProject Research at:
University of Minnesota
We’re preparing for the July 2019 research newsletter and looking for contributors. Please take a look at https://etherpad.wikimedia.org/p/WRN201907 and add your name next to any paper you are interested in covering. If you have time over this weekend, note that the writing deadline is already Monday, July 30. If you can't make this deadline but would like to cover a particular paper in the subsequent issue, leave a note next to the paper's entry below. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
- Revealing the Role of User Moods in Struggling Search Tasks
- Building a Knowledge Graph for Recommending Experts
- Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability
- Anomaly Detection in the Dynamics of Web and Social Networks Using Associative Memory
- Openness, Inclusion And Self-Affirmation: Indigenous Knowledge In Open Knowledge Projects
- The Quality and Readability of English Wikipedia Anatomy Articles
- Different Topic, Different Traffic: How Search and Navigation Interplay on Wikipedia
- Uncovering the Semantics of Wikipedia Categories
- Adapting NMT to caption translation in Wikimedia Commons for low-resource languages
- Discovering Implicational Knowledge in Wikidata
Mohammed S. Abdulai and Tilman Bayer
 Research:Newsletter - Meta WikiResearch (@WikiResearch) on Twitter
Second Call for Workshops and Tutorials, ECIR 2020 in Lisbon
Deadlines: 1 September (Workshops) & 15 November (Tutorials)
ECIR 2020 - 42nd European Conference on Information Retrieval
Call for Workshops and Tutorials
Lisbon, Portugal - April 14-17, 2020
The purpose of workshops is to provide a platform for presenting novel
ideas and research results in a focused and more interactive way.
Workshops can be of either a half-day (3 hours plus breaks) or a full
day (6 hours plus breaks). Workshops are encouraged to be as dynamic
and interactive as possible and should lead to a concrete outcome,
such as the publication of a summary paper and/or workshop
proceedings. The information required for a workshop proposal is on
the conference website. Workshop proposals will be reviewed by the
workshop committee. A summary paper of the workshop will be published
in the conference proceedings.
Please find more information at
1 September 2019 – Workshop submission
1 October 2019 – Workshop notification
14 April 2020 – Workshops and Tutorials
Tutorials inform the community on recent advances in core IR research,
related research, or on novel application areas related to IR. They
may focus on specific problems or specific domains in which IR
research may be applied. Tutorials can be either half-day (3 hours plus
breaks) or full-day (6 hours plus breaks). Tutorials are
encouraged to be as interactive as possible. Please follow tutorial
proposal instructions on the conference website (link below). Tutorial
proposals will be reviewed by the tutorial committee. A summary of the
tutorial will be published in the conference proceedings.
Further information can be found at
15 November 2019 – Tutorial submission
15 December 2019 – Tutorial notification
14 April 2020 – Workshops and Tutorials
Hope to see you in Lisbon!
The next Research Showcase will be live-streamed next Wednesday, July 17,
at 11:30 AM PDT/18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=i9vvwV5KfW4
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
This month's presentations:
Characterizing Incivility on Wikipedia
Elizabeth Whittaker, University of Michigan School of Information
In a society whose citizens have a variety of viewpoints, there is a
question of how citizens can govern themselves in ways that allow these
viewpoints to co-exist. Online deliberation has been posited as a
problem-solving mechanism in this context, and civility can be thought of as a
mechanism that facilitates this deliberation. Civility can thus be thought
of as a method of interaction that encourages collaboration, while
incivility disrupts collaboration. However, it is important to note that
the nature of online civility is shaped by its history and the technical
architecture scaffolding it. Civility as a concept has been used both to
promote equal deliberation and to exclude the marginalized from
deliberation, so we should be careful to ensure that our conceptualizations
of incivility reflect what we intend them to in order to avoid
unintentionally reinforcing inequality.
To this end, we examined Wikipedia editors’ perceptions of interactions
that disrupt collaboration through 15 semi-structured interviews. Wikipedia
is a highly deliberative platform, as editors need to reach consensus about
what will appear on the article page, a process that often involves
deliberation to coordinate, and any disruption to this process should be
apparent. We found that incivility on Wikipedia typically occurs in one of
three ways: weaponization of Wikipedia's policies, weaponization of its
technical features, and more typical vitriolic content.
These methods of incivility were gendered, and had the practical effect of
discouraging women from editing. We implicate this pattern as one of the
underlying causes of Wikipedia’s gender gap.
Hidden Gems in the Wikipedia Discussions: The Wikipedians’ Rationales
Lu Xiao, Syracuse University School of Information Studies
I will present a series of completed and ongoing studies that are aimed at
understanding the role of the Wikipedians’ rationales in Wikipedia
discussions. We define a rationale as a person's justification of their
viewpoint and suggestions. Our studies demonstrate the potential of
leveraging the
Wikipedians’ rationales in discussions as resources for future
decision-making and as resources for eliciting knowledge about the
community’s norms, practices and policies. Viewed as rich digital traces in
these environments, we consider them to be beneficial for the community
members, such as helping newcomers familiarize themselves with the commonly
accepted justificatory reasoning styles. We call for more research
attention to the discussion content from this rationale study perspective.
Janna Layton (she, her)
Administrative Assistant - Audiences & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
A recent research article of mine, Artificial Wisdom, pertains to search engines and recommender systems for: anecdotes, proverbs, quotations, lyrics, poetry, narratives (e.g. parables, allegories), and humor. I refer to these materials as wisdom materials.
The article describes a kind of social media bot with an integrated recommender system for wisdom materials, and presents wiki technology as a reasonable option to consider for producing and maintaining dynamic corpora of wisdom materials to be utilized by such systems.
Social Media Bots with Integrated Recommender Systems
Wisdom materials could be provided to users daily, as in quote-of-the-day services, or on an as-needed basis, whenever a social media bot determines that a wisdom material could provide value to a user. Recommended wisdom materials could be sent to users via a number of possible channels (e.g. social media websites, instant messenger applications, e-mail).
To ensure that recommended wisdom materials provide value to users, recommender systems could utilize users’ social media posts and other content to obtain context data. Systems could process users’ posts and content, for example to estimate users’ affect or complex mental states. Should a user indicate that they are sad, whether in a status update or by otherwise sharing that information in a post with their friends and a social media bot, the bot could send one or more wisdom materials to uplift the user.
Recommender systems for wisdom materials, operating at scale, could learn and improve over time. Users could be provided with a number of options for giving feedback on wisdom material recommendations. Systems could also process users’ social media posts and content before, during, and after encounters with recommended wisdom materials: did a recommended wisdom material provide value to (e.g. uplift) a user in a given context?
Utilizing technology such as Solid, social media bots with integrated recommender systems can provide value to users while simultaneously protecting users’ data privacy.
Wiki technology is a reasonable option to consider for producing and maintaining dynamic corpora of wisdom materials. Such Wiki systems could be utilized by organizations stewarding social media bots, by larger numbers of users, or by the general public.
Some varieties of wisdom materials tend to make use of figurative language. Accordingly, the indexing of such materials shouldn’t necessarily utilize the literal text of the items. For instance, the proverb “a rolling stone gathers no moss” is not best indexed and searched for by its lexemes. One desires to search, somehow, for the contents of the interpretations of the items rather than for the literal text of the items. Contextual searching is also desirable. Varieties of wisdom materials should be modeled; a wisdom material, for instance, could have multiple interpretations and each interpretation could have text content for indexing, keywords, categories, and so forth.
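As a concrete sketch of this interpretation-based indexing, the data model and the keyword-overlap scoring below are illustrative assumptions, not anything specified in the article:

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    text: str                      # paraphrase of one reading of the item
    keywords: list = field(default_factory=list)

@dataclass
class WisdomMaterial:
    content: str                   # the literal text of the item
    interpretations: list = field(default_factory=list)

def search(corpus, query_terms):
    """Score items by matching query terms against their interpretations,
    not against the literal text of the item itself."""
    scored = []
    for item in corpus:
        score = sum(
            1
            for interp in item.interpretations
            for term in query_terms
            if term in interp.keywords or term in interp.text.lower()
        )
        if score > 0:
            scored.append((score, item))
    return [item for score, item in sorted(scored, key=lambda pair: -pair[0])]

proverb = WisdomMaterial(
    content="A rolling stone gathers no moss",
    interpretations=[
        Interpretation(
            text="People who keep moving avoid responsibilities and attachments.",
            keywords=["restlessness", "commitment"],
        ),
        Interpretation(
            text="Staying active keeps a person free of stagnation.",
            keywords=["activity", "stagnation"],
        ),
    ],
)

# A query about "stagnation" finds the proverb even though that word
# never appears in its literal text.
hits = search([proverb], ["stagnation"])
```

A production system would more plausibly use embedding similarity than keyword overlap, but the modeling point stays the same: index interpretations, not lexemes.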
The modeling of wisdom materials, wiki-based technology for producing and maintaining dynamic corpora of wisdom materials, and social media bots with integrated recommender systems for wisdom materials are all contemporary research topics.
Hi Leila and Kate,
adding a few words after Nuria's email to clarify my original intentions.
My point was that any important and vital file that needs to be preserved
may be stored in HDFS rather than on stat/notebooks due to the absence of
backups of the home directories. My concern was that people had a different
understanding about backups and I wanted to clarify.
We (as the Analytics team) don't have any good way at the moment to
periodically scan HDFS and home directories across hosts to find PII data
that is retained for longer than the allowed period of time. The main reason
is that we'd need to find a way to check a huge number of files, with
different names and formats, and figure out whether the data contained in
them is PII and has been retained for more than X days. It is not an
impossible task, but it is not easy or trivial either; in my opinion we'd
need a lot more staff to create and maintain something like that :) We
recently started cleaning up old home directories (i.e. those belonging to
users who are no longer active) and we established a process with SRE to
get pinged when a user is offboarded, to verify what data should be kept
and what should not (I know that both of you are aware of this since you
have been working with us on several tasks; I am writing it here to give
other people the context :)). This is only a starting point, and I really
hope to have something more robust and complete in the future. In the
meantime, I'd say that every user is responsible for the data that they
handle on the Analytics infrastructure, periodically reviewing it and
deleting it when necessary. I don't have a specific guideline/process to
suggest, but we can definitely have a chat together and decide on something
shared among our teams!
Let me know if this makes sense or not :)
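To illustrate why age-based scanning is only the easy half of the problem described above, here is a minimal sketch; the 90-day window and the file-listing format are assumptions for illustration, not an actual policy:

```python
from datetime import datetime, timezone, timedelta

RETENTION_DAYS = 90  # assumed window, not a real policy value

def overdue(listing, now=None):
    """Return paths retained longer than RETENTION_DAYS.

    listing: iterable of (path, mtime) pairs with timezone-aware mtimes.
    Note this says nothing about whether a file actually contains PII,
    which is the genuinely hard part.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return sorted(path for path, mtime in listing if mtime < cutoff)

now = datetime(2019, 7, 15, tzinfo=timezone.utc)
listing = [
    ("/home/alice/clicks.tsv", datetime(2019, 1, 1, tzinfo=timezone.utc)),
    ("/home/bob/notes.txt", datetime(2019, 7, 1, tzinfo=timezone.utc)),
]
stale = overdue(listing, now=now)
```

Deciding whether a flagged file actually contains PII still requires inspecting contents in arbitrary names and formats, which is exactly the staffing problem described above.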
On Wed, Jul 10, 2019 at 11:15 PM Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> >I have one question for you: As you allow/encourage for more copies of
> >the files to exist
> To be extra clear, we do not encourage data to be on the notebook hosts
> at all; they have no capacity to either process or host large amounts of
> data. The data you are working with is best placed in your
> /user/your-username database in Hadoop, so far from encouraging multiple
> copies we are rather encouraging you to keep the data outside the notebook
> hosts.
> On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <kzimmerman(a)wikimedia.org>
>> I second Leila's question. The issue of how we flag PII data and ensure
>> it's appropriately scrubbed came up in our team meeting yesterday. We're
>> discussing team practices for data/project backups tomorrow and plan to
>> come out with some proposals, at least for the short term.
>> Are there any existing processes or guidelines I should be aware of?
>> Kate Zimmerman (she/they)
>> Head of Product Analytics
>> Wikimedia Foundation
>> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <leila(a)wikimedia.org> wrote:
>>> Hi Luca,
>>> Thanks for the heads up. Isaac is coordinating a response from the
>>> Research side.
>>> I have one question for you: As you allow/encourage for more copies of
>>> the files to exist, what is the mechanism you'd like to put in place
>>> for reducing the chances of PII to be copied in new folders that then
>>> will be even harder (for your team) to keep track of? Having an
>>> explicit process/understanding about this will be very helpful.
>>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <ltoscano(a)wikimedia.org>
>>> > Hi everybody,
>>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
>>> > thought to reach out to everybody to make it clear that all the home
>>> > directories on the stat/notebook nodes are not backed up periodically.
>>> > The hosts run on a software RAID configuration spanning multiple disks,
>>> > of course, so we are resilient to a disk failure, but, even if unlikely,
>>> > it might happen that a host could lose all its data. Please keep this in
>>> > mind when working on important projects and/or handling important data
>>> > that you care about.
>>> > I just added a warning to
>>> > If you have really important data that is too big to back up, keep in
>>> > mind that you can use your home directory (/user/your-username) on HDFS
>>> > (it replicates data three times across multiple nodes).
>>> > Please let us know if you have comments/suggestions/etc.. in the
>>> > aforementioned task.
>>> > Thanks in advance!
>>> > Luca (on behalf of the Analytics team)
>>> > _______________________________________________
>>> > Wiki-research-l mailing list
>>> > Wiki-research-l(a)lists.wikimedia.org
>>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
due to https://phabricator.wikimedia.org/T227941 we'd need to take down
Oozie/Hive/etc. on an-coord1001. The maintenance should not last long, but
if you have any issues please reach out to us on IRC (#wikimedia-analytics).
Luca (on behalf of the Analytics team)
Some time ago I'm sure I remember reading something about a study that
looked at the link between the number of times a reference was used on
Wikipedia and the number of times the source was referenced in journal
articles. Does anyone know what I'm talking about, or have something similar?
Thanks very much
[You can safely skip this message if you have already seen it in the
Wikidata mailing list, and pardon for the spam]
TL;DR: soweego version 1 will be released soon. In the meantime, why
don't you consider endorsing the next steps?
This is a pre-release notification for early feedback.
Does the name *soweego* ring a bell?
It is a machine learning-based pipeline that links Wikidata to large
catalogs. It is a close friend of Mix'n'match, which mainly caters for small
catalogs.
The first version is almost done and will start uploading results soon.
Confident links are going to feed Wikidata via a bot, while others
will get into Mix'n'match for curation.
The next short-term steps are detailed in a rapid grant proposal,
and I would be really grateful if you could consider an endorsement there.
The soweego team has also tried its best to address the following points:
1. plan a sync mechanism between Wikidata and large catalogs / implement
checks against external catalogs to find mismatches in Wikidata;
2. enable users to add links to new catalogs in a reasonable time.
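The confident-vs-uncertain split mentioned above (bot uploads vs Mix'n'match curation) can be sketched as a simple threshold router; the thresholds and the string-similarity scorer here are illustrative assumptions, not soweego's actual pipeline:

```python
from difflib import SequenceMatcher

CONFIDENT = 0.9  # assumed threshold for direct bot upload
UNCERTAIN = 0.5  # assumed floor below which candidates are dropped

def route_links(candidates):
    """Split (wikidata_label, catalog_label) pairs into confident links
    (bot upload) and uncertain links (manual curation in Mix'n'match)."""
    bot_upload, curation = [], []
    for wikidata_label, catalog_label in candidates:
        score = SequenceMatcher(
            None, wikidata_label.lower(), catalog_label.lower()
        ).ratio()
        if score >= CONFIDENT:
            bot_upload.append((wikidata_label, catalog_label, score))
        elif score >= UNCERTAIN:
            curation.append((wikidata_label, catalog_label, score))
    return bot_upload, curation

bot_ok, needs_review = route_links([
    ("Douglas Adams", "Douglas Adams"),  # exact match -> bot
    ("Douglas Adams", "D. Adams"),       # partial match -> curation
    ("Douglas Adams", "Madonna"),        # poor match -> dropped
])
```

A real linker would score on many features (dates, identifiers, statements) rather than label similarity alone, but the routing logic is the part this sketch illustrates.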
So, here is the most valuable contribution you can give to the project
right now: understand how to *import a new catalog*.
Can't wait for your reactions.