If you use Hive on stat1002/1004, you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper set up to make this easier. The old Hive
CLI will continue to exist, but we encourage moving over to Beeline. You
can use it by logging into the stat1002/1004 boxes as usual and launching
`beeline`.
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering "stat1004, what's that?", there should be an
announcement coming up about it soon!)
Curious, what percentage of digital assistants (Alexa, Siri, Cortana,
Google) cite Wikipedia when a person asks a question?
Does the current Wikipedia mobile app support voice search?
Are there any reports on this? Thanks in advance!
Stella Yu | STELLARESULTS | 415 690 7827
"Chronicling heritage brands and legendary people."
We are excited to announce that the 5th annual Wiki Workshop will take
place in Lyon on April 24, 2018, as part of The Web Conference 2018
(a.k.a. WWW2018).
You can access the call for papers at
http://wikiworkshop.org/2018/#call. Please submit your ongoing or
completed research related to Wikimedia projects to the workshop. Note
that 2018-01-28 is the submission deadline if you want your paper to
appear in the proceedings, and 2018-03-11 is the deadline for all other
papers.
Following the past year's model, the workshop will have a set of
invited talks (Jon Kleinberg and Markus Kroetzsch have already
accepted our invitation \o/), a poster session, and more.
Questions and comments are welcome. Otherwise, we're looking forward
to receiving your submissions and seeing you in Lyon in April. :)
Leila, on behalf of the organizers 
Senior Research Scientist
At the moment I am writing about Wikipedia's rules with regard to
research. Some researchers are interested in Wikipedia talk pages, others
want to interview Wikipedians, and others again run "experiments" within
the wiki in order to observe Wikipedians' reactions.
Researchers try to stick to some general ethics such as respecting
anonymity and not causing harm.
To my knowledge, only Wikipedia in English has some specific lines about
research in its set of rules (e.g. with the advice to disclose research
interests on a user page). Do you know about research-related rules in
other language versions?
The next Research Showcase will be live-streamed this Wednesday, February
21, 2018, at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=fpmRWCE7F_I
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
This month's presentation:
*Visual enrichment of collaborative knowledge bases*
By Miriam Redi, Wikimedia Foundation
Images allow us to explain, enrich, and complement knowledge without
language barriers. They can help illustrate the content of an item in a
language-agnostic way to external data consumers. Images can be extremely
helpful in multilingual collaborative knowledge bases such as Wikidata.
However, a large proportion of Wikidata items lack images. More than 3.6M
Wikidata items are about humans (Q5), but only 17% of them have an
associated image. Overall, only 2.2M of 40 million Wikidata items have an
image.
A wider presence of images in such a rich, cross-lingual repository could
enable a more complete representation of human knowledge.
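For a rough sense of scale, here is a back-of-the-envelope calculation
using only the figures quoted in the abstract above (nothing else is
assumed):

```python
# Back-of-the-envelope on the coverage figures quoted in the abstract.
humans_total = 3_600_000        # Wikidata items about humans (Q5)
humans_with_image_pct = 0.17    # share of human items with an image
items_with_image = 2_200_000    # Wikidata items with an image, overall
items_total = 40_000_000        # total Wikidata items

humans_with_image = round(humans_total * humans_with_image_pct)
humans_missing_image = humans_total - humans_with_image
overall_coverage_pct = 100 * items_with_image / items_total

print(f"Humans with an image:    ~{humans_with_image:,}")      # ~612,000
print(f"Humans missing an image: ~{humans_missing_image:,}")   # ~2,988,000
print(f"Overall image coverage:  {overall_coverage_pct:.1f}%") # 5.5%
```

So roughly three million human items alone lack an image, and overall
coverage sits at about one item in eighteen.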
In this talk, we will discuss challenges and opportunities faced when using
machine learning and computer vision tools for the visual enrichment of
collaborative knowledge bases. We will share research to help Wikidata
contributors make Wikidata more “visual” by recommending high-quality
Commons images to Wikidata items. We will show the first results on
free-licence image quality scoring and recommendation and discuss future
work in this direction.
*Backlogs—backlogs everywhere: Using machine classification to clean up the
new page backlog*
By Aaron Halfaker, Wikimedia Foundation
If there's one insight that I've had about the functioning of Wikipedia and
other wiki-based online communities, it's that eventually self-directed
work breaks down and some form of organization becomes important for task
routing. In Wikipedia specifically, the notion of "backlogs" has become
dominant. There are backlogs of articles to create, articles to clean up,
articles to assess, new editor contributions to review, manual of style
rules to apply, etc. To a community of people working on a backlog, the
state of that backlog has deep effects on their emotional well-being. A
backlog that only grows is frustrating and exhausting.
Backlogs aren't inevitable, though, and they can take many shapes. In my
presentation, I'll tell a story about how English Wikipedia editors
defined a process and set of roles that formed a backlog
around new page creations. I'll make the argument that this formalization
of quality control practices has created a choke point and that
alternatives exist. Finally, I'll present a vision for such an alternative
using models that we have developed for ORES, the open machine prediction
service my team maintains.
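As a concrete illustration, scores from a service like ORES could be used
to split one monolithic review backlog into smaller, prioritized queues.
The sketch below is hypothetical: the v3 scores endpoint shape follows
ORES's public API, but the routing rule and queue names are illustrative
assumptions, not the actual design presented in the talk.

```python
# Sketch: use ORES model predictions to triage new pages into queues.
# The endpoint URL shape follows ORES's documented v3 API; the routing
# rules and queue names below are invented for illustration only.

ORES_BASE = "https://ores.wikimedia.org/v3/scores"

def ores_scores_url(context, rev_ids, models):
    """Build a request URL for ORES scores on a batch of revisions."""
    return (f"{ORES_BASE}/{context}/"
            f"?models={'|'.join(models)}"
            f"&revids={'|'.join(str(r) for r in rev_ids)}")

def route_new_page(draftquality_prediction):
    """Send a new page to a queue based on the draftquality model's class.

    The draftquality model predicts one of: OK, spam, vandalism, attack.
    """
    if draftquality_prediction == "OK":
        return "low-priority-queue"   # likely fine; spot-check later
    return "urgent-review-queue"      # needs a human reviewer soon

url = ores_scores_url("enwiki", [123456, 789012], ["draftquality"])
```

The point of the sketch is the triage shape: rather than one
first-in-first-out backlog, machine predictions let reviewers spend their
attention where it is most needed.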
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
Having a look at the new WMF research site, I noticed that notification
and recommendation mechanisms seem to be the key strategy being pursued
for filling Wikipedia's content gaps. Having just finished a research
project on exactly this problem, and having come to the opposite
conclusion, i.e. that automated mechanisms were insufficient for solving
the gaps problem, I was curious to find out more.
This latest research that I was involved in with colleagues was based on
an action research project aiming to fill gaps in topics relating to
South Africa. The team tried a range of different strategies discussed in
the literature for filling Wikipedia's gaps, without any wild success.
Automated mechanisms that featured missing and incomplete articles
catalysed very few contributions.
When looking for related research, it seemed that others had come to a
similar conclusion, i.e. that automated notifications/recommendations
alone didn't lead to improvements in particular target areas. That makes
me think that either a) I just haven't come across the right research, or
b) there are different types of gaps, and those different types require
different solutions. For example, filling gaps across language versions,
filling gaps created by incomplete articles, filling gaps on topics for
which there are few online/reliable sources (as opposed to topics for
which there are many), and filling gaps in articles about particular
topics or geographic areas may each call for a different approach.
Does anyone have any insight here? Either on research that would help
practitioners decide how to go about a project of filling gaps in a
particular subject area, or on whether the key focus of research at the
WMF is on filling gaps via automated means such as recommendation and
notification systems?
We would like to announce a research project with the goal of studying
whether user interactions recorded at the time of editing are suitable to
predict vandalism in real time.
Should vandal editing behavior be sufficiently different from normal
editing behavior, this would allow for a number of interesting real-time
prevention techniques. For example:
- withholding confidently suspicious edits for review before publishing
them,
- a popup asking "I am not a vandal" (as in Google's "I am not a robot") to
analyze vandal reactions,
- a popup with a chat box to personally engage vandals, e.g., to help them
find other ways of stress relief or to understand them better,
- or at the very least: a new signal to improve traditional vandalism
detectors.
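The first idea above, withholding confidently suspicious edits, could
look roughly like the following sketch. Everything here is invented for
illustration: the feature names, weights, and threshold are assumptions,
not signals or values from our project.

```python
# Hypothetical sketch: gate an edit on a real-time vandalism score computed
# from interaction signals recorded while editing. Feature names, weights,
# and the threshold are all invented for illustration.

def vandalism_score(features):
    """Combine a few behavioral signals into a score in [0, 1]."""
    weights = {
        "chars_deleted_ratio": 0.5,   # share of the page deleted
        "profanity_hits": 0.3,        # profane terms typed (normalized 0-1)
        "typing_burstiness": 0.2,     # erratic input timing (normalized 0-1)
    }
    score = sum(weights[k] * features.get(k, 0.0) for k in weights)
    return min(max(score, 0.0), 1.0)

def handle_edit(features, threshold=0.7):
    """Withhold confidently suspicious edits for review before publishing."""
    if vandalism_score(features) >= threshold:
        return "withhold-for-review"
    return "publish"
```

In a live setting, the features would come from the interaction log
recorded during editing, and the score would gate the save action.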
We have set up a laboratory environment to study editor behavior in a
realistic setting using a private mirror of Wikipedia. No editing
whatsoever is conducted on the real Wikipedia as part of our experiments,
and all test subjects of our user studies are made aware of the
experimental nature of their editing. We plan on making use of
crowdsourcing as a means to attain scale and diversity.
If you wish to participate in this study as a test subject yourself, please
get in touch. The more diversity, the more insightful the results will be.
We are also happy to collaborate and to answer all questions that may arise
in relation to the project. For example, our setup and tooling may turn out
to be useful to study other user behavior-related things without having to
actually deploy experiments within the live MediaWiki.
PS: The AICaptcha project seems most closely related. @Vinitha and Gergő:
if you wish, we can set up a Skype meeting to talk about avenues for
collaboration.
 A group of students and researchers from Bauhaus-Universität Weimar (
www.webis.de) and Leipzig University (www.temir.org); project PI: Martin
The Hadoop cluster maintenance (upgrade to Java 8) was planned to happen
earlier today but is finally happening now.
It will require a complete shutdown and should not last longer than a
couple of hours (expected to be less than one).
Joseph on behalf of the Analytics-Team
Hi Analytics folks,
*TL;DR: Hadoop cluster maintenance postponed to Tue 13th February*
We've experienced an issue getting some data onto the cluster this
month, meaning that some of our monthly datasets (the ones that depend on
that late data) have not yet been computed.
We have decided to postpone the cluster maintenance to next week, to
allow those jobs to finish.
We are very sorry about the short notice and will send another email the
day before maintenance.
Joseph Allemandou on behalf of the Analytics-Team
Data Engineer @ Wikimedia Foundation