For Hive users on stat1002/1004: you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper set up to make this easier. The old Hive
CLI will continue to exist, but we encourage moving over to Beeline. You
can use it by logging into the stat1002/1004 boxes as usual and
launching `beeline`.
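For illustration, a wrapper of this kind could look roughly like the following Python sketch; the hostname and port in the connection string are placeholders, not the actual cluster configuration:

```python
# Hypothetical sketch of what such a beeline wrapper might do: inject a
# default JDBC connection string so users no longer have to type one.
# The hostname and port below are made-up placeholders, not the real
# cluster configuration.
DEFAULT_JDBC_URL = "jdbc:hive2://analytics-hive.example.org:10000/default"

def build_beeline_command(extra_args):
    """Return a beeline argv with the default connection string injected,
    unless the caller already supplied -u themselves."""
    if "-u" in extra_args:
        return ["beeline"] + list(extra_args)
    return ["beeline", "-u", DEFAULT_JDBC_URL] + list(extra_args)

# A wrapper script would then hand this argv to the real client, e.g.
#   subprocess.call(build_beeline_command(sys.argv[1:]))
```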
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
My new research project, inspired by the following CfP
(http://www.asanet.org/journals/TS/SpecialIssueCall.cfm), aims to assess
how effective our teaching assignments on Wikipedia have been, in the
context of my globalization lectures, in which students have created or
expanded dozens of Wikipedia articles (you can see a partial list of
articles created by my students at
to get an idea of what I had them do over the past few years). It is
clear that Wikipedia benefits, but what about the students? Here are my
two questions to you.
First, my main source of data is going to be a survey of my former
students (N<100). I wonder if anyone is familiar with literature on
relevant metrics (i.e. how to design a survey to measure the
effectiveness of a teaching instrument)? I have never surveyed students
before, and while I am in the middle of a lit review, any suggestions
would be appreciated. I am somewhat familiar with the literature on
teaching with Wikipedia, but sadly few works have published the surveys
they used. If anything comes to mind that you think would be good to use for
comparative studies, that would also be helpful.
Second, here is my draft survey: http://tinyurl.com/hehckvs
I'd appreciate any comments: is it too long? Are some questions
ambiguous? Unnecessary? Leading and creating bias in subsequent
questions? Should I rephrase something? Should I ask something else?
Thank you for any comments, and do not hesitate to be critical - I'd
much rather redo the survey now than after I send it out :)
Piotr Konieczny, PhD
We are doing research on a similar situation combining machine vision processing with volunteer annotations in a citizen science project. It would be interesting to see how much translates across these settings, e.g., if our ideas about using the machine annotations are applicable here as well.
Kevin Crowston | Associate Dean for Research and Distinguished Professor of Information Science | School of Information Studies
348 Hinds Hall
Syracuse, New York 13244
t (315) 443.1676 f 315.443.5806 e crowston(a)syr.edu
From: Jan Dittrich <jan.dittrich(a)wikimedia.de>
Subject: Re: [Wiki-research-l] Google open source research on automatic image captioning
I find it interesting what impact this could have on volunteers' sense
of achievement if captions are autogenerated or suggested and then
possibly affirmed or corrected.
On the one hand, one could assume a decreased sense of ownership;
on the other hand, it might be easier to comment on or correct a caption
than to write one from scratch, and feel much more efficient.
2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
> I forwarded this internally at WMF a few days ago. Clearly –
> before thinking of building workflows for human contributors to generate
> captions or rich descriptors of media files in Commons – we should look at
> what's available in terms of off-the-shelf machine learning services and
> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
> to perform tedious tasks machines are pretty good at, get humans to curate
> inputs and outputs of machines instead.
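The rule quoted above can be sketched as a simple routing step: machine-generated captions above a confidence threshold are accepted (and spot-checked later), while the rest are queued for volunteers to affirm or correct. The threshold and task shape here are illustrative assumptions, not anyone's actual pipeline:

```python
# Minimal illustration of the "humans curate machine output" pattern.
# The threshold value is an arbitrary placeholder for illustration.
REVIEW_THRESHOLD = 0.9

def route_caption(file_name, caption, confidence):
    """Decide what to do with a machine-generated caption for a media file."""
    if confidence >= REVIEW_THRESHOLD:
        # High-confidence captions are accepted and only spot-checked.
        return ("auto_accept", file_name, caption)
    # Low-confidence captions go to volunteers, who correct rather than
    # write from scratch.
    return ("human_review", file_name, caption)
```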
[forwarding my answer from analytics ml, I forgot to subscribe to this list too]
one solution may be to use a query on Wikidata to retrieve the name
for the stubs category in all the different languages. Then you could
use a tool like PetScan to retrieve all the pages in such categories,
or write your own tool by using either a query on the database or
You can find a sample solution here:
I wrote that thing while on a train, so it may be messy and/or sub-optimal.
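The Wikidata step might look roughly like the sketch below, which parses a wbgetentities-style response (props=sitelinks) into a per-wiki map of category titles. The sample response and item id are hand-made for illustration, not real API output:

```python
import json

# Hand-made sample shaped like a wbgetentities response with
# props=sitelinks; a real call would look something like:
#   https://www.wikidata.org/w/api.php?action=wbgetentities
#       &props=sitelinks&sites=enwiki&titles=Category:Stub_categories
#       &format=json
# Q123 is a hypothetical item id, not the actual one.
SAMPLE_RESPONSE = json.dumps({
    "entities": {
        "Q123": {
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Category:Stub categories"},
                "itwiki": {"site": "itwiki", "title": "Categoria:Stub"},
            }
        }
    }
})

def stub_category_names(response_text):
    """Map wiki dbname -> local title of the stub category."""
    entities = json.loads(response_text)["entities"]
    names = {}
    for entity in entities.values():
        for link in entity.get("sitelinks", {}).values():
            names[link["site"]] = link["title"]
    return names
```

The resulting map could then be fed to PetScan, or used to query each wiki's category table directly.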
I would like to thank Alex Monk and Yuvi Panda for their help with SQL
on paws today.
2016-09-20 11:26 GMT+02:00 Robert West <west(a)cs.stanford.edu>:
> Hi everyone,
> Does anyone know if there's a straightforward (ideally language-independent)
> way of identifying stub articles in Wikipedia?
> Whatever works is ok, whether it's publicly available data or data
> accessible only on the WMF cluster.
> I've found lists for various languages (e.g., Italian or English), but the
> lists are in different formats, so separate code is required for each
> language, which doesn't scale.
> I guess in the worst case, I'll have to grep for the respective stub
> templates in the respective wikitext dumps, but even this requires to know
> for each language what the respective template is. So if anyone could point
> me to a list of stub templates in different languages, that would also be
> Up for a little language game? -- http://www.unfun.me
> Analytics mailing list
The next Research Showcase will be live-streamed this Wednesday, September
21, 2016 at 11:30 AM (PST) 18:30 (UTC).
YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
This month's showcase includes:
Finding News Citations for Wikipedia
By Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of Hannover)
An important
editing policy in Wikipedia is to provide citations for added statements in
Wikipedia pages, where statements can be arbitrary pieces of text, ranging
from a sentence to a paragraph. In many cases citations are either outdated
or missing altogether. In this work we address the problem of finding and
updating news citations for statements in entity pages. We propose a
two-stage supervised approach for this problem. In the first step, we construct
a classifier to find out whether statements need a news citation or other
kinds of citations (web, book, journal, etc.). In the second step, we
develop a news citation algorithm for Wikipedia statements, which
recommends appropriate citations from a given news collection. Apart from
IR techniques that use the statement to query the news collection, we also
formalize three properties of an appropriate citation, namely: (i) the
citation should entail the Wikipedia statement, (ii) the statement should
be central to the citation, and (iii) the citation should be from an
authoritative source. We perform an extensive evaluation of both steps,
using 20 million articles from a real-world news collection. Our results
are quite promising, and show that we can perform this task with high
precision and at scale.
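As a toy illustration of how three such per-candidate signals (entailment, centrality, authority) could be combined into a single ranking score; the linear form and weights are placeholders, not the authors' actual model:

```python
# Toy sketch: combine the three properties named in the abstract into one
# score and rank candidate citations by it. Weights are arbitrary
# placeholders for illustration.
def citation_score(entailment, centrality, authority,
                   weights=(0.4, 0.3, 0.3)):
    """Combine three per-candidate signals, each in [0, 1], into one score."""
    w_e, w_c, w_a = weights
    return w_e * entailment + w_c * centrality + w_a * authority

def rank_candidates(candidates):
    """candidates: list of (url, entailment, centrality, authority) tuples,
    returned best-first."""
    return sorted(candidates,
                  key=lambda c: citation_score(*c[1:]),
                  reverse=True)
```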
Designing and Building Online Discussion Systems
By Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)
Today, conversations are
everywhere on the Internet and come in many different forms. However, there
are still many problems with discussion interfaces today. In my talk, I
will first give an overview of some of the problems with discussion
systems, including difficulty dealing with large scales, which exacerbates
additional problems with navigating deep threads containing lots of
back-and-forth and getting an overall summary of a discussion. Other
problems include dealing with moderation and harassment in discussion
systems and gaining control over filtering, customization, and means of
access. Then I will focus on a few projects I am working on in this space
now. The first is Wikum, a system I developed to allow users to
collaboratively generate a wiki-like summary from threaded discussion. The
second, which I have just begun, is exploring the design space of
presentation and navigation of threaded discussion. I will next discuss
Murmur, a mailing list hybrid system we have built to implement and test
ideas around filtering, customization, and flexibility of access, as well
as combating harassment. Finally, I'll wrap up with what I am working on at
Google Research this summer: developing a taxonomy to describe online forum
discussion and using this information to extract meaningful content useful
for search, summarization of discussions, and characterization of
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
(btw, if we cc analytics and research does that reach everyone using these
boxes? Like discovery, fundraising, etc? Basically, everyone who doesn't
see this message, raise your hand :))
On Wed, Sep 21, 2016 at 10:44 AM, Luca Toscano <ltoscano(a)wikimedia.org>
> Hi everybody,
> the Analytics team is going to reboot all the stat hosts (stat1002,
> stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
> (security upgrade required). The work will start tomorrow morning (Sep
> 22nd) at around 9:00 AM CEST.
> This task might interfere with ongoing Hadoop jobs or processes running on
> the stat* hosts, so please let me know if there is any reason to
> postpone the maintenance.
> Please also feel free to reach out to the analytics IRC channel or to me
> directly if you have more questions :)
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires to know
for each language what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
Up for a little language game? -- http://www.unfun.me
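The grep fallback described above could be sketched like this, with a hand-maintained map of per-language stub template names; the entries shown are illustrative, not a real list:

```python
import re

# Hypothetical per-language stub template names; the real list would have
# to be collected by hand or via Wikidata, as discussed in the thread.
STUB_TEMPLATES = {
    "en": ["stub"],   # matches {{stub}}, {{actor-stub}}, ...
    "it": ["stub"],
}

def is_stub(wikitext, lang):
    """Return True if the wikitext transcludes a known stub template."""
    for name in STUB_TEMPLATES.get(lang, []):
        # Match {{stub}}, {{Stub|...}}, {{foo-stub}} etc., case-insensitively.
        pattern = r"\{\{[^{}|]*" + re.escape(name) + r"\s*(\||\}\})"
        if re.search(pattern, wikitext, re.IGNORECASE):
            return True
    return False
```

This still scales only as well as the template map does, which is exactly the limitation raised above.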