To all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
tl;dr: Stop using stat100 by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
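For the /tmp note above, here is a minimal Python sketch of pointing a job's scratch space at your home directory instead. The helper name `make_scratch_dir` and the `~/scratch` location are hypothetical choices for illustration, not something provided on the new boxes.

```python
import tempfile
from pathlib import Path


def make_scratch_dir(base=None, prefix="job-"):
    """Create a private temporary directory under `base` (default: ~/scratch).

    `~/scratch` is an assumed location; any directory under your home works.
    """
    base = Path(base) if base is not None else Path.home() / "scratch"
    base.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a uniquely named directory accessible only by its owner.
    return Path(tempfile.mkdtemp(prefix=prefix, dir=str(base)))
```

A long-running job would call `make_scratch_dir()` once, write its intermediate files there, and remove the directory (for example with `shutil.rmtree`) when it finishes.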
Thanks all! I’ll send another email in about a month’s time to remind you
of the impending deadline of Sept 1.
The next Research Showcase will be live-streamed this Wednesday, July 26,
2017 at 11:30 AM (PDT) / 18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=yC1jgK8C8aQ
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
This month's presentation:
Freedom versus Standardization: Structured Data Generation in a Peer
Production Community
By *Andrew Hall*
In addition to encyclopedia articles
and software, peer production communities produce *structured data*, e.g.,
Wikidata and OpenStreetMap’s metadata. Structured data from peer production
communities has become increasingly important due to its use by
computational applications, such as CartoCSS, MapBox, and Wikipedia
infoboxes. However, this structured data is usable by applications only if
it follows *standards.* We did an interview study focused on
OpenStreetMap’s knowledge production processes to investigate how – and how
successfully – this community creates and applies its data standards. Our
study revealed a fundamental tension between the need to produce structured
data in a standardized way and OpenStreetMap’s tradition of contributor
freedom. We extracted six themes that manifested this tension and three
overarching concepts, *correctness, community,* and *code,* which help make
sense of and synthesize the themes. We also offer suggestions for improving
OpenStreetMap’s knowledge production processes, including new data models,
sociotechnical tools, and community practices.
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
With the start of the new fiscal year at the Wikimedia Foundation on July
1, the Research team has officially started work on Program 12:
Growing contributor diversity. Here are a few
announcements/pointers about this program and the research and work
that will be going into it:
* We aim to keep the research documentation for this project on the
corresponding research page on meta. 
* Research tasks are hard to break down and track in task-tracking
systems. That said, any task that we can break down and track
will be documented under the corresponding Epic task on Phabricator.
* The goals for this Program for July-September 2017 (Quarter 1) are
captured on MediaWiki. (The Phabricator epic will be updated with
corresponding tasks as we start working on them.)
* Our three formal collaborators (cc-ed) will contribute to this
program: Jérôme Hergueux from ETH, Paul Seabright from TSE, and Bob
West from EPFL. We are thankful to these people who have agreed to
spend their time and expertise on this project in the coming year, and
to those of you who have already worked with us as we were shaping the
proposal for this project and are planning to continue your
contributions to this program. :)
* I act as the point of contact for this research at the Wikimedia
Foundation. Please feel free to reach out to me (directly, if it
cannot be shared publicly) if you have comments/questions about the
project in the coming year.
Senior Research Scientist
This was in the recent Research Newsletter:
They found a correlation between the length of articles about tourist
destinations and the number of tourists visiting them. They tried to
influence other destinations by adding content and did not find a
correlation in the subsequent number of tourists, suggesting that the
causation flows from tourism to article length instead.
But I was taken aback by the last line of their paper, "using the
suggested research design to study other areas of information
acquisition, such as medicine or school choices could be fruitful
Are there any ethical guidelines concerning whether this is
reasonable? Should there be?
[If you are not interested in discussions related to the category system
(on English Wikipedia), you can stop here. :)]
We have run into a problem that some of you may have thought about or
addressed before. We are trying to clean up the category system on English
Wikipedia by turning the category structure into an IS-A hierarchy. (The
output of this work can be useful for the research on template
recommendation, for example, but the use cases won't stop there.) One
issue that we are facing is the following:
We are currently using
SQL dumps to extract categories associated with every article on English
Wikipedia (main namespace).
Using this approach, we get 5 categories associated with the Flow cytometry
bioinformatics article:
The problem is that only the first two categories are the ones we are
interested in. We have one cleaning step in which we only keep
categories that belong to category Article, and that step removes the last
category above, but the other two Wikipedia_... categories remain. We need to
somehow prune the data and remove those two categories.
One way we could do the above would be to parse wikitext instead of the SQL
dumps and focus on extracting categories marked by pattern [[Category:XX]],
but in that case, we would lose a good category such as
because that's generated by a template.
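A minimal sketch of the wikitext approach just described, assuming a plain regex is good enough for a first pass. The function name `extract_categories` is illustrative, and, as noted, this only sees categories written explicitly in the page source, not those added by templates.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]]; the sort key is ignored.
# Links of the form [[:Category:Name]] are deliberately not matched, since they
# link to a category rather than assign one.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return category names written explicitly in the wikitext, in order."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        # Use underscores, matching how cl_to stores titles in categorylinks.
        name = match.group(1).strip().replace(" ", "_")
        if name not in names:
            names.append(name)
    return names
```

Comparing this output against the `categorylinks` rows for the same page gives a rough list of the template-generated categories, which could then be reviewed by hand.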
Any ideas on how we can start with a "cleaner" dataset of categories
related to the topic of the articles as opposed to maintenance related or
other types of categories?
The exact code we use is
SELECT p.page_id AS id, p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p
ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page'
AND p.page_namespace = 0
AND p.page_is_redirect = 0
and the edges of the category graph are extracted with
SELECT p.page_title AS category, cl.cl_to AS parent
FROM categorylinks cl
JOIN page p
ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14
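A rough sketch of how (category, parent) edges like these could feed the cleaning step mentioned earlier, i.e. keeping only categories that sit under a chosen root. The function name `descendants_of` and the idea of holding all edges in memory are assumptions for illustration, not the exact code used in this work.

```python
from collections import defaultdict, deque


def descendants_of(root, edges):
    """Return every category reachable from `root` by following parent -> child links.

    `edges` is an iterable of (category, parent) pairs, in the shape returned by
    the edge query. The `seen` set guards against cycles in the category graph.
    """
    children = defaultdict(set)
    for category, parent in edges:
        children[parent].add(category)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

An article's categories can then be filtered with a set membership test against the returned set. Note that this keeps anything reachable from the root, so maintenance categories that are themselves filed under the root would still pass through.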
Thank you, Iolanda, for highlighting the role of educators in contributing
We do recognize that teachers often have highly relevant domain expertise.
Moreover, in many cases teachers are better than academics at providing an
overview of a topic in a balanced way.
Your approach for involving academic experts is interesting (i.e. having
them produce scientific reports that are linked - but not part of - the
Thank you for sharing the links and report. We would be very interested in
reading your paper on the topic; please do share it when ready.
As for our own project -
As we wrote before, we take a very narrow definition of formal expertise
and focus on those publishing research papers on the particular topic of
the article they are contributing to.
Investigating the role of teachers in Wikipedia is a worthy quest, but
unfortunately it falls outside the scope of our current project.
Alex, Einat and Ofer