For all Hive users on stat1002/1004: you might have seen a deprecation
warning when launching the Hive client, saying it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper set up to make this easier. The
old Hive CLI will continue to exist, but we encourage moving over to
Beeline. You can use it by logging into the stat1002/1004 boxes as
usual and launching `beeline`.
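Beeline can also run a single query non-interactively via its `-e` flag, which is handy for scripting. A minimal sketch in Python (the guard and the example query are just illustrations; the wrapper supplies the JDBC connection string for you):

```python
import shutil
import subprocess

def beeline_cmd(query):
    """Build the argv for a non-interactive beeline run.

    ``beeline -e`` executes a single query and exits.
    """
    return ["beeline", "-e", query]

# Only attempt to run it where beeline actually exists
# (e.g. on the stat1002/1004 boxes).
if shutil.which("beeline"):
    out = subprocess.run(beeline_cmd("SHOW DATABASES;"),
                         capture_output=True, text=True)
    print(out.stdout)
```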
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
The webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience in parsing user agents in
EventLogging data for the purpose of detecting browser family, version
etc, and would like to share advice on how to do this most
efficiently? (In the past, I have written some expressions in MySQL to
extract the app version number for the Wikipedia apps. But it seems a
bit of a pain to do that for classifying browsers in general. One
option would be to export the data and use the Python version of
ua-parser; however, doing it directly in MySQL would fit better into
the existing workflow.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
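For a rough sense of what such parsing involves, here is a minimal ordered-regex sketch in Python. This is only an illustration: the real ua-parser rule set is far larger and community-maintained, and the families and patterns below are my own simplification (note the ordering matters, since Chrome UAs contain a Safari token and Edge UAs contain a Chrome token):

```python
import re

# Minimal, illustrative UA classifier. Patterns are checked in order;
# more specific browsers must come before the tokens they embed.
PATTERNS = [
    ("Edge",    re.compile(r"Edge/(\d+)")),
    ("Chrome",  re.compile(r"Chrome/(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Safari",  re.compile(r"Version/(\d+).*Safari/")),
]

def classify(ua):
    """Return (browser_family, browser_major), or ("Other", None)."""
    for family, pattern in PATTERNS:
        m = pattern.search(ua)
        if m:
            return family, m.group(1)
    return "Other", None
```

For production use, the ua-parser library's maintained regex database is the safer choice; the sketch mainly shows why hand-rolling this in MySQL gets painful quickly.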
IRC (Freenode): HaeB
Hope this finds you all well. I'm wondering if there is a way/tool to
identify the articles that exist in one edition of Wikipedia and have
counterparts in another. I'm also wondering if there is a way to generate a
list of these articles' titles for certain categories.
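I'm not aware of a single ready-made tool, but the MediaWiki API exposes both pieces: `prop=langlinks` reports whether a page has a counterpart on another language's wiki, and `list=categorymembers` lists titles in a category. A sketch of building such queries (the wiki and language choices here are just examples):

```python
from urllib.parse import urlencode

def langlinks_url(title, source="en", target="ar"):
    """Query whether *title* on the source wiki has an
    interlanguage link to the target wiki."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?" % source + urlencode(params)

def category_members_url(category, source="en"):
    """Query the article titles in a given category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "500",
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?" % source + urlencode(params)
```

Fetching each URL (e.g. with urllib) returns JSON; combining the two lets you list category members and then check each for a counterpart in the other edition.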
*Kind regards,*
*Reem Al-Kashif*
On Saturday, Oct 29, at 8 am UTC, the web server for the dumps and other
datasets will be unavailable due to maintenance. This should take no
longer than 10 minutes. Thanks for your understanding.
Due to a severe kernel vulnerability (
https://access.redhat.com/security/vulnerabilities/2706661), I need to
reboot the stat1002, stat1003 and stat1004 hosts to install the new kernel.
The reboots are scheduled for 9 AM CEST tomorrow (Oct 21st). Please follow
up with me or anybody on the Analytics team if you have ongoing work that
can't be stopped.
The Analytics Hadoop and Kafka clusters will also be rebooted during the
next few hours. Even though this maintenance shouldn't cause any major
issues, you might experience some service degradation. More up-to-date
information will be available on IRC in the analytics and operations
channels.
Thanks and apologies in advance for the trouble!
The next Research Showcase will be live-streamed this Wednesday, October
19, 2016, at 11:30 AM PST (18:30 UTC).
Link for remote presenters to join the Hangout on Air:
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
YouTube stream: https://www.youtube.com/watch?v=cBImUZ_si5s
This month's showcase includes:
Human centered design for using and editing structured data in Wikipedia
infoboxes
By *Charlie Kritschmar, Intern, Wikimedia Deutschland
<https://meta.wikimedia.org/wiki/Wikimedia_Deutschland>*

Wikidata is a
Wikimedia project which stores structured data to be used by other
Wikimedia projects like Wikipedia. Currently, integrating its data in
Wikipedia is difficult for users, since there’s no predefined way to do so
and requires some technical knowledge. To tackle these issues,
human-centered design methods were applied to find needs from which
solutions were generated and evaluated with the help of the community. The
concept may serve as a basis which may be implemented into various Wiki
projects in the future to make editing Wikidata from within another
Wikimedia project more user-friendly and improve the project’s acceptance
in the community.
Emergent Work in Wikipedia
By *Ofer Arazy <http://oferarazy.com/> (University of Haifa)*

Online production communities
present an exciting opportunity for investigating novel organizational
forms. Extant theoretical accounts of knowledge co-production point to
organizational policies, norms, and communication as key mechanisms
enabling the coordination of work. Yet, in practice participants in
initiatives such as Wikipedia are often occasional contributors who are
unaware of community policies and do not communicate with other members.
How then is work coordinated and how does the organization maintain
stability in the face of dynamics in individuals’ task enactment? In this
study we develop a conceptualization of emergent roles (the prototypical
activity patterns that organically emerge from individuals’ spontaneous
actions) and investigate the temporal dynamics of emergent role behaviors.
Conducting a multi-level large-scale empirical study stretching over a
decade, we tracked co-production of a thousand Wikipedia articles, logging
two hundred thousand distinct participants and seven hundred thousand
co-production activities. Using a combination of manual tagging and machine
learning, we annotated each activity type, and then clustered participants’
activity profiles to arrive at seven prototypical emergent roles. Our
analysis shows that participants’ behavior is turbulent, with substantial
flow in and out of co-production work and across roles. Our findings at the
organizational level, however, show that work is organized around a highly
stable set of emergent roles, despite the absence of traditional
stabilizing mechanisms such as pre-defined work procedures or role
expectations. We conceptualize this dualism in emergent work as “Turbulent
Stability”. Further analyses suggest that co-production is
artifact-centric, where contributors mutually adjust according to the
artifact’s changing needs. Our study advances the theoretical
understandings of self-organizing knowledge co-production and particularly
the nature of emergent roles.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates: https://www.mediawiki.org/
Please feel free to forward this email to anyone who might be interested.
In case you recently observed unexpected drops in Wikimedia site traffic
from France, see below.
---------- Forwarded message ----------
From: "geni" <geniice(a)gmail.com>
Date: Oct 17, 2016 1:55 PM
Subject: [Wikimedia-l] We appear have been partially blocked in France
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Apparently, on the orders of the French government, Orange added us to
their list of blocked terrorist sites. This apparently had the fun
effect of DoSing the government page people were redirected to. Source:
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/
New messages to: Wikimedia-l(a)lists.wikimedia.org