Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it's being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper set up
to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
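For context, here is a rough sketch of the kind of invocation the wrapper saves you from typing. It is not the wrapper itself: the host name and user below are hypothetical placeholders, and port 10000 and the `default` database are only the usual HiveServer2 defaults, assumed here for illustration. The `-u` (JDBC URL) and `-n` (user name) options are standard Beeline flags.

```python
import shlex
from typing import Optional


def beeline_command(host: str, port: int = 10000, database: str = "default",
                    user: Optional[str] = None) -> str:
    """Build a plain beeline command line with an explicit JDBC connection string.

    host, port, database and user are illustrative assumptions, not the
    values used on the stat boxes.
    """
    jdbc_url = f"jdbc:hive2://{host}:{port}/{database}"
    args = ["beeline", "-u", jdbc_url]
    if user:
        args += ["-n", user]
    # shlex.join quotes arguments where needed so the result is shell-safe.
    return shlex.join(args)


# Hypothetical host and user, for illustration only.
print(beeline_command("hive-server.example.org", user="alice"))
```

With the wrapper in place, typing `beeline` on its own is enough.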
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
The migrated reportcard includes both legacy and current pageview data,
daily unique devices and new editors data. Pageview and devices data is
updated daily but editor data is still updated ad-hoc.
The team is currently working on revamping the way we compute edit data
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
tl;dr: Stop using stat100 by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
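As a minimal sketch of the /tmp note above, a long-running Python job could create its scratch directory under the home directory instead of the system default. The helper name `make_scratch_dir` and the `job-` prefix are made up for illustration; `tempfile.mkdtemp` accepts a `dir` argument for exactly this purpose.

```python
import os
import shutil
import tempfile
from typing import Optional


def make_scratch_dir(base: Optional[str] = None, prefix: str = "job-") -> str:
    """Create a private scratch directory under `base` (the home directory by default)."""
    if base is None:
        base = os.path.expanduser("~")
    return tempfile.mkdtemp(prefix=prefix, dir=base)


if __name__ == "__main__":
    scratch = make_scratch_dir()
    try:
        # Write intermediate results here rather than in /tmp.
        with open(os.path.join(scratch, "partial.tsv"), "w") as fh:
            fh.write("example\n")
    finally:
        # Clean up afterwards so the home directory does not fill up either.
        shutil.rmtree(scratch)
```

Many command-line tools also honour the `TMPDIR` environment variable, which may be simpler for jobs that are not written in Python.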
Thanks all! I’ll send another email in about a month’s time to remind you
of the impending deadline of Sept 1.
EventStreams just experienced a 24 hour ‘outage’. There were no dropped
messages, but for about 24 hours no messages were sent to connected
I’ve written up the Incident Report here:
The worst part about this is that we didn’t know that there was a problem
until a user notified me on IRC. We monitor and alert on pieces of
EventStreams infrastructure, but don’t monitor topic volume, as it varies
and is hard to get right. However, this shouldn’t have taken 24 hours and
a user for us (me) to notice, so I’ve created
https://phabricator.wikimedia.org/T174493 to help us catch something like
this in the future.
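One way to approach this kind of check without modelling normal traffic is to alert on prolonged silence rather than on volume. The sketch below only illustrates that idea and is not what the ticket implements; the function name and the 15-minute threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_message_at: datetime, now: datetime,
               max_silence: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if no message has been seen for longer than max_silence.

    max_silence is an assumed threshold; it would have to be tuned per topic,
    since quiet topics can legitimately be silent for longer.
    """
    return now - last_message_at > max_silence


now = datetime(2017, 8, 30, 12, 0, tzinfo=timezone.utc)
print(is_stalled(now - timedelta(hours=24), now))   # a 24-hour gap is flagged
print(is_stalled(now - timedelta(minutes=2), now))  # a 2-minute gap is not
```

A silence threshold is cruder than a volume model, but it sidesteps the problem of traffic that varies over the day.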
Apologies if this caused any inconvenience.
Systems Engineer, Wikimedia Foundation
Hi Analytics Fellows,
Yesterday we broke and fixed the Hive wmf.webrequest table.
Jobs not monitored by the Analytics team might have failed - Check your
Yesterday at 9am UTC we deployed a change to the hive wmf.webrequest table
that broke some of its functionality. More precisely, queries to the table
that needed to read parquet columns in detail would fail with a hive
The problem went unnoticed for several hours, since most of our complex
computation jobs run only at night; we only became aware of it at
~18:00 UTC (kudos @bearloga!).
We quickly fixed the issue and restarted the needed jobs over the
Given how the affected jobs failed, we are sure that
there has been no data corruption: jobs would fail before even starting to
compute anything. For production jobs we monitor, we know which jobs
have failed and we've taken care of them. However, for jobs that are not
monitored (report-updater, manual scripts etc), some silent failures might
have occurred. Please check your logs :)
Data Engineer @ Wikimedia Foundation
The next Research Showcase will be live-streamed this Wednesday, August 23,
2017 at 11:30 AM (PDT), 18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=Fa0Ztv2iF4w
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
This month's presentation:
Sneha Narayan (Northwestern University)
*The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for
Integrating new users into a community with complex norms presents a
challenge for peer production projects like Wikipedia. We present The
Wikipedia Adventure (TWA): an interactive tutorial that offers a structured
and gamified introduction to Wikipedia. In addition to describing the
design of the system, we present two empirical evaluations. First, we
report on a survey of users, who responded very positively to the tutorial.
Second, we report results from a large-scale invitation-based field
experiment that tests whether using TWA increased newcomers' subsequent
contributions to Wikipedia. We find no effect of either using the tutorial
or of being invited to do so over a period of 180 days. We conclude that
TWA produces a positive socialization experience for those who choose to
use it, but that it does not alter patterns of newcomer activity. We
reflect on the implications of these mixed results for the evaluation of
similar social computing systems.
Andrew Su (Scripps Research Institute)
*The Gene Wiki: Using Wikipedia and Wikidata to organize biomedical
The Gene Wiki project began in 2007 with the goal of creating a
collaboratively-written, community-reviewed, and continuously-updated
review article for every human gene within Wikipedia. In 2013, shortly
after the creation of the Wikidata project, the project expanded to include
the organization and integration of structured biomedical data. This talk
will focus on our current and future work, including efforts to encourage
contributions from biomedical domain experts, to build custom applications
that use Wikidata as the back-end knowledge base, and to promote
CC0-licensing among biomedical knowledge resources. Comments, feedback and
contributions are welcome at https://github.com/SuLab/genewikicentral and
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
stats.grok.se (a source of pageview stats for the time before the Wikimedia
API became available) has been down for about a week. I tried emailing
Henrik Abelsson, whom I've previously contacted when the site had issues,
but haven't received a response this time.
Any ideas on why it's down and whom to reach out to to help resolve the
I'm currently working on gathering data for the Autoconfirmed article creation
trial project. One of the measures we're interested in is the number of
new articles, both surviving and deleted, that are created per day. I know
that recent data is logged through EventBus, but if possible I'd also
like to have historic stats on this (e.g. going back a handful of years).
Would there happen to be a dataset of that available somewhere?