Hive users on stat1002/1004: you might have seen a deprecation warning when
launching the hive client, saying that it is being replaced with Beeline.
The Beeline shell has always been available, but it required supplying a
database connection string every time, which was pretty annoying. We now
have a wrapper set up to make this easier. The old Hive CLI will continue
to exist, but we encourage moving over to Beeline. You can use it by
logging into the stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
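To illustrate what a wrapper like the one mentioned above saves you from typing, here is a minimal Python sketch that assembles a Beeline command line with the connection string filled in. It is not the actual wrapper: `build_beeline_command` is a hypothetical helper, and the host name, port and database are placeholders rather than the real cluster settings. It relies only on Beeline's `-u` (JDBC URL), `-n` (user name) and `-e` (query) options.

```python
import os

# Placeholder connection details (assumptions); the real HiveServer2
# host is not given in this message.
HIVE_HOST = "hive-server.example.org"
HIVE_PORT = 10000
HIVE_DATABASE = "default"


def build_beeline_command(extra_args=None, user=None):
    """Return the argv list for beeline with the JDBC URL pre-filled."""
    jdbc_url = f"jdbc:hive2://{HIVE_HOST}:{HIVE_PORT}/{HIVE_DATABASE}"
    # Fall back to the login name from the environment if no user is given.
    user = user or os.environ.get("USER", "unknown")
    return ["beeline", "-u", jdbc_url, "-n", user] + list(extra_args or [])


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # without access to a cluster.
    print(" ".join(build_beeline_command(["-e", "SHOW DATABASES;"])))
```

Without a wrapper, everything up to and including the `-n` argument would have to be typed by hand on every launch.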
Page Previews is now fully deployed to all but 2 of the Wikipedias. In
deploying it, we've created a new way to interact with pages without
navigating to them. This impacts the overall and per-page pageviews metrics
that are used in myriad reports, e.g. to editors about the readership of
their articles and in monthly reports to the board. Consequently, we need
to be able to report a user reading the preview of a page just as we
report a user navigating to it.
Readers Web are planning to instrument Page Previews such that when a
preview is available and open for longer than X ms, a "page interaction" is
recorded. We're aware of a couple of mechanisms for recording something
like this from the client:
1. All files viewed with the media viewer are recorded by the client
requesting the /beacon/media?duration=X&uri=Y URL at some point. As
Nuria points out in that thread, requests to /beacon/... are already
filtered and a canned response is sent immediately by Varnish.
2. Requesting a URL with the X-Analytics header set to "preview". In
this context, we'd make a HEAD request to the URL of the page with the
IMO #1 is preferable from the operations and performance perspectives as
the response is always served from the edge and includes very few headers,
whereas the request in #2 may be served by the application servers if the
user is logged in (or in the mobile site's beta cohort). However, the
requests in #2 are already
We're currently considering recording page interactions when previews are
open for longer than 1000 ms. We estimate that this would increase overall
web requests by 0.3%.
Are there other ways of recording this information? We're fairly confident
that #1 is the best choice here, but it's referred to as the
"virtual file view hack". Is this really the case? Moreover, should we
request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we
consolidate the URLs, as both essentially represent the same thing?
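As a rough illustration of option #1 with a distinct URL, the Python sketch below decides whether a preview counts as a page interaction and builds the beacon URL for it. The function names are hypothetical, `/beacon/preview` is only the path floated in this message (not an existing endpoint), and 1000 ms is the threshold currently under consideration, not a settled value.

```python
from urllib.parse import urlencode

PREVIEW_THRESHOLD_MS = 1000       # threshold under consideration (assumption)
BEACON_PATH = "/beacon/preview"   # proposed path, not an existing endpoint


def should_record_interaction(open_duration_ms, threshold_ms=PREVIEW_THRESHOLD_MS):
    """A preview counts only if it stayed open longer than the threshold."""
    return open_duration_ms > threshold_ms


def build_preview_beacon_url(page_uri, open_duration_ms):
    """Build the beacon URL, percent-encoding the page URI as a query value."""
    query = urlencode({"duration": int(open_duration_ms), "uri": page_uri})
    return f"{BEACON_PATH}?{query}"


if __name__ == "__main__":
    duration = 1500
    if should_record_interaction(duration):
        print(build_preview_beacon_url("https://en.wikipedia.org/wiki/Cat", duration))
```

Consolidating with the media viewer beacon would amount to changing `BEACON_PATH`; the query parameters would stay the same.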
IRC (Freenode): phuedx
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
The migrated reportcard includes both legacy and current pageview data,
daily unique devices and new editors data. Pageview and devices data is
updated daily but editor data is still updated ad-hoc.
The team is currently revamping the way we compute edit data,
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
*TL;DR*: The Analytics Hadoop cluster will be completely down for max
2h on *Feb 6th* (EU/CET morning) to upgrade all the daemons to Java 8.
We are planning to upgrade the Analytics Hadoop cluster to Java 8 on *Feb
6th* (EU/CET morning) for https://phabricator.wikimedia.org/T166248.
Sadly we can't do a rolling upgrade of all the JVM-based Hadoop daemons,
since the distribution that we use (Cloudera) recommends performing the
upgrade only after a complete cluster shutdown. This means that for a
couple of hours (hopefully a lot less) all the Hadoop-based services
(Hive, Oozie, HDFS, etc.) will be unavailable.
We have tested the new configuration in labs and all the regular Analytics
jobs seem to work correctly, so we don't expect major issues after the
upgrade, but if you have any questions or concerns please follow up in the
Luca and Andrew (on behalf of the Analytics team)
I am copying the analytics mailing list on this message, as this is the best
way to get answers to your requests about data or technical aspects of the analytics
The dataset you ask for contains data that we don't provide without an NDA.
Specifically, we don't publicly disclose precisely timestamped hits, to
prevent sessions from being easily reconstructed.
The easiest way for you to get your hands on that data would be to set
up a formal collaboration with WMF, involving an NDA.
I'm not an expert in how to do that; you may want to contact the
research team (wiki-research-l(a)lists.wikimedia.org), and read more here:
On Fri, Jan 26, 2018 at 1:55 PM, Jianyun Sun <simonjoylet(a)gmail.com> wrote:
> Hi joal,
> I'm a student at Southeast University, currently doing research on
> better scheduling of web requests.
> For my experiments, I need Wikimedia web request data, especially
> page request records with timestamps and response sizes. One month of
> data would be enough. Would you please send me a copy or help me get
> access to Hive so I can get it myself?
> I'm looking forward to your reply. Thank you sincerely!
Data Engineer @ Wikimedia Foundation
If you have time, do skim through these docs. I will do the same between
today and tomorrow; they are pretty informative about the annual plan
and what Audiences is doing.
---------- Forwarded message ----------
From: Jon Katz <jkatz(a)wikimedia.org>
Date: Tue, Jan 30, 2018 at 8:16 PM
Subject: [Product] Fwd: Session #6 and into all hands
To: WMF Product Team <wmfproduct(a)lists.wikimedia.org>
*Annual Planning Context*
Last week we came together and, among other things, talked about strategy
and annual planning. As I promised when presenting, I wanted to share more
context and details about the annual planning process with you. Many of you
probably aren't interested in how we homed in on the product principles or
annual plan goal, but for those of you who *are*, the process was pretty
heavily documented and you should feel free to dig in. If you have
specific questions, pinging Danny, Josh or me is probably the fastest way
to get an answer. We're also really interested in learning where the holes
are, so hearing your questions/feedback is really useful.
Below is an email I shared with the planning group before all hands, but
here are some other artifacts you might find helpful:
from all-audiences meeting at all-hands.
- Annual planning outcome doc: this document elaborates on what we think
the theme means as well as what each of the 'output' groupings mean. It is
definitely a living document, and subject to modification.
- Working doc for the session before all-hands: this is the session where
we identified the
- Session notes: notes from every session of the coordinating group
- The emails (like the one below) before and after each session
- Shared folder with all documentation and output
As far as next steps go, product owners will be working with their teams to
identify projects that fit into the core output groupings and that will
lead to the impact we've identified.
*Once we have a rough sense of the year, if we haven't already, we will
need to run it by:*
- *engineering ASAP to assess feasibility and timelines*
- *adjacent teams and dependencies like CE, data analytics, etc.*
In parallel, the data analysts, Danny and I are starting to define the
primary metrics we will use to evaluate success; we'll run those by the
POs for approval. The annual plan draft is due Feb 23.
Again, please reach out with any questions or feedback. If you're
scratching your head, I've probably done something wrong and should try to
fix it :)
---------- Forwarded message ----------
From: Jon Katz <jkatz(a)wikimedia.org>
Date: Sat, Jan 20, 2018 at 2:08 PM
Subject: Session #6 and into all hands
To: Dan Garry <dgarry(a)wikimedia.org>, "ggellerman(a)wikimedia.org" <
ggellerman(a)wikimedia.org>, Joshua Minor <jminor(a)wikimedia.org>, James
Forrester <jforrester(a)wikimedia.org>, Ramsey Isler <risler(a)wikimedia.org>,
Toby Negrin <tnegrin(a)wikimedia.org>, Runa Bhattacharjee <
rbhattacharjee(a)wikimedia.org>, Lydia Pintscher <lydia.pintscher(a)wikimedia.de
>, Charlotte Gauthier <cgauthier(a)wikimedia.org>, Neil Quinn <
nquinn(a)wikimedia.org>, Corey Floyd <cfloyd(a)wikimedia.org>, Trevor Bolliger <
tbolliger(a)wikimedia.org>, Danny Horn <dhorn(a)wikimedia.org>, Roan Kattouw <
rkattouw(a)wikimedia.org>, Amir Aharoni <aaharoni(a)wikimedia.org>, Nirzar
Pangarkar <npangarkar(a)wikimedia.org>, Olga Vasileva <ovasileva(a)wikimedia.org>,
Joe Matazzoni <jmatazzoni(a)wikimedia.org>, Anne Gomez <agomez(a)wikimedia.org>,
Amanda Bittaker <abittaker(a)wikimedia.org>, Adam Baso <abaso(a)wikimedia.org>
*TLDR:* More clarity for 2018-2019! Over the last 2 weeks, the
coordinating team better-defined the audience department's impact theme and
identified 4 specific project areas (outputs) we think the teams should
focus their efforts on to generate that impact. The next steps are to
identify the specific projects under those areas and which team is working
on them. We should start on this next week with our teams. Reasonable
docs to look at: the outcome doc and the working doc for session #6.
*Since last communication*
The last time most of us met, in session #5, we discussed our theme and
progress to date and pointed out the latest research made
available to us by the movement direction work.
Leading into session #6, the coordinating team decided that jumping from an
impact theme to projects was probably not constrained enough, so we took a
stab at identifying project groupings, which we are calling "outcomes", and
then narrowing down that list. This way, teams have a lot more focus when
they sit down to identify what it is they should work on. By the end of
session #6, we had cut that group of themes from 9 to 4. We also did a
fair bit of work in refining the overall goal to explicitly answer some of
the questions raised in Session #5. The attached preso
is one that we walked through in session #6, but I have modified it a bit
to show the outcomes and explain why. Our notes
from the session might also be helpful if you want to understand a
particular bit of context.
Coming out of that meeting, we have a draft of the strategy we hope will
provide significant guidance as we plan specifics for the coming year.
The outcome doc elaborates
on what we think the theme means as well as what each of the 'output'
groupings mean. It is definitely a living document, and subject to
modification. Please review and comment or suggest changes. The document
gets particularly sparse at the bottom of this tree, where the ideas are
not as well fleshed out.
The next steps are to identify what the highest priority projects are for
each grouping, carve out a subset for 2018-2019 and then align them with
resources. As part of due diligence, we will need engineering to weigh in
on feasibility and bandwidth and to figure out how we plan on measuring
success for each project. Everyone in the coordinating group felt that this
was probably best done at the team-level.
However, there will be some coordinating oversight necessary to ensure, for
instance, that shared resources like technology team support and community
appetite for change are not oversubscribed. There might also be potential
conflicts or disagreements, for instance if the mobile web team and the
editing team both choose VE on mobile web as a project they want to
tackle. For this reason, I suggest each team comes up with >1 set of
projects they would be interested in, in case there is a conflict.
- 1 annual plan of X projects with 1-2 alternatives for each project
- For each project, please:
  - explain theory of how it best leads to outcome and impact (taking
    into account constraints)
  - explain how you would measure success
  - identify assumptions/risks
Please reach out on or off-list if you have any concerns or suggestions
about any of this.
*All Hands Week Opportunities*
- Many of you are meeting with your teams on Monday - feel free to share
this outcome with your teams if you feel comfortable doing so, and to start
thinking about projects. I am also around on Monday if you'd like me to
come provide context, answer questions or bounce ideas around.
- On Tuesday we will have product day. In the afternoon, we have 2
hours devoted to moving forward on these next steps.
- We will be presenting this process/outcome and a few other related
things at the all-audiences meeting on Wednesday.
- On Wednesday there will be time in the Reading afternoon meeting
(60-90 minutes) and (I believe) in the Contributors meeting to discuss
further with your teams.
Again, your feedback, communication outward, and help filling in the blanks
here is really, really welcome. See you all next week!
Wmfproduct mailing list
Some notes about possible reasons below. As a data analyst in the
Foundation's Readers department, I am tracking our overall pageview
numbers on a monthly basis, which we report to the WMF board alongside
other metrics about editor activity etc. (This is also publicly
available at , where this recent pageview decline had already been
remarked upon earlier. What's more, you can check this regularly
updated chart for a visual year-over-year comparison:  )
There are probably multiple causes for this year-over-year decrease
observable during the last few months. We know about one of them for
certain: The recent rollout of "page previews" to all but two
Wikipedia versions. This is a new software feature that shows an
excerpt from the linked article when the reader hovers their mouse
over a link. It is designed to save readers the effort of clicking
through certain links. So a decrease in pageviews was fully expected
and is to some extent actually evidence for the feature's success.
According to our A/B tests, this decrease is around 2-4% (of desktop
pageviews). On the other hand, we are going to measure this new,
alternative form of reading Wikipedia (i.e. the number of previews
seen) just like we measure pageviews now; there is currently a
technical discussion about this on the Analytics-l mailing list. But
for now it is not yet reflected in our public traffic reports.
Google-referred pageviews did indeed see a year-over-year decrease of
some percent since November (but not before), although this may
still not explain the entire rest of the year-over-year change in
overall pageviews. Regarding Google's Knowledge Panel - i.e. their
Wikipedia extracts that you mentioned - a research paper published
last year has confirmed that it indeed has a negative effect on
our pageviews (which had long been the topic of speculation without
much actual evidence). However, Google already introduced this feature
in 2012, so it has been around for over half a decade now and can't be
responsible per se for any recent drops. One would need to look for
more recent changes made by Google. (They actually made a tweak to the
panels for a particular topic category in early November, but to
me it seems rather unlikely to have had a noticeable effect on our
overall Google referrals.)
Likewise, the internet-wide multi-year trend towards mobile doesn't
really explain this recent trend in our total (desktop + mobile)
pageviews - as James already pointed out, just a year ago we were
actually seeing a year-over-year *growth* of several percent for an
extended time period.
Keep in mind that while page requests by bots and spiders
are generally filtered out, the pageview numbers still encompass a
smaller amount of other automated views and artefacts, which can also
be responsible for sizable changes. In the data reported to the board
I apply various corrections to filter out some more of these. But
the numbers at stats.wikimedia.org still include them. For example, if
you had looked at the same year-over-year change last summer, you
would have encountered an even bigger year-over-year pageview drop,
which however is almost entirely spurious: An issue found and
mitigated in July/August 2016 had artificially inflated desktop
traffic by up to 30% during these two months. There is a Phabricator task
to correct this in the publicly available data, but it is still
Besides the monthly reports of core metrics at  which come with
brief observations about trends, we also publish a more in-depth slide
deck about readership core metrics once per quarter. The next one
will come out in two weeks and I plan to do some further analysis
(e.g. check if the decrease was focused geographically) in preparation
for that; so perhaps we will know a bit more then.
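To make the year-over-year comparisons above concrete, here is a small Python sketch of the calculation, including how an artificially inflated baseline produces a spurious drop. The function names are hypothetical and the figures are made-up round numbers for illustration, not actual pageview data.

```python
def yoy_change_percent(current, previous_year):
    """Year-over-year change of `current` relative to `previous_year`, in percent."""
    if previous_year == 0:
        raise ValueError("previous_year must be non-zero")
    return (current - previous_year) / previous_year * 100.0


def remove_inflation(observed, inflation_fraction):
    """Undo an artificial inflation; 0.30 means the observed value is 30% too high."""
    return observed / (1.0 + inflation_fraction)


if __name__ == "__main__":
    # Illustrative numbers only: genuine traffic is flat at 100 units, but
    # last year's figure was recorded 30% too high.
    last_year_observed = 130.0
    this_year = 100.0
    raw = yoy_change_percent(this_year, last_year_observed)
    corrected = yoy_change_percent(this_year, remove_inflation(last_year_observed, 0.30))
    print(f"raw: {raw:.1f}%, corrected: {corrected:.1f}%")
```

With these numbers the uncorrected comparison shows a drop of roughly 23% even though the underlying traffic did not change, which is why the baseline needs correcting before a decline is read as real.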
 http://discovery.wmflabs.org/external/#traffic_by_engine and
http://discovery.wmflabs.org/external/#traffic_summary , select weekly
or monthly smoothing for easier comparison, but keep in mind the
default view includes bots/spiders
 Connor McMahon, Isaac Johnson, Brent Hecht: "The Substantial
Interdependence of Wikipedia and Google: A Case Study on the
Relationship Between Peer Production Communities and Information
. BTW we are still looking for someone to volunteer a summary or
review of this paper for the Wikimedia Research Newsletter/ Wikipedia
Signpost, so that more community members can learn about this research
- contact me in case you're interested.
 Cf. last quarter's edition:
On Tue, Jan 23, 2018 at 2:55 AM, Anders Wennersten
> We are seeing a steady decrease of page views to our projects (Wikipedia). In Nov-Dec-Jan they are decreasing at a rate of 5-10% (year-over-year), and for big languages like Japanese and Spanish close to 10%, or in some months even more.
> Are there any insights into why this is so? Could it be that Google is taking over accesses with their ever better way of showing results directly (but then also by showing extracts of Wikipedia articles)?
> Or that our interface on mobiles is inferior, so we miss accesses from mobiles (now 54% of the total)? Or, horror of horrors, that users look for facts on all the new sites with fake news instead of Wikipedia?
>  https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
> Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
> New messages to: Wikimedia-l(a)lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:email@example.com?subject=unsubscribe>
IRC (Freenode): HaeB
At the Dev Summit, Birgit Müller and I will run a session on Growing the
MediaWiki Technical Community. If you're attending, we hope you will
consider joining us.
Everyone (attending the Dev Summit or not) is welcome and encouraged to
participate at https://phabricator.wikimedia.org/T183318 (please comment
there, rather than by email).
We are discussing the following questions:
* What would allow you to develop and plan your software more efficiently?
* What would make software development more fun for you?
* What other Open Source communities do we share interests with?
* How can we change our processes to take technical debt more seriously?
"Develop" means any kind of work on a software system, including design,
Our topics are:
* Better processes and project management practices, integrating all
developers and allowing them to work more efficiently
* Building partnerships with other Open Source communities on shared
interests (e.g. translation, audio, video)
* Reducing technical debt
A reminder that the livestream of our monthly research showcase starts in
45 minutes (11:30 PT).
- Video: https://www.youtube.com/watch?v=L-1uzYYneUo
- IRC: #wikimedia-research
On Tue, Jan 16, 2018 at 9:45 AM, Lani Goto <lgoto(a)wikimedia.org> wrote:
> Hi Everyone,
> The next Research Showcase will be live-streamed this Wednesday, January
> 17, 2018 at 11:30 AM (PST) 19:30 UTC.
> YouTube stream: https://www.youtube.com/watch?v=L-1uzYYneUo
> As usual, you can join the conversation on IRC at #wikimedia-research. And,
> you can watch our past research showcases here.
> This month's presentation:
> *What motivates experts to contribute to public information goods? A field
> experiment at Wikipedia*
> By Yan Chen, University of Michigan
> Wikipedia is among the most important information sources for the general
> public. Motivating domain experts to contribute to Wikipedia can improve
> the accuracy and completeness of its content. In a field experiment, we
> examine the incentives which might motivate scholars to contribute their
> expertise to Wikipedia. We vary the mentioning of likely citation, public
> acknowledgement and the number of views an article receives. We find that
> experts are significantly more interested in contributing when citation
> benefit is mentioned. Furthermore, cosine similarity between a Wikipedia
> article and the expert's paper abstract is the most significant factor
> leading to more and higher-quality contributions, indicating that better
> matching is a crucial factor in motivating contributions to public
> information goods. Other factors correlated with contribution include
> social distance and researcher reputation.
> *Wikihounding on Wikipedia*
> By Caroline Sinders, WMF
> Wikihounding (a form of digital stalking on Wikipedia) is incredibly
> qualitative and quantitative. What makes wikihounding different than
> mentoring? It's the context of the action or the intention. However, all
> interactions inside of a digital space have a quantitative aspect to them; every
> comment, revert, etc. is a data point. By analyzing data points
> comparatively inside of wikihounding cases and reading some of the cases,
> we can create a baseline for what are the actual overlapping similarities
> inside of wikihounding to study what makes up wikihounding. Wikihounding
> currently has a fairly loose definition. Wikihounding, as defined by the
> Harassment policy on en:wp, is: “the singling out of one or more editors,
> joining discussions on multiple pages or topics they may edit or multiple
> debates where they contribute, to repeatedly confront or inhibit their
> work. This is with an apparent aim of creating irritation, annoyance or
> distress to the other editor. Wikihounding usually involves following the
> target from place to place on Wikipedia.” This definition doesn't outline
> parameters around cases such as frequency of interaction, duration, or
> minimum reverts, nor is there a lot known about what a standard or
> canonical case of wikihounding looks like. What is the average wikihounding
> case? This talk will cover the approaches that members of the
> research team (Diego Saez-Trumper, Aaron Halfaker and Jonathan Morgan) and I are
> taking in starting this research project.
> Lani Goto
> Project Assistant, Engineering Admin
*Dario Taraborelli*
Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter