Are views of republished Wikimedia content, such as on Google and YouTube,
something that we could include in addition to Wikimedia pageview
statistics? I imagine that this would require cooperation from Alphabet and
other companies that reuse Wikimedia content. It would be nice if we could
get that cooperation.
Also, is this republication taken into account in website traffic rankings?
My guess is that the answer is no, and that other types of republication,
such as embedded YouTube videos, are not counted toward the content
provider's site rankings, although I think that YouTube would count views
of embedded videos in its own video view statistics. I am thinking that for
YouTube, Wikipedia, and other similar sites for which republication or
embedding is common, site rankings based on pageviews could significantly
underestimate the popularity and influence of those sites.
( https://meta.wikimedia.org/wiki/User:Pine )
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, stating that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper set up to make this easier. The old
Hive CLI will continue to exist, but we encourage moving over to Beeline.
You can use it by logging into the stat1002/1004 boxes as usual and
launching `beeline`.
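For example, a session might look roughly like this (the hostname and the
query are purely illustrative; `wmf.pageview_hourly` is assumed here as an
example pageview table):

```
$ ssh stat1004.eqiad.wmnet
$ beeline
beeline> SELECT SUM(view_count)
         FROM wmf.pageview_hourly
         WHERE year = 2018 AND month = 9 AND day = 1;
```

The wrapper supplies the JDBC connection string for you, so no `!connect`
step should be needed.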
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering "stat1004 whaaat?" - there should be an announcement
coming up about it soon!)
Tomorrow, Sept 27th at 10:00 CEST, db1108 (alias analytics-slave) will be
down for a brief maintenance (max 30 mins) for MariaDB and Linux kernel
upgrades.
This means that the log database will not be available for querying during
this time frame. Please reach out to me or to the Analytics team if this
impacts your work (elukey or #wikimedia-analytics on IRC Freenode).
The Analytics team needs to replace the Hadoop master node hosts
(analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of a
regular hardware refresh (the hosts are going out of warranty). In order to
do things safely, we decided to proceed with a full cluster shutdown on
Sept 25th at 10 AM CEST. The maintenance should last a couple of hours,
and afterwards there shouldn't be any noticeable change for Hadoop users.
This means that during the maintenance:
- HDFS will not be available
- Yarn will not be available
- Hive/Spark (cluster mode)/Oozie/etc. will not be available
Please let us know if this impacts your work in
https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics
Freenode IRC channel.
Thanks a lot!
Hi there --
I've been doing some analysis using the raw pageviews table in Hive, in
order to try to understand the effect that adding a sitemap to
it.wikipedia.org had on traffic. As part of this analysis, I created
three temporary tables. But, of course, those tables only exist within the
context of my own session, which is sub-optimal since I'm not the only one
trying to understand this.
What's the best way to go about persisting these tables? I can SELECT INTO
to move the data into a non-temp table, but I don't want to do so if
there's a better approach.
(They'll probably need to stick around for about 2 weeks, I would guess,
and each of the three tables in question is about 5 million rows with three
columns each (a string and two ints).)
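Concretely, what I have in mind would look roughly like this (the database
and table names below are placeholders, not real objects on the cluster):

```sql
-- Placeholder names: "my_db" would be a personal database, and
-- "itwiki_sitemap_views_tmp" stands in for one of my temp tables.
CREATE TABLE my_db.itwiki_sitemap_views
STORED AS PARQUET AS
SELECT * FROM itwiki_sitemap_views_tmp;

-- ...and roughly 2 weeks later, once the analysis is done:
DROP TABLE my_db.itwiki_sitemap_views;
```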
I'm excited to share that our annual survey about Wikimedia communities is
now published. This survey included 170 questions and reached over 4,000
community members in four audiences: Contributors, Affiliate Organizers,
Program Organizers, and Volunteer Developers. This survey helps us hear
about the experiences of Wikimedians from across the movement so that teams
are able to use community feedback in their planning and their work. This
survey also helps us learn about long-term changes in communities, such as
community health.
The report is available on meta:
For this survey, we worked with 11 teams to develop the questions. Once the
results were analyzed, we spent time with each team to help them understand
their results. Most teams have already identified how they will use the
results to help improve their work to support you.
The report could be useful for your work in the Wikimedia movement as well!
What are you learning from the data? Take some time to read the report and
share your feedback on the talk pages. We have also published a blog post
that you can read.
We are hosting a livestream presentation on September 20 at 1600 UTC.
Hope to see you there!
Feel free to email me directly with any questions.
All the best,
Evaluation Strategist, Surveys
Learning & Evaluation
The next Wikimedia Research Showcase will be live-streamed Wednesday,
September 19 2018 at 11:30 AM (PDT) 18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=OY8vZ6wES9o
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here.
Hope to see you there!
This month's presentations are:
The impact of news exposure on collective attention in the United States
during the 2016 Zika epidemic
By *Michele Tizzoni, André Panisson, Daniela Paolotti, Ciro Cattuto*

In recent years, many studies have drawn attention
to the important role of collective awareness and human behaviour during
epidemic outbreaks. A number of modelling efforts have investigated the
interaction between the disease transmission dynamics and human behaviour
change mediated by news coverage and by information spreading in the
population. Yet, given the scarcity of data on public awareness during an
epidemic, few studies have relied on empirical data. Here, we use
fine-grained, geo-referenced data from three online sources - Wikipedia,
the GDELT Project and the Internet Archive - to quantify population-scale
information seeking about the 2016 Zika virus epidemic in the U.S.,
explicitly linking such behavioural signal to epidemiological data.
Geo-localized Wikipedia pageview data reveal that visiting patterns of
Zika-related pages in Wikipedia were highly synchronized across the United
States and largely explained by exposure to national television broadcast.
Contrary to the assumption of some theoretical models, news volume and
Wikipedia visiting patterns were not significantly correlated with the
magnitude or the extent of the epidemic. Attention to Zika, in terms of
Zika-related Wikipedia pageviews, was high at the beginning of the
outbreak, when public health agencies raised an international alert and
triggered media coverage, but subsequently exhibited an activity profile
that suggests nonlinear dependencies and memory effects in the relationship
between information seeking, media pressure, and disease dynamics. This
calls for a new and more general modelling framework to describe the
interaction between media exposure, public awareness, and disease dynamics
during epidemic outbreaks.
Deliberation and resolution on Wikipedia: A case study of Requests for
Comments
By *Amy Zhang, Jane Im*

Resolving disputes in a timely manner is
crucial for any online production group. We present an analysis of Requests
for Comments (RfCs), one of the main vehicles on Wikipedia for formally
resolving a policy or content dispute. We collected an exhaustive dataset
of 7,316 RfCs on English Wikipedia over the course of 7 years and conducted
a qualitative and quantitative analysis into what issues affect the RfC
process. Our analysis was informed by 10 interviews with frequent RfC
closers. We found that a major issue affecting the RfC process is the
prevalence of RfCs that could have benefited from formal closure but that
linger indefinitely without one, with factors including participants'
interest and expertise impacting the likelihood of resolution. From these
findings, we developed a model that predicts whether
Sarah R. Rodlund
Technical Writer, Developer Advocacy
Hi all you EventLogging users out there!
We will switch EventLogging MySQL ingestion to be based on a schema
whitelist rather than a blacklist.
As you know, we currently import EventLogging events into two locations for
analysis: The MySQL ‘log’ database, and the Hive ‘event’ database. MySQL
is not able to handle high-volume events. We currently blacklist
any schemas that we know have high volumes from being ingested into MySQL.
This can cause problems when a new high volume schema is deployed, as it
requires knowledge and communication from the schema owners to the
Analytics team, and it requires an Analytics Operations engineer to make a
Puppet commit to blacklist the schema. To address this problem, we will
switch the EventLogging MySQL schema blacklist to a whitelist. All schemas
that are actively being ingested into MySQL today will be whitelisted. In
the future, if you want an event schema to be ingested into MySQL, you’ll
need to ask the Analytics team to whitelist it.
Hive has been working for EventLogging analysis for a while now. It has
almost all of the schemas that MySQL does, plus the high volume ones. One
day in the (distant?) future, we’d like to decommission MySQL storage of
events. (Don’t worry yet, MySQL decommissioning has a lot of blockers and
this work is not planned.) By not ingesting events into MySQL by default,
we hope to encourage more users to switch to Hive.
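For anyone making the switch, querying the Hive `event` database is much
like querying the MySQL `log` database, with the addition of time
partitions. A rough sketch (the schema name and partition filter below are
illustrative):

```sql
-- Count one schema's events for September 2018; the year/month
-- partition columns restrict the scan to that time range.
SELECT COUNT(*)
FROM event.navigationtiming
WHERE year = 2018 AND month = 9;
```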
This switch to a whitelist will happen this week. If you are deploying a
new schema and expect it to show up in MySQL, let us know so we can add it
to the whitelist.
-Andrew + the Analytics Engineering team