I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful, but I don't recall seeing it
done before, so it might be of interest!
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
- Andrew Gray
I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently have 3 static file dumps of pageview data available. I will
explain them here, along with my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition which incorrectly counts things like banner views as pageviews.
* PAGECOUNTS-ALL-SITES <http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
adds traffic from the mobile versions of our sites. But it's still using
the same simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
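For context, this improved definition is what the pageview API serves. A
per-article query URL can be sketched like this (the article name and date
range below are illustrative, not from the datasets above):

```python
# Sketch of a Pageview API per-article request URL.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article_url(project: str, article: str, start: str, end: str,
                    access: str = "all-access", agent: str = "user",
                    granularity: str = "daily") -> str:
    """Build a per-article pageview query URL.

    agent="user" asks the API to exclude the spider/automata traffic
    that the new pageview definition can detect.
    """
    return (f"{BASE}/per-article/{project}/{access}/{agent}/"
            f"{article}/{granularity}/{start}/{end}")

url = per_article_url("en.wikipedia", "Main_Page", "20150701", "20150731")
```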
All three datasets are in the same format (Domas's archive format).
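For anyone scripting against these dumps: to my understanding, each line of
a pagecounts file holds four space-separated fields (project code, URL-encoded
page title, request count, bytes transferred). A minimal parser sketch, with
the field names being my own labels:

```python
from typing import NamedTuple

class PagecountLine(NamedTuple):
    project: str  # wiki/project code, e.g. "en" or "en.m"
    title: str    # URL-encoded page title (spaces appear as underscores)
    count: int    # number of requests in the hour
    nbytes: int   # total bytes transferred

def parse_line(line: str) -> PagecountLine:
    """Parse one line of a pagecounts dump into a typed record."""
    project, title, count, nbytes = line.rstrip("\n").split(" ")
    return PagecountLine(project, title, int(count), int(nbytes))
```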
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
propose:
Combine pagecounts-raw with pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data into this dataset indefinitely, but remove
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would note on
dumps.wikimedia.org/other that this dataset gains mobile traffic starting
in October 2014, which explains the relative local spike at that point.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.
Hi analytics list,
In the past months, the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves. We also ask for feedback, to make sure we can
continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic that originates from bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize a part of that traffic with regular expressions,
but we cannot recognize all of it, because some bots do not identify
themselves as such. If we could identify a greater part of the bot traffic,
we could also better isolate the human traffic and produce more accurate
statistics.
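As a rough illustration of the regex-based recognition mentioned above (the
patterns and function name here are simplified examples of mine, not the
actual production rules):

```python
import re

# Simplified example patterns; the real spider lists are far more extensive.
GENERIC_SPIDER = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Heuristically classify a user-agent string as bot traffic."""
    # The WikimediaBot marker is matched case-sensitively, per the convention.
    return "WikimediaBot" in user_agent or bool(GENERIC_SPIDER.search(user_agent))
```

A self-identifying user-agent like "MyTool/1.0 WikimediaBot" would be caught
here, while a bot hiding behind a plain browser string would not, which is
exactly the gap the convention tries to close.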
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that execute ad-hoc commands on behalf of a human, like
browsers. Also, well-known spiders that are otherwise recognizable by their
well-known user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
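A minimal sketch of a client following the convention, using Python's
stdlib urllib (the tool name and contact URL are hypothetical placeholders):

```python
import urllib.request

# User-agent following the WikimediaBot convention. The tool name and
# contact details below are hypothetical examples.
USER_AGENT = (
    "MyStatsTool/1.0 (https://example.org/my-stats-tool; ops@example.org) "
    "WikimediaBot"
)

def make_request(url: str) -> urllib.request.Request:
    """Build a request tagged so Wikimedia can classify it as bot traffic."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = make_request("https://en.wikipedia.org/w/api.php?action=query&format=json")
# Note the marker is case-sensitive: "wikimediabot" would NOT count.
```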
So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in 2 weeks we'll create a documentation page
in Wikitech, send an email to the proper mailing lists and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention for bots that EDIT Wikimedia wikis.
*Marcel Ruiz Forns*
Hi Analytics folks,
My understanding is that the new pageview definition, which excludes
automata to a certain extent, is now published. I have a few questions:
1. Has stats.grok.se already transitioned to the new definition, or will it?
2. Is there a replacement for stats.grok.se planned or already available? A
reliable substitute would be great, and it would be nice if we could either
replace the existing on-wiki "page view statistics" link or add a
supplemental link to the new resource.
Apologies if this information was already published and I missed it.
Hi Analytics fellows,
We are experiencing issues with loading data into the Hadoop cluster,
which is blocking the full job pipeline.
Once fixed, the cluster will be heavily loaded while it catches up, so
please be nice to it and don't run heavy jobs in the next few hours.
We'll keep you posted on the resolution.
Many thanks, and sorry for the inconvenience.
In order to convert tables on db1046 to the TokuDB engine, we have to
schedule some downtime on the Eventlogging databases from tomorrow,
Thursday Jan 21, 2016 at 16:00 UTC, to Monday Jan 25, 2016 at 16:00 UTC.
What this means for EL users:
1. Eventlogging will still receive data, and it will be available in Kafka.
The data will continue to be imported into Hadoop and into the log files.
2. The Mysql consumers of Eventlogging will be stopped, so no data will get
imported into the master (db1046 or m4-master) and, by extension, into the
replicas.
3. Querying existing data on analytics-store will still be possible, but
data for the next 4 days won't be available.
4. On Monday, after the maintenance window we'll restart the Mysql
consumers, and all the data should get reimported from Kafka.
Analytics and Ops (DBA) will work on this together.
Feel free to reach out to us here or on #wikimedia-analytics if you have
any questions.
-- Madhu Viswanathan
Software Engineer, Analytics