We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
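As a quick illustration of the second use case, here's a minimal Python sketch; it assumes a tab-separated dump with prev, curr and n columns and a made-up file name, so check the figshare page for the exact schema of this release:

    import csv
    from collections import defaultdict

    # Minimal sketch: most common links people followed to an article.
    # Assumes tab-separated (prev, curr, n) columns and a made-up file
    # name; the real schema is documented on the figshare page.
    clicks_to = defaultdict(list)  # article -> [(referer, count), ...]

    with open("2015_01_clickstream.tsv", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row, if the file has one
        for prev, curr, n in reader:
            clicks_to[curr].append((prev, int(n)))

    # Ten most common ways readers reached "London" in January 2015
    for referer, count in sorted(clicks_to["London"],
                                 key=lambda pc: pc[1], reverse=True)[:10]:
        print(referer, count)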
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31 December data. Nothing very insightful, but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31 October to 10 December 2015. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large and we are not sure it is even used.
Can you confirm either way? If it is no longer used, we will stop collecting
it.
Thanks,
Nuria
I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently have 3 available static file dumps of pageview data. I will
describe them here and share my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki page.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition that incorrectly counts things like banner views as pageviews.
* PAGECOUNTS-ALL-SITES
<http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
adds traffic from the mobile versions of our sites. But it's still using
the same simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
All three datasets are in the same format (Domas's archive format).
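For anyone new to that format: each line of an hourly file has four space-separated fields (project, page title, request count, bytes transferred). A minimal Python sketch, with a made-up file name:

    import gzip
    from collections import Counter

    # Each line looks like: en Main_Page 242332 4737756101
    # (project, page title, hourly request count, bytes transferred).
    counts = Counter()
    with gzip.open("pagecounts-20150101-000000.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, views, _size = parts
            if project == "en":
                counts[title] += int(views)

    print(counts.most_common(10))  # top English Wikipedia titles that hour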
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
take:
Combine pagecounts-raw and pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data into this dataset forever, but retire
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would note on
dumps.wikimedia.org/other that this dataset gains mobile data starting
in October 2014, to explain the relative local spike that happens there.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.
Hi analytics list,
In the past months the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we ask for feedback to be sure we can
continue with the next steps.
What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic generated by bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such.
we could also better isolate the human traffic and permit more accurate
analyses.
Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.
Who should NOT follow the convention?
Computer programs that follow the ad-hoc, on-site commands of a human,
like browsers, and well-known spiders that are already recognizable by
their user-agent strings.
How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.
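For example, here's a minimal Python sketch of a client following the convention; the script name, URL and contact address are made up, only the "WikimediaBot" keyword matters:

    import requests

    # Hypothetical client: the user-agent just needs to contain the
    # case-sensitive word "WikimediaBot" somewhere.
    session = requests.Session()
    session.headers["User-Agent"] = (
        "ExampleFetcher/0.1 WikimediaBot "
        "(https://example.org/bot; ops@example.org)"
    )

    resp = session.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "Main Page", "format": "json"},
    )
    resp.raise_for_status()

    # On the analytics side, detection is a simple case-sensitive check:
    assert "WikimediaBot" in session.headers["User-Agent"]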
So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in 2 weeks we'll create a documentation page
on Wikitech, send an email to the proper mailing lists, and maybe write a
blog post about it.
Thanks a lot!
(*) There is already another convention[2] for bots that EDIT Wikimedia
content.
[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[2] https://www.mediawiki.org/wiki/Manual:Bots
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Hi Analytics folks,
My understanding is that the new pageview definition, which excludes
automata to a certain extent, is now published. I have a few questions:
1. Has stats.grok.se already transitioned to the new definition, or will it be?
2. Is there a replacement for stats.grok.se planned or already available? A
reliable substitute would be great, and it would be nice if we could either
replace the existing on-wiki "page view statistics" link or add a
supplemental link to the new resource.
Apologies if this information was already published and I missed it.
Thanks,
Pine
Hi Analytics fellows,
We are experiencing issues with loading data into the Hadoop cluster,
which is blocking the full job pipeline.
When fixed, the cluster will be heavily loaded while it catches up, so
please be nice to it and don't run heavy jobs in the next few hours.
We'll keep you posted about resolution.
Many thanks, and sorry for the inconvenience.
Joseph
Hi all,
In order to convert tables on db1046 to the TokuDB engine, we have to
schedule some downtime on the EventLogging databases from tomorrow,
Thursday, Jan 21, 2016 at 16:00 UTC to Monday, Jan 25, 2016 at 16:00 UTC.
What this means for EL users:
1. EventLogging will still receive data and it will be available in Kafka.
The data will continue to be imported into Hadoop and into files.
2. The MySQL consumers of EventLogging will be stopped, so no data will get
imported into the master (db1046, a.k.a. m4-master) and, by extension, into
the analytics-store.
3. Querying existing data from analytics-store will still work, but
data for the next 4 days won't be available.
4. On Monday, after the maintenance window, we'll restart the MySQL
consumers, and all the data should get reimported from Kafka.
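For the curious, reading the buffered events straight from Kafka during the window might look roughly like this with the kafka-python library; the broker and topic names below are placeholders, not our production config:

    from kafka import KafkaConsumer
    import json

    # Rough sketch only: broker and topic are placeholder names, and the
    # real EventLogging setup may consume and validate events differently.
    consumer = KafkaConsumer(
        "eventlogging-valid-mixed",
        bootstrap_servers=["kafka1001.example:9092"],
        auto_offset_reset="earliest",  # replay events buffered so far
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        print(event.get("schema"), event.get("timestamp"))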
Analytics and Ops (DBA) will work on this together.
Feel free to reach out to us here or on #wikimedia-analytics if you have
any concerns/questions.
-- Madhu Viswanathan
Software Engineer, Analytics