We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
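The uses above can be sketched in a few lines. This is a minimal illustration only: the row values below are made up, and the real dataset is a TSV whose exact column names (e.g. prev/curr/n) should be checked against the release itself.

```python
from collections import defaultdict

# (referer, article, count) rows; the numbers here are fabricated for
# illustration, not taken from the actual dataset.
rows = [
    ("other-google", "London", 1000),
    ("Hannibal", "London", 300),
    ("London", "England", 700),
    ("London", "Hannibal", 200),
]

out_counts = defaultdict(dict)  # referer -> {article: clicks}
in_counts = defaultdict(int)    # article -> total clicks arriving at it
for prev, curr, n in rows:
    out_counts[prev][curr] = out_counts[prev].get(curr, 0) + n
    in_counts[curr] += n

def top_links_from(article):
    """Most frequently clicked links on a given article's page."""
    return sorted(out_counts.get(article, {}).items(), key=lambda kv: -kv[1])

def transition_probs(article):
    """One row of a Markov chain over articles: P(next | current)."""
    links = out_counts.get(article, {})
    total = sum(links.values())
    return {a: n / total for a, n in links.items()}

print(top_links_from("London"))  # most-clicked outgoing links
print(in_counts["London"])       # total recorded traffic into the article
```

The same aggregation, run over the full 22 million pairs, gives the per-article link rankings and the transition matrix of the Markov chain mentioned above.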
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data sets and
frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
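To make the proposal concrete, here is a rough sketch of producing one such anonymized record. The raw log line, its field order, and the field names are hypothetical stand-ins, not the actual Wikimedia cache log format; the point is only that the client IP is dropped and fields 1-7 are kept.

```python
# Hypothetical raw cache log line (field layout invented for illustration):
# host seq timestamp client_ip cache_status size url update_flag
raw_line = ("cp1054 1234 2016-02-20T12:00:00 203.0.113.7 "
            "hit/200 5120 http://en.wikipedia.org/wiki/Main_Page -")

def anonymize(line, counter):
    """Map a raw log line to the proposed 7-field anonymized record."""
    host, _seq, ts, _client_ip, cache_status, size, url, update_flag = line.split()
    # The client IP is discarded entirely; only fields 1-7 survive:
    # counter, timestamp, URL, update flag, cache hostname, cache_status, size.
    return (counter, ts, url, update_flag, host, cache_status, int(size))

record = anonymize(raw_line, 1)
print(record)
```

Keeping the cache hostname, cache_status, and response size would let caching researchers reconstruct hit ratios and byte-weighted statistics without any client-identifying information.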
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful but I don't recall seeing it
done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large and we are not sure it is even used.
Can you confirm either way? If it is no longer used we will stop collecting
it.
Thanks,
Nuria
I notice you mention in a lot of places that people should contact an administrator. What if the person they are complaining about is an admin? I have seen admins violate several of these rules of conduct, and it's rare for anything to be done about it.
Reguyla
Sent from my T-Mobile 4G LTE device
------ Original message ------
From: Matthew Flaschen
Date: Tue, Feb 23, 2016 8:45 PM
To: Wikitech List; Engineering List; Design List; Wiki Research List; Analytics Public List; hackathonorganizers@lists.wikimedia.org
Subject: [Analytics] Please provide feedback on suggested improvements to the Code of Conduct
Thanks to everyone who’s helped work on the Code of Conduct so far.

People have brought up issues they feel were missed when working on "Unacceptable behavior" ( https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Unacceptable_behavior ) and "Report a problem" ( https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Report_a_problem ). Consultants have also suggested changes in these same sections.

These are important sections, so please take a look at the proposed changes ( https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Suggested_changes ). I apologize that this feedback arrived later than planned, but I think this will create a better document.

If you prefer to give your opinion privately, feedback via e-mail is welcome at conduct-discussion(a)wikimedia.org.

Thanks,
Matt Flaschen
Hello there,
I'm a beginner programmer working on a project using Wikipedia page views, trying to reproduce a paper. I was pulling JSON from http://stats.grok.se, but the last month isn't updated. I'm trying to get the last 30 days of Wikipedia views for several pages, but the hourly files are 70 MB compressed. Is it possible to query the specific data I need directly through some kind of Wikipedia database?
Otherwise, regarding the files (located at http://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/), how should they be opened/accessed? I'm using a high-level language, so working with files this large sounds like it will break my computer.
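For what it's worth, those hourly files can be read line by line without ever holding a whole file in memory. A minimal sketch, assuming the pagecounts-raw line layout of "project page_title count bytes" (worth verifying against the dumps documentation) and using a tiny fabricated sample instead of a real download:

```python
import gzip
import io

# Fabricated stand-in for one gzipped hourly pagecounts file.
sample = b"""en Main_Page 4200 123456
en London 310 99999
de London 12 3456
"""
compressed = gzip.compress(sample)

wanted = {("en", "London"), ("en", "Main_Page")}
counts = {}
# gzip.open accepts a file path the same way; lines are decompressed and
# decoded one at a time, so even a 70 MB .gz file needs only a little memory.
with gzip.open(io.BytesIO(compressed), "rt") as f:
    for line in f:
        project, title, count, _bytes = line.split()
        if (project, title) in wanted:
            counts[(project, title)] = counts.get((project, title), 0) + int(count)

print(counts)
```

Summing the per-hour counts for the pages of interest across the month's files would give the 30-day totals without any bulk database access.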
Like I said, I'm fairly new to this all so I apologize in advance if these questions seem silly.
Thank you for any help,
Dominic Della Sera
Hi all,
Ops is working on upgrading our web caching software from Varnish 3 to
Varnish 4. At the moment, this would mean we would lose webrequest logs,
since our webrequest logging software (varnishkafka) is incompatible with
Varnish 4. There is an effort to fix this incompatibility, but Ops would
like to be able to test Varnish 4 out in production in a limited fashion
without having to worry about this blocker.
Question: Do any of you use the webrequest_source=‘misc’ partition for
production analysis? If this Hive partition did not have data for a
limited period of time, would this break anyone’s data?
If not, then Ops would like to use the misc cache cluster for testing this
change.
Thanks!
-Andrew & Luca
Hi all – heads up that we extended the submission deadline for the Wiki
Workshop at ICWSM '16 to *Wednesday, March 3, 2016*. (The second deadline
remains unchanged: March 11, 2016).
You can check the workshop's website
<http://snap.stanford.edu/wikiworkshop2016/> for submission instructions or
follow us at @wikiworkshop16 <https://twitter.com/wikiworkshop16> for live
updates.
Looking forward to your contributions.
Dario
Hiya,
We’re ready to upgrade the Analytics Cluster to CDH 5.5. To do so, we need
to schedule a maintenance period during which we can stop all Hadoop
related services. This includes Hive, Oozie, Spark, etc.
I’d like to plan this for Tuesday February 23rd starting at 14:00 UTC
(09:00 US east coast, 06:00 US west coast). We’ve practiced this upgrade a
few times in labs now, and I don’t foresee any issues. I predict that it
will take us no more than 2 hours to finish, but just in case I’d like to
reserve 8 hours for this.
Please plan on not using the Analytics Cluster between 14:00 and 22:00 on
February 23rd. I will update this thread again when we are about to start,
and when we are finished.
Progress is being tracked here: https://phabricator.wikimedia.org/T119646
What we get:
-
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_new_i…
-
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rn_fixed…
Thanks all!
-Andrew + Analytics team