We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
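For the last use case, the (referer, article, count) rows map directly onto a Markov chain's transition probabilities. A minimal sketch (the rows below are invented examples, not actual January 2015 counts):

```python
from collections import defaultdict

# Invented example rows: (referer, article, count)
rows = [
    ("Main_Page", "London", 120),
    ("Main_Page", "Paris", 80),
    ("London", "Paris", 50),
]

# Total clicks out of each referer
totals = defaultdict(int)
for referer, article, count in rows:
    totals[referer] += count

# P(article | referer) = count / total clicks out of that referer
transitions = {
    (referer, article): count / totals[referer]
    for referer, article, count in rows
}

print(transitions[("Main_Page", "London")])  # 0.6
```

Each transition probability is just a pair's count normalized by the referer's outgoing total, so the probabilities out of any referer sum to 1.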
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream <https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream>
Ellery and Dario
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large, and we are not sure it is even used.
Can you confirm either way? If it is no longer used, we will stop collecting
it.
Thanks,
Nuria
I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently have 3 available static file dumps of pageview data. I will
explain them here and explain my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition that incorrectly counts, for example, banner views as
pageviews.
* PAGECOUNTS-ALL-SITES
<http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
adds traffic from the mobile versions of our sites. But it's still using
the same simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
All three datasets are in the same format (Domas's archive format).
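The hourly files in that format are space-separated per-page view counts; a minimal parser, assuming the commonly documented field layout of project code, page title, view count, and bytes transferred (the example line is made up):

```python
# Parse one line of an hourly pagecounts file. Lines look like
# "en Main_Page 42 1234567": project code, page title, view count,
# bytes transferred. The field layout is as commonly documented for
# these dumps; the example line is invented.
def parse_pagecounts_line(line: str):
    project, title, views, num_bytes = line.strip().split(" ")
    return project, title, int(views), int(num_bytes)

print(parse_pagecounts_line("en Main_Page 42 1234567"))
# ('en', 'Main_Page', 42, 1234567)
```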
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
take:
Combine pagecounts-raw and pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data for this dataset indefinitely, but retire
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would explain
on dumps.wikimedia.org/other that this dataset gains mobile data starting
in October 2014, which accounts for the local spike that appears there.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.
Hi,
I am interested to know if Wikipedia makes public how many backlinks each page gets.
I am working on a search engine for Wikipedia, and, as you would expect, it sucks.
So I went and tested the same searches directly on Wikipedia, and no offence, they suck even more.
So I went to Google and performed the same searches with site:wikipedia.org added, and Google was a little bit better (although not much, compared with my 1-day-development search engine).
I want to make my Wikipedia search better, and having a table that tells me how many non-Wikipedia pages point to a given Wikipedia page might improve my algorithm.
Does anyone know if Wikipedia publishes such data?
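For internal links, the kind of backlink table I have in mind could be derived from (source, target) link pairs, e.g. parsed out of the pagelinks SQL dump; a rough sketch (the link pairs below are invented examples):

```python
from collections import Counter

# Invented example (source, target) link pairs; in practice these could
# be parsed out of e.g. the enwiki pagelinks SQL dump (internal links only).
links = [
    ("Paris", "France"),
    ("Lyon", "France"),
    ("France", "Paris"),
]

# backlinks[page] = number of pages linking to it
backlinks = Counter(target for _, target in links)

print(backlinks["France"])  # 2
```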
Thank you!
Edison Nica
http://www.0pii.com
Edisonn(a)0pii.com
Sent from my T-Mobile 4G LTE Device
Team:
The MobileWikiAppShareAFact schema is sending a lot of events; it may be
worth thinking about whether we need that many. It is again a case where
tables are becoming huge and slow to query.
cc-ing Jon as schema owner.
Could this data be sampled more aggressively? I have filed a ticket about
this:
https://phabricator.wikimedia.org/T122224
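(For illustration, one way to sample deterministically per session; this sketch is an assumption about the mechanism, not EventLogging's production implementation:)

```python
import hashlib

def sampled_in(session_id: str, rate: float = 0.1) -> bool:
    """Keep roughly `rate` of sessions, decided deterministically per session."""
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

# A given session is either always or never sampled, so event volume drops
# without fragmenting any one session's data.
```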
Thanks,
Nuria
On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
> Replacing mobile-tech with mobile-l (internal mobile-tech list
> discontinued).
>
>
> On Tuesday, December 22, 2015, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> Team:
>>
>> As part of our effort to convert the eventlogging MySQL database to the
>> TokuDB engine, we need to stop eventlogging events from flowing into the
>> MobileWikiAppShareAFact table. We are using this one table to see how long
>> the conversion will take, in order to plan for a larger outage window.
>>
>>
>> Let us know whether the data should be backfilled, as it can be; we
>> anticipate events will not flow into the table for the better part of a day.
>>
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>
>
Hi Analytics,
The report card site's most recent data for the unique visitors stats is
from May 2015. Will this be updated in the future?
Also, the information shown on the "New Editors Per Month for All Wikimedia
Projects" chart goes back only to late 2012. Is there a way to get the data
for that chart all the way back to 2001? I can pull the tables for all
Wikipedias back to 2001 from the report cards site, but I can't pull the
tables for all Wikimedia projects back to 2001 AFAIK.
Thanks!
Pine
Hi,
I see that the (amazing!) API still can't give us results for the whole of
2015. Is there any way we can get these page views per project? And also,
the most edited articles in 2015 per project?
This could be great PR information for the communication representatives
around the world to release to local journalists.
Regards,
Itzik Edri
Chairperson, Wikimedia Israel
+972-(0)-54-5878078 | http://www.wikimedia.org.il
Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment!
Hey folks.
I'm looking at:
https://dumps.wikimedia.org/other/pagecounts-all-sites/2015/
Can anyone tell me where I'd find these files via stat1003? I'm pretty
sure I'm getting the pagecount dumps in /mnt/data/pagecounts/, but maybe
I'm mistaken.
-Aaron
Hi all,
Soon, we will be merging the mobile web cache requests with the text cache requests. The text caches will then serve requests for the mobile web[1].
This means that the webrequest_source=‘mobile’ partition in the webrequest table in Hive will soon be empty, and all data that was previously in it will be found in the webrequest_source=‘text’ partition.
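Consumers that currently filter on the 'mobile' partition would then need to read the 'text' partition and distinguish mobile traffic another way, e.g. by hostname. A rough sketch (the hostname heuristic below is an illustrative assumption, not the production pageview definition):

```python
def is_mobile_host(uri_host: str) -> bool:
    # Mobile web domains conventionally contain ".m."
    # (e.g. en.m.wikipedia.org vs en.wikipedia.org).
    # Illustrative heuristic only, not the production definition.
    return ".m." in uri_host

print(is_mobile_host("en.m.wikipedia.org"))  # True
print(is_mobile_host("en.wikipedia.org"))    # False
```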
There are only 3 datasets that currently only use the webrequest_source=‘mobile’ partition:
- /a/log/webrequest/archive/mobile
- /a/log/webrequest/archive/5xx-mobile
- /a/log/webrequest/archive/zero
(These are paths on stat1002, but they also exist in HDFS.)
These datasets originally came from udp2log, but since early last year they have been generated from Hadoop. With the upcoming cache merge, these jobs will have to parse through all text requests, which will make Hadoop busier.
Do we know if these are being used? Would anyone be upset if we no longer generated these datasets?
Thanks!
-Andrew
[1] https://phabricator.wikimedia.org/T109286