Hey all!
We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
Hope it's useful to people!
Very nice. Do you think that you could pick out a few of your favorite graphs and add them to this week's Recent Research report in a gallery?
Thanks! Pine
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Totally! I'm also going to get together with some NEU hackers tomorrow and work on actually visualising the data on *drumroll* maps, which'd probably be more interesting eye candy than infinite bar plots :)
On 25 February 2015 at 16:19, Pine W wiki.pine@gmail.com wrote:
Very nice. Do you think that you could pick out a few of your favorite graphs and add them to this week's Recent Research report in a gallery?
Thanks! Pine
Excellent!
Pine
On Feb 25, 2015 1:26 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:
Totally! I'm also going to get together with some NEU hackers tomorrow and work on actually visualising the data on *drumroll* maps, which'd probably be more interesting eye candy than infinite bar plots :)
Great job.
Who knew Esperanto was big in Japan and China at #2 and #3?
On Wed, Feb 25, 2015 at 4:06 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey all!
We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
Hope it's useful to people!
-- Oliver Keyes Research Analyst Wikimedia Foundation
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
The one major caveat, I think, is that proportionate data makes small projects very vulnerable to artificial traffic spikes. I'd go out on a limb and say that some of the massive bumps in popularity we see in particular combinations are likely due to either undetected automata or simply a project having so little traffic that a small number of people can sway the results outlandishly.
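To make the caveat above concrete, here is a minimal sketch with made-up numbers. The country codes, pageview counts and the size of the spike are all hypothetical and are not taken from the released dataset; the point is only that the same absolute bump moves a small project's proportions far more than a large one's.

```python
def country_shares(pageviews_by_country):
    """Return each country's percentage share of a project's pageviews."""
    total = sum(pageviews_by_country.values())
    return {country: 100.0 * views / total
            for country, views in pageviews_by_country.items()}


def share_after_spike(pageviews_by_country, country, extra_views):
    """Share of `country` after adding `extra_views` artificial pageviews."""
    bumped = dict(pageviews_by_country)
    bumped[country] = bumped.get(country, 0) + extra_views
    return country_shares(bumped)[country]


# Hypothetical counts: identical proportions, very different volumes.
small_project = {"AA": 600, "BB": 300, "CC": 100}
large_project = {"AA": 600_000, "BB": 300_000, "CC": 100_000}

EXTRA_VIEWS = 500  # hypothetical burst from one crawler or a few heavy readers

for name, project in (("small", small_project), ("large", large_project)):
    before = country_shares(project)["CC"]
    after = share_after_spike(project, "CC", EXTRA_VIEWS)
    print(f"{name}: CC share {before:.2f}% -> {after:.2f}%")
```

With these invented inputs the small project's CC share goes from 10% to 40%, while the large project's stays close to 10%.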
On 25 February 2015 at 16:32, Andrew Lih andrew.lih@gmail.com wrote:
Great job.
Who knew Esperanto was big in Japan and China at #2 and #3?
I am surprised that the new data, with crawlers excluded, show more wp:en traffic from the US (43%) than the old data (36.4% for 2014), which contained much crawler traffic, presumably most of it from the US.
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Any thoughts?
Erik
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Wednesday, February 25, 2015 22:37
To: Research into Wikimedia content and communities
Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Wiki-research-l] [Release]
The one major caveat, I think, is that proportionate data makes small projects very vulnerable to artificial traffic spikes. I'd go out on a limb and say that some of the massive bumps in popularity we see in particular combinations are likely due to either undetected automata or simply a project having so little traffic that a small number of people can sway the results outlandishly.
Yours is looking at just December, while mine is looking at the entire year, for starters. Also, what's the apps/mobile web inclusion for that report?
On 25 February 2015 at 17:34, Erik Zachte ezachte@wikimedia.org wrote:
I am surprised that the new data, with crawlers excluded, show more wp:en traffic from the US (43%) than the old data (36.4% for 2014), which contained much crawler traffic, presumably most of it from the US.
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Any thoughts?
Erik
Not sure about apps. I would have to test that. Mobile is included.
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Wednesday, February 25, 2015 23:46
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Cc: Research into Wikimedia content and communities
Subject: Re: [Analytics] [Wiki-research-l] [Release]
Yours is looking at just December, while mine is looking at the entire year, for starters. Also, what's the apps/mobile web inclusion for that report?
Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageB...
Ironholds' looks more vulnerable to bots; it's easier to see in small wikis (though, kudos! many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo
Great work!
One way to analyze this kind of geolinguistic aggregate further is to do some data normalization, or geographic normalization, as demonstrated in my previous work: http://www.opensym.org/os2014-files/proceedings/p611.pdf
Anyone is welcome to do some data normalization using the geolinguistic size indicators here: https://github.com/hanteng/pyGeolinguisticSize/blob/master/size_geolinguistic.tsv
Currently, it has estimates of population (LP), Internet users (IPop), economy size (PPPGDP), etc., based on an "even distribution" across the percentage share of language population per country, taken from the Unicode CLDR 25 Territory-Language Information.
A simple linear regression can reveal, say, which geo-linguistic, geographic, or linguistic category has a less-than-expected or more-than-expected proportion of viewing traffic, with the expected values being generated according to the sizes of population, Internet population, and economy.
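As a rough illustration of the regression idea above, here is a minimal sketch using ordinary least squares with a single predictor. The row labels and all numbers are hypothetical; they are not read from size_geolinguistic.tsv or from the released traffic data, and the choice of one size indicator (for example an Internet-user share) as the only predictor is an assumption made to keep the example short.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def over_under(rows):
    """rows: (label, size_indicator, traffic_share) tuples.

    Returns {label: observed - expected}; positive means more traffic
    than the fitted line predicts from the size indicator.
    """
    xs = [size for _, size, _ in rows]
    ys = [share for _, _, share in rows]
    slope, intercept = fit_line(xs, ys)
    return {label: share - (intercept + slope * size)
            for label, size, share in rows}


# Hypothetical rows: (category, size indicator in %, traffic share in %).
rows = [
    ("group_a", 10.0, 12.0),
    ("group_b", 20.0, 18.0),
    ("group_c", 30.0, 33.0),
    ("group_d", 40.0, 37.0),
]

for label, residual in over_under(rows).items():
    verdict = "more" if residual > 0 else "less"
    print(f"{label}: {residual:+.1f} points ({verdict} than expected)")
```

The residuals (observed minus expected) are what would flag a category as over- or under-represented; a fuller analysis would presumably use several indicators at once and some measure of uncertainty.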
I hope this great work by Nemo can be extended to cover
(1) time-series report and data release
(2) edits aggregate
Altogether, the tools and datasets will be a major milestone for monitoring language/project development across Wikimedia projects. Congrats!
Best, han-teng liao
2015-02-26 8:31 GMT+01:00 Federico Leva (Nemo) nemowiki@gmail.com:
Erik Zachte, 25/02/2015 23:34:
Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm
Ironholds' looks more vulnerable to bots; it's easier to see in small wikis (though, kudos! many more small wikis are included than in wikistats). For instance, 20 more percentage points for USA on Breton and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese bots they look similar, though in some cases I'm not sure what's going on: for instance als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo
Yes and no. So, we use a slightly more expanded version of the ua-parser bot filtering (for example, detecting automata - wget and Twisted Pagegetter are not bots, but they should absolutely be filtered) and a slightly more expanded spider detection approach (there are Wikimedia-specific spiders). To me the greater risk is undeclared automata; I've had quite a lot of success detecting them using various concentration and density indexes, such as the Herfindahl, orienting around {ip,xff} tuples or user agents, but it requires >=1,000 pageviews to a particular URL to be useful.
So, there is more we can do - but it becomes complex and computationally intensive, and requires constant hand-coding to maintain. I have much sympathy for whoever it is in R&D who has to absorb my work, because a lot of it is maintaining things like this, and pageviews are of limited utility for most purposes without this kind of filtering.
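For readers unfamiliar with the concentration-index approach mentioned above, here is a minimal sketch of a Herfindahl index computed over {ip,xff} tuples per URL. It is not the actual filtering code: the input layout (url, ip, xff), the function names and the synthetic client labels are assumptions made for illustration, and only the 1,000-pageview minimum comes from the message above. The index is the sum of squared shares, so it approaches 1 when a single source dominates a URL's traffic and 1/N when N sources contribute equally.

```python
from collections import Counter

MIN_PAGEVIEWS = 1000  # from the thread: below this the index is not useful


def herfindahl(counts):
    """Sum of squared shares; 1.0 means a single source produced everything."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        raise ValueError("no pageviews to measure")
    return sum((c / total) ** 2 for c in counts)


def url_concentration(requests, min_pageviews=MIN_PAGEVIEWS):
    """requests: iterable of (url, ip, xff) tuples (hypothetical layout).

    Returns {url: herfindahl index} for URLs with enough pageviews.
    """
    per_url = {}
    for url, ip, xff in requests:
        per_url.setdefault(url, Counter())[(ip, xff)] += 1
    return {url: herfindahl(sources.values())
            for url, sources in per_url.items()
            if sum(sources.values()) >= min_pageviews}


# Synthetic traffic: /wiki/A is spread over many clients, /wiki/B is
# dominated by one client, /wiki/C has too few views to be scored.
requests = (
    [("/wiki/A", f"client-{i}", "") for i in range(1000)]
    + [("/wiki/B", "client-x", "")] * 900
    + [("/wiki/B", f"client-{i}", "") for i in range(100)]
    + [("/wiki/C", "client-y", "")] * 10
)

for url, hhi in sorted(url_concentration(requests).items()):
    print(f"{url}: {hhi:.4f}")
```

What threshold counts as "suspiciously concentrated" is a judgment call and is deliberately left out here; the same function could be keyed on user agents instead of {ip,xff} tuples.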
This is really, really cool, great job guys!
G
Giovanni Luca Ciampaglia
✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag@indiana.edu