hello
I want to find out which browsers are most popular in which parts of the
world, and their relative shares.
However,
the browser names are quite confusing on this page:
https://stats.wikimedia.org/wikimedia/squids/SquidReportCountryBrowser.htm
Mozilla and Firefox are two separate entities? Is one mobile and the
other desktop?
iOS is also separate from iPad and iPhone? Why is that?
thanks
thomas
Thank you, Erik. I am not sure where you fixed the missing country names. Please take a look at the screenshot here:
http://tinypic.com/r/2vanmg0/8
It shows the browser stats for the month of November 2013. You will notice that the first column, Country, has no entries. This happens for each month following September 2013.
To see what I mean by stats for the "Apple" browser take a look at
http://tinypic.com/r/be6b2w/8
The heading for the fifth column from the right-hand side reads Apple. Given that there are already entries for Safari and iOS, it is not clear to me what Apple might mean.
I hope you will be able to help with this.
Atul
Dear Wikimedia,

I am using Wikimedia statistical data and have run into a small issue. For some reason, starting from October 2013 your SquidReportBrowserCountry tables do not indicate the names of the countries, which makes the data impossible to use. I had hoped that I would be able to "guess" the country names by looking at previous tables, but the order of the countries changes quite frequently. I would be most grateful if you could indicate the names of the countries here.

Also, would you mind telling me precisely what is meant by "Apple" in the list of browsers? You have simultaneous entries for iOS, iPad, Safari, etc., so it is not immediately obvious what Apple might mean.

Atul Vaidya
Hi,
The team is focused on reaching its quarterly goals (
https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Analytics
) and part of the team is using Agile Scrum solely for the delivery of
Editor Engagement Vital Signs. Production issues and Refinery development
are handled by the other part of the team (see Adventures in Clusterland
https://lists.wikimedia.org/pipermail/analytics/2014-September/002485.html )
Here’s a summary of the next sprint:
Bug ID  Component    Summary                                                                    Points
67459   Wikimetrics  Story: WikimetricsUser runs 'Rolling New Active Editors' report            8
67460   Wikimetrics  Story: WikimetricsUser runs 'Rolling Surviving New Active Editors' report  13
68822   EEVS         Story: AnalyticsEng has static file with list of projects and metrics     8
68445   EEVS         Story: EEVSUser downloads report with correct Http Cache Headers          5
68142   EEVS         Story: EEVSUser adds/removes a metric/project                              21
That’s 55 points in 5 stories. You can see the sprint here:
http://sb.wmflabs.org/t/analytics-developers/2014-09-16/
cheers,
Kevin Leduc
Hi,
in the week from 2014-09-08 to 2014-09-14, Andrew and Jeff worked on the
following items around the Analytics Cluster and Analytics-related
Ops:
* Logstash logs from Analytics Cluster
* More investigation around analytics1021 partition leader drop-outs
* Feasibility check on upgrading stat1002 to trusty
(details below)
Have fun,
Christian
* Logstash logs from Analytics Cluster
Logging via gelf got enabled again and is now puppetized.
Also, the names of threads in log messages now get normalized, which
makes filtering much easier.
* More investigation around analytics1021 partition leader drop-outs
Logs from recent analytics1021 drop-outs have been analyzed, but no
clear culprit has been identified yet.
* Feasibility check on upgrading stat1002 to trusty
After the stat1003 upgrade to trusty a few weeks back, users asked to
upgrade stat1002 to trusty too. However, stat1002 runs Hadoop clients,
and Cloudera does not yet provide Hadoop packages for trusty, so
upgrading is not entirely straightforward. Currently, the best way
forward seems to be a dist-upgrade that leaves the Hadoop client
packages at precise. This approach worked on a labs test instance, but
it would put stat1002 in version limbo between precise and trusty. Once
another pair of Ops eyes has looked over the approach and agreed to it,
stat1002 can get upgraded.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
I don't think that we keep those logs historically. analytics-l (CC'd)
might have more insights.
Do we have anything more granular than the hourly view logs available here:
https://dumps.wikimedia.org/other/pagecounts-raw/
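For reference, each line in those hourly pagecounts-raw files is space-separated as `project page_title view_count bytes_transferred`. A minimal Python sketch for reading one (the file name in the comment is only an illustrative example of the naming scheme):

```python
import gzip

def top_pages(path, project="en", limit=3):
    """Return the `limit` most-viewed page titles for `project` from one
    hourly pagecounts-raw dump. Each line is space-separated:
    project, page title, view count, bytes transferred."""
    counts = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) != 4 or parts[0] != project:
                continue
            counts.append((int(parts[2]), parts[1]))
    # Sort by view count, descending, and keep only the titles.
    return [title for views, title in sorted(counts, reverse=True)[:limit]]

# Files on dumps.wikimedia.org are named like pagecounts-20140917-100000.gz
# (one file per hour):
# top_pages("pagecounts-20140917-100000.gz", project="en")
```

This also illustrates the limitation under discussion: the timestamp lives only in the file name, so one hour is the finest resolution available from this dataset.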
On Wed, Sep 17, 2014 at 10:39 AM, Valerio Schiavoni <
valerio.schiavoni(a)gmail.com> wrote:
> Hello Aaron,
> 1 hour is way too coarse.
> Let's say 1 second would be OK.
> Is that available?
>
> On Wed, Sep 17, 2014 at 5:23 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
> wrote:
>
>> Hi Valerio,
>>
>> The page counts dataset has a time resolution of one hour. Is that too
>> coarse? How fine of resolution do you need?
>>
>> On Wed, Sep 17, 2014 at 9:44 AM, Valerio Schiavoni <
>> valerio.schiavoni(a)gmail.com> wrote:
>>
>>> Hello Giovanni,
>>> on second thought, I think the Click dataset won't do either.
>>> I've parsed the smaller sample [1], which is said to be extracted from
>>> the bigger one.
>>>
>>> In that dataset there are ~34k entries related to Wikipedia, but they
>>> look like the following:
>>>
>>> {"count": 1, "timestamp": 1257181201, "from": "en.wikipedia.org", "to":
>>> "ko.wikipedia.org"}
>>>
>>> That is, the log only reports the host/domain accessed, but not the
>>> specific URL being requested (to be clear, the one in the HTTP request
>>> issued by the client).
>>>
>>> This is what is of main interest to me.
>>>
>>> Thanks for your interest anyway!
>>> Valerio
>>>
>>>
>>> 1 - http://carl.cs.indiana.edu/data/#traffic-websci14
>>>
>>> On Wed, Sep 17, 2014 at 4:24 PM, Valerio Schiavoni <
>>> valerio.schiavoni(a)gmail.com> wrote:
>>>
>>>> Hello Giovanni,
>>>> thanks for the pointer to the Click datasets.
>>>> I'd have to take a look at the complete dataset to see how many of
>>>> those requests touch Wikipedia.
>>>>
>>>> Then, one of the requirements for accessing that data is:
>>>> "The Click Dataset is large (~2.5 TB compressed), which requires that
>>>> it be transferred on a physical hard drive. You will have to provide the
>>>> drive as well as pre-paid return shipment. "
>>>>
>>>> I have to check whether this is possible and how long it might take
>>>> to ship a hard drive from Switzerland and get it back.
>>>> I'll let you know!
>>>>
>>>> Best,
>>>> Valerio
>>>>
>>>> On Wed, Sep 17, 2014 at 4:09 PM, Giovanni Luca Ciampaglia <
>>>> gciampag(a)indiana.edu> wrote:
>>>>
>>>>> Valerio,
>>>>>
>>>>> I didn't know such data existed. As an alternative, perhaps you could
>>>>> have a look at our click datasets, which contain requests to the Web at
>>>>> large (i.e., not just Wikipedia) generated from within the campus of
>>>>> Indiana University over a period of several months. HTH
>>>>>
>>>>> http://carl.cs.indiana.edu/data/#click
>>>>>
>>>>> Cheers
>>>>>
>>>>> G
>>>>>
>>>>> Giovanni Luca Ciampaglia
>>>>>
>>>>> ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
>>>>> ☞ http://www.glciampaglia.com/
>>>>> ✆ +1 812 855-7261
>>>>> ✉ gciampag(a)indiana.edu
>>>>>
>>>>> 2014-09-17 9:53 GMT-04:00 Valerio Schiavoni <
>>>>> valerio.schiavoni(a)gmail.com>:
>>>>>
>>>>>> Hello,
>>>>>> just bumping my email from last week, since so far I have not
>>>>>> received any answer.
>>>>>>
>>>>>> Should I consider that dataset to be somehow lost?
>>>>>>
>>>>>> I've also contacted the researchers who partially released it, but
>>>>>> making it publicly available is tricky for them due to its size
>>>>>> (12 TB), though that volume is probably within the norm of what
>>>>>> Wikipedia's servers handle daily.
>>>>>>
>>>>>> Thanks again,
>>>>>> Valerio
>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni <
>>>>>>> valerio.schiavoni(a)gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear WikiMedia foundation,
>>>>>>>> in the context of an EU research project [1], we are interested in
>>>>>>>> accessing
>>>>>>>> wikipedia access traces.
>>>>>>>> In the past, such traces were given for research purposes to other
>>>>>>>> groups
>>>>>>>> [2].
>>>>>>>> Unfortunately, only a small percentage (10%) of that trace has
>>>>>>>> been made available.
>>>>>>>> We are interested in accessing the totality of that same trace (or
>>>>>>>> even
>>>>>>>> better, a more recent one, but the same one will do).
>>>>>>>>
>>>>>>>> If this is not the correct mailing list for such requests, could
>>>>>>>> anyone please redirect me to the correct one?
>>>>>>>>
>>>>>>>> Thanks again for your attention,
>>>>>>>>
>>>>>>>> Valerio Schiavoni
>>>>>>>> Post-Doc Researcher
>>>>>>>> University of Neuchatel, Switzerland
>>>>>>>>
>>>>>>>> 1 - http://www.leads-project.eu
>>>>>>>> 2 - http://www.wikibench.eu/?page_id=60
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wiki-research-l mailing list
>>>>>> Wiki-research-l(a)lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
Hi,
in the week from 2014-09-01 to 2014-09-07, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Investigating ways to allow queries across MediaWiki and Hadoop databases
* Deployment of webstatscollector's ulsfo https fix
* Re-run reports due to slave lag
* X-Analytics tag for used PHP engine
* Digging deeper into analytics1021 issues
(details below)
Have fun,
Christian
* Investigating ways to allow queries across MediaWiki and Hadoop databases
Currently, data in Hadoop is fully separated from our wikis'
databases, which makes it hard to query across the two different kinds
of databases, and hence makes researchers' lives harder. Of the
available solutions to overcome this issue, Sqoop seems like a suitable
approach. Sqoop allows importing data from MediaWiki databases into
HDFS and querying it from within Hadoop. We looked at how Sqoop
imports work, and started discussions with researchers on which
imports would be useful and which would not.
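For a rough idea of what such an import looks like, here is a sketch of a single-table Sqoop import into HDFS. The host name, credentials file, and target directory are hypothetical placeholders, not our actual setup:

```shell
# Hypothetical example: pull the enwiki `revision` table into HDFS.
# db-host.example and both paths are placeholders.
sqoop import \
  --connect jdbc:mysql://db-host.example/enwiki \
  --username research \
  --password-file /user/hdfs/.db-password \
  --table revision \
  --target-dir /wmf/data/raw/mediawiki/enwiki/revision \
  --num-mappers 4
```

Once imported, such a directory can be exposed as an external Hive table and joined against data already in the cluster.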
* Deployment of webstatscollector's ulsfo https fix
The fix that stops webstatscollector from counting ulsfo https requests
twice got deployed.
* Re-run reports due to slave lag
The announced schema changes caused more slave lag than some reports
could cope with, so we had to re-run a few reports by hand to make up
for the slave lag.
* X-Analytics tag for used PHP engine
Ops added a “php” tag to the X-Analytics header. This tag makes it
possible to identify which PHP implementation was used to serve a request.
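Since X-Analytics is a semicolon-separated list of key=value pairs, pulling the new tag out of a logged header value can be sketched as follows (a minimal, illustrative parser, not our production code):

```python
def parse_x_analytics(header):
    """Split an X-Analytics header value ("key1=v1;key2=v2") into a dict.
    Fields without '=' are kept with an empty string as value."""
    tags = {}
    for field in header.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        tags[key] = value
    return tags

# e.g. parse_x_analytics("php=hhvm;https=1")["php"] identifies the
# PHP engine that served the request.
```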
* Digging deeper into analytics1021 issues
Despite the recent buffer increases, analytics1021 still fails from
time to time to act as a proper partition leader. Since the failure is
not reproducible manually, debugging is tricky ... and time-consuming.
We added some more monitoring and waited for the issue to re-appear. It
seems that from time to time bursts of disk writes free up lots of
memory on analytics1021. During these write-out phases, the processes
on analytics1021 get starved. If the starvation lasts too long,
analytics1021 gets (correctly) kicked out of the partition leader
role. We now need to find the source of those write bursts, to see if
they are the real issue or just the symptom of a different one.