Hey Analytics!
I'm working on updating the Wikitech Analytics documentation
<https://wikitech.wikimedia.org/wiki/Analytics> based on my new
understanding of the Data Lake. I've already clarified that there's no
separate thing called the "Data Warehouse" (other than some experiments
from 2015), but I still don't understand the difference between the Analytics
Cluster <https://wikitech.wikimedia.org/wiki/Analytics/Cluster> and the Data
Lake <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>.
From what I learned yesterday, the Data Lake is everything stored in the
Hadoop cluster (including pageview, mediacounts, last-access, and edit
history data), even when it can't be usefully joined together.
But that seems to be the same thing as the Analytics Cluster ("the Hadoop
cluster and its related components"). Is it possible to pick one name
("Data Lake" or "Analytics Cluster") and stick with it? I promise you it'll
make the whole system much easier to understand for outsiders :)
--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>,
product analyst
Wikimedia Foundation
Hey folks,
I just got the following data requests emailed to me and I figured that
this list is probably best equipped to answer:
· Number of editors who contribute 1 edit per month?
· Is it possible/feasible to run editor retention metrics globally
(versus just based on a single project)?
· Total number of editors on all projects over the past 16 years
(not just ENWP)?
· Global distribution of editors by region (or country), 2016 (the
last I saw is from 2008 <https://en.wikipedia.org/wiki/File:Enwiki-map.png>
)?
· Average hours spent by editors, by segment (5+ edits and 100+
edits)?
As an update on this thread, the Analytics team is now closing in on making
this happen and providing parsed data like OS and browser directly in the
EventLogging capsule. \o/
It is also planned to remove the raw user agent entirely from the data. So
anyone who has been depending on this for their analysis should make sure
that they can work with the parsed data, and rewrite their queries (cf.
discussion at https://phabricator.wikimedia.org/T153207 ; the Performance
team has already gone through this work:
https://phabricator.wikimedia.org/T156760 ). If I'm reading
https://phabricator.wikimedia.org/T160454 correctly, it will also involve
renaming of existing EL tables.
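For anyone rewriting queries, the change amounts to replacing string matching
on the raw user agent with lookups on already-parsed fields. Below is a
minimal Python sketch of that difference. It assumes, purely for illustration,
that the parsed data arrives as a JSON object with keys such as
`browser_family` and `os_family`; the real field names and format are whatever
T153207 ends up defining, and both helper functions are hypothetical.

```python
import json

# Hypothetical sample values: the key names and JSON layout are assumptions
# for illustration, not the confirmed EventLogging capsule format.
RAW_UA = "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
PARSED_UA = json.dumps(
    {"browser_family": "Firefox", "browser_major": "52", "os_family": "Fedora"}
)


def is_firefox_raw(raw_user_agent: str) -> bool:
    """Old style: substring match on the raw user agent string."""
    return "Firefox/" in raw_user_agent


def is_firefox_parsed(parsed_user_agent: str) -> bool:
    """New style: compare one field of the pre-parsed user agent."""
    fields = json.loads(parsed_user_agent)
    return fields.get("browser_family") == "Firefox"


print(is_firefox_raw(RAW_UA), is_firefox_parsed(PARSED_UA))
```

Once the raw field is gone, only the second style keeps working, so any
analysis built on the first needs an equivalent in terms of the parsed fields.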
On Wed, Dec 14, 2016 at 8:25 AM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
> Thanks to all who responded in this thread!
>
> I have now filed a Phabricator task for augmenting the EventLogging
> capsule with this kind of pre-parsed data alongside the existing raw user
> agent field: https://phabricator.wikimedia.org/T153207
> (Also, as Nuria recalled on Phabricator at the time, this would in
> addition help to address the open issue of filtering out spiders in EL:
> https://phabricator.wikimedia.org/T121550 )
>
> Back in September I ended up continuing to use a rather simplistic regex
> in MySQL for the task at hand (restricting to Firefox UAs to mitigate
> T146840), but the experience only confirmed that it would be much better to
> have browser family etc. detected by the ua-parser library.
>
> On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>>
>> I think we can also probably consider doing the parsing in EL/MySQL so
>> the user agent is never raw on tables but rather always parsed. We could
>> use the python ua parser library and results should be identical to the
>> ones we have on Hive.
>>
>> Thanks,
>>
>> Nuria
>>
>>
>> On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
>>
>>> I’ve added an example to
>>> https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to
>>> use the UAParserUDF and the Hive get_json_object function to work with a
>>> user_agent_map.
>>>
>>> Unfortunately we can’t manage tables in Hive for every EventLogging
>>> schema/revision like we do in MySQL. So, you have to create your own
>>> table. It *should* be possible to specify the schema and use
>>> the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
>>>
>>> Hope that helps!
>>>
>>> On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns
>>> <mforns(a)wikimedia.org> wrote:
>>>
>>>> Just a heads up:
>>>>
>>>> The user_agent field is a PII field (privacy sensitive), and as such it is
>>>> purged after 90 days. If there were a user_agent_map field, it would need
>>>> to be purged after 90 days as well.
>>>>
>>>> Another more permanent option might be to detect the browser family on
>>>> the JavaScript client with e.g. duck-typing[1] and send it as part of the
>>>> explicit schema. The browser family by itself is not identifying enough to
>>>> be considered PII, and could be kept indefinitely.
>>>>
>>>> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>>>>
>>>> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <jane023(a)gmail.com>
>>>> wrote:
>>>>
>>>>> It's not just a question of which value to choose, but also how to
>>>>> sort. It would be nice to be able to choose sorting in alphabetical order
>>>>> vs numerical order. It would also be nice to assign a default sort to any
>>>>> item label that is taken from the Wikipedia {{DEFAULTSORT}} template
>>>>> (though that won't work for items without a Wikipedia article).
>>>>>
>>>>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu
>>>>> <dandreescu(a)wikimedia.org> wrote:
>>>>>
>>>>>> The problem with working on EL data in hive is that the schemas for
>>>>>> the tables can change at any point, in backwards-incompatible ways. And
>>>>>> maintaining tables dynamically is harder here than in mysql world (where EL
>>>>>> just tries to insert, and creates the table on failure). So, while it's
>>>>>> relatively easy to use ua-parser (see below), you can't easily access EL
>>>>>> data in hive tables. However, we do have all EL data in hadoop, so you can
>>>>>> access it with Spark. Andrew's about to answer with more details on that.
>>>>>> I just thought this might be useful if you sqoop EL data from mysql or
>>>>>> otherwise import it into a Hive table:
>>>>>>
>>>>>>
>>>>>> from stat1002, start hive, then:
>>>>>>
>>>>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>>>>
>>>>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>>>>
>>>>>> select ua_parser('Wikimedia Bot');
>>>>>>
>>>>>>
> The question was actually about doing UA parsing in MySQL directly, but I
> did appreciate the additional information about Hadoop-based options (even
> though they don't cover many use cases that will be addressed by storing
> parsed data directly in EL tables).
> To go off on a tangent for a bit: Accessing EL data in Hadoop is an
> interesting topic, but as far as I know it has not been done widely before.
> I have been interested in it for a while (mainly because of the performance
> issues with some large EL tables in MariaDB) and in January Marcel walked
> me through the steps at
> https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hadoop
> and we successfully imported one
> partition (one hour's worth of data). But we got stuck at the question of
> how to merge separately imported partitions, and I still don't see that
> addressed in the documentation. I understand that's a separate problem from
> the schema versioning issues discussed in this thread. Also, while the
> Spark option sounds cool, it would involve learning an entirely new tool
> and workflow for myself and other analysts (IIRC Oliver also made that
> point when we discussed this on IRC earlier this year with the team).
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Hello Analytics,
WMDE will onboard their data analyst soon. Adam and I (with the help of
various others) compiled an onboarding document to collect all™ the needed
information to get started.
https://docs.google.com/document/d/1lSP4aamtkv1XI5euC1NAGWeci9M39ZiXkkmB9vC…
It may be useful for us to copy some of it over to wiki pages under
https://wikitech.wikimedia.org/wiki/Analytics
We could create a "getting started" page from it (or is there already one
that I overlooked?)
Jan
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world, in which every single human being can freely share in the
sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit
organization by the Finanzamt für Körperschaften I Berlin, tax number
27/029/42207.
Please do take a look at desktop-only data available now:
https://analytics.wikimedia.org/dashboards/browsers/#desktop-site-by-os
Fedora is there, but over a year and a half there are only a handful of days
where usage goes over 0.05%.
Thanks,
Nuria
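To make the effect of the 0.05% cut-off concrete, here is a small Python
sketch of the grouping issue described in the quoted correction further down
(per-version entries each falling under the threshold even though the OS
family as a whole would not). The shares are hypothetical, the helper names
are made up for this example, and the use of `>=` for the comparison is an
assumption; the actual job is the Oozie configuration linked in the thread.

```python
from collections import defaultdict

THRESHOLD_PERCENT = 0.05  # reporting cut-off mentioned in this thread

# Hypothetical shares (percent of requests) per (os_family, os_major).
shares = {
    ("Fedora", "23"): 0.03,
    ("Fedora", "24"): 0.03,
    ("Ubuntu", "16"): 0.40,
}


def reported(share_by_key, threshold):
    """Keep only entries whose share reaches the threshold (assumed >=)."""
    return {key: share for key, share in share_by_key.items() if share >= threshold}


def by_family(share_by_version):
    """Sum per-version shares into one share per OS family."""
    totals = defaultdict(float)
    for (family, _major), share in share_by_version.items():
        totals[family] += share
    return dict(totals)


per_version = reported(shares, THRESHOLD_PERCENT)
per_family = reported(by_family(shares), THRESHOLD_PERCENT)
print(sorted(per_version))  # only the Ubuntu entry survives
print(sorted(per_family))   # Fedora appears once versions are summed
```

Filtering before aggregating drops both Fedora rows; aggregating first keeps
the family. Which order the real report uses determines what shows up.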
On Fri, Mar 17, 2017 at 6:42 AM, Christian Schaller <cschalle(a)redhat.com>
wrote:
>
>
>
>
> ----- Original Message -----
> > From: "Nuria Ruiz" <nuria(a)wikimedia.org>
> > To: "Christian Schaller" <cschalle(a)redhat.com>
> > Cc: "A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics."
> > <analytics(a)lists.wikimedia.org>, "Tomas Popela" <tpopela(a)redhat.com>
> > Sent: Thursday, March 16, 2017 4:11:54 PM
> > Subject: Re: [Analytics] Os stats
> >
> > > Hmm, does not make sense to me that the traffic caused by our users
> > > would be that small,
> > Overall? I disagree; I think it does. Wikipedia (our main source of
> > traffic across all Wikimedia projects) is quickly moving to mobile, so
> > mobile OSes make up the bulk of the requests, desktop is the minority
> > and, within that minority, Linux is a minority.
>
> Sorry, I did not mean to imply that gathering the mobile statistics isn't
> useful for Wikipedia. I was just saying that grouping them with desktop
> data drowns out desktop data for smaller outfits like ourselves, making
> the data less useful for us. That said, I do appreciate that you share this
> data as a public service and have no obligation to do so. So to be 100%
> clear: regardless of immediate usefulness to me, I am grateful for the
> effort you are putting in. So thank you :)
>
>
> > Just looked at December 2016 overall pageviews for desktop and mobile
> > coming from "users" (not self-identified-bots) and for that month about
> > 20% of pageviews are on iOS, 25% are on Android and Fedora is 0.027%.
> > This data counts all projects for the whole world; Fedora probably
> > represents a larger chunk of US desktop-only traffic.
> >
> > I think we are going to be adding a bit more info to our browser reports
> > with desktop-only data but still, Fedora traffic is probably not going to
> > display.
> >
> > > Anyway, I will install the analytics stuff myself on a local machine
> > > and do some testing, to see if I can see a reason for things to fail
> > > to register properly.
> >
> > If you end up committing any fix to ua-parser, please let us know.
> >
> > On Thu, Mar 16, 2017 at 9:28 AM, Christian Schaller
> > <cschalle(a)redhat.com> wrote:
> >
> > > Hmm, does not make sense to me that the traffic caused by our users
> > > would be that small,
> > > and there is no version string for Fedora in the user agent, it is just:
> > > Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like
> > > Gecko) Chrome/56.0.2924.87 Safari/537.36
> > >
> > > Anyway, I will install the analytics stuff myself on a local machine
> > > and do some testing, to see if I can see a reason for things to fail
> > > to register properly. Thanks for the quick and helpful answers so far.
> > >
> > > Christian
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Nuria Ruiz" <nuria(a)wikimedia.org>
> > > > To: "A mailing list for the Analytics Team at WMF and everybody who
> has
> > > an interest in Wikipedia and analytics."
> > > > <analytics(a)lists.wikimedia.org>
> > > > Cc: "Christian Schaller" <cschalle(a)redhat.com>, "Tomas Popela" <
> > > tpopela(a)redhat.com>
> > > > Sent: Thursday, March 16, 2017 12:12:28 PM
> > > > Subject: Re: [Analytics] Os stats
> > > >
> > > > Small correction, threshold of browser reporting is 0.05%:
> > > > https://github.com/wikimedia/analytics-refinery/blob/master/oozie/browser/general/coordinator.properties#L62
> > > > Even for our traffic, reporting below that number is really not that
> > > > meaningful. Now, because of the way that grouping happens, if
> > > > 'Fedora 23' and 'Fedora 24' (imaginary versions) each have 0.025% of
> > > > traffic, neither will get reported. This is something we would like
> > > > to improve and we have a ticket for it here:
> > > > https://phabricator.wikimedia.org/T131127 (feel free to chime in)
> > > >
> > > > Now, even with big traffic like ours there is a threshold below which
> > > > reporting data is not meaningful, as numbers in some instances
> > > > oscillate a lot and that means that there is more noise than signal.
> > > > We will try to get a specific "desktop" tab (so only requests to the
> > > > desktop site are counted) but even then, Fedora traffic might be too
> > > > small to display.
> > > >
> > > > On Thu, Mar 16, 2017 at 6:09 AM, Dan Andreescu
> > > > <dandreescu(a)wikimedia.org> wrote:
> > > >
> > > > > The threshold is actually at 0.1%, though you are right that this
> > > > > is fairly arbitrary. We have sanitizing data on our goals for next
> > > > > quarter, and that's when we'll take a more mathematical approach to
> > > > > the problem.
> > > > >
> > > > > Original Message
> > > > > From: Christian Schaller
> > > > > Sent: Thursday, March 16, 2017 08:44
> > > > > To: Dan Andreescu
> > > > > Cc: A mailing list for the Analytics Team at WMF and everybody who
> has
> > > an
> > > > > interest in Wikipedia and analytics.; Tomas Popela
> > > > > Subject: Re: [Analytics] Os stats
> > > > >
> > > > > Been thinking a bit about this and while I do appreciate the privacy
> > > > > concerns I would assume that even if you set the threshold to 0.5%
> > > > > the amount of traffic on Wikipedia would still be great enough for
> > > > > that to not be a real privacy risk? It is just that wikimedia is one
> > > > > of the few open sources with a huge traffic base for this kind of
> > > > > information and we would love to use it as a neutral way to track
> > > > > our own userbase growth in comparison with the wider market. So we
> > > > > know from our internal statistics that we more than doubled our
> > > > > userbase over the last year, but having a resource like wikimedia
> > > > > would allow us to see how those numbers play out in the bigger
> > > > > picture. So any chance of convincing you to lower the threshold to
> > > > > 0.5% to hopefully allow us to start using the statistics already now?
> > > > >
> > > > > Sincerely,
> > > > > Christian F.K. Schaller
> > > > > Manager for Fedora & Red Hat Desktop efforts
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Dan Andreescu" <dandreescu(a)wikimedia.org>
> > > > > > To: "A mailing list for the Analytics Team at WMF and everybody
> who
> > > has
> > > > > an interest in Wikipedia and analytics."
> > > > > > <analytics(a)lists.wikimedia.org>
> > > > > > Cc: "Christian Schaller" <cschalle(a)redhat.com>, "Tomas Popela" <
> > > > > tpopela(a)redhat.com>
> > > > > > Sent: Tuesday, March 14, 2017 2:10:38 PM
> > > > > > Subject: Re: [Analytics] Os stats
> > > > > >
> > > > > > Christian,
> > > > > >
> > > > > > I wanted to make sure our code is working well so I took a look.
> > > > > > We use UA Parser, a regex-based community-maintained user agent
> > > > > > identifier. It correctly identified Fedora as the OS in all of the
> > > > > > strings I found like '%Fedora%' for the hour of raw webrequests I
> > > > > > looked at. However, fewer than 0.1% of requests were identified as
> > > > > > Fedora. We cut off reporting statistics when numbers get that low
> > > > > > for privacy reasons. But everything is detected correctly, so if
> > > > > > Fedora's share of requests increases, it will show up on the charts.
> > > > > >
> > > > > > Hope this helps.
> > > > > >
> > > > > > On Tue, Mar 14, 2017 at 1:51 PM, Erik Zachte
> > > > > > <ezachte(a)wikimedia.org> wrote:
> > > > > >
> > > > > > > Hi Christian,
> > > > > > >
> > > > > > > I'm forwarding your question to the WMF Analytics Team who
> > > > > > > authored this report.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Erik
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > > > Sent: Monday, March 13, 2017 16:07
> > > > > > > To: Erik Zachte
> > > > > > > Cc: Tomas Popela
> > > > > > > Subject: Re: Os stats
> > > > > > >
> > > > > > > Hi Erik,
> > > > > > > Thanks for getting the new OS stats up on:
> > > > > > > https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os/os-family-timeseries
> > > > > > >
> > > > > > > That said as far as we can tell the detection of Fedora does not
> > > > > > > work at all currently and we can not figure out why. Ubuntu,
> > > > > > > which is detected, uses the following user agent:
> > > > > > > Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> > > > > > >
> > > > > > > While Fedora, which isn't detected, uses this user agent:
> > > > > > > Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> > > > > > >
> > > > > > > Would you be so kind as to let us know what the wikimedia
> > > > > > > analytics engine uses to try to identify Fedora systems? We can
> > > > > > > tweak our user agents quite easily if that is easier than
> > > > > > > updating the analytics engine's way of detecting Fedora.
> > > > > > >
> > > > > > > Sincerely,
> > > > > > > Christian F.K. Schaller
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > > > > > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > > > > > > Sent: Tuesday, October 6, 2015 11:28:55 AM
> > > > > > > > Subject: RE: Os stats
> > > > > > > >
> > > > > > > > Hi Christian,
> > > > > > > >
> > > > > > > > Sorry, since my previous response we put the reports on hold,
> > > > > > > > as there are issues with reliability now that we migrated to
> > > > > > > > https almost fully.
> > > > > > > >
> > > > > > > > Can you please add your signature to
> > > > > > > > https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2
> > > > > > > > I can do it for you, but I don't know: can I add your full
> > > > > > > > name or do you have a Wikipedia nick name that you prefer to
> > > > > > > > use?
> > > > > > > >
> > > > > > > > We are working on migration of the reports. More here:
> > > > > > > > https://phabricator.wikimedia.org/T114379
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Erik
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > > > > Sent: Tuesday, October 06, 2015 16:16
> > > > > > > > To: Erik Zachte
> > > > > > > > Subject: Re: Os stats
> > > > > > > >
> > > > > > > > Hi Erik,
> > > > > > > > Just checking what the current plans are for the OS statistics
> > > > > > > > on the wikimedia site. As I mentioned in my first email to you,
> > > > > > > > we would love to use these numbers as a way to estimate how we
> > > > > > > > are doing with Fedora Linux as they are one of the few sources
> > > > > > > > for such statistics where we can be fairly sure the data is not
> > > > > > > > biased one way or the other (due to the huge number of people
> > > > > > > > using wikipedia). Of course with the old stats being
> > > > > > > > discontinued I am now waiting for the new data to be made
> > > > > > > > available to start building my usage trend statistics :)
> > > > > > > >
> > > > > > > > So on the page it says to let us know if we want a specific
> > > > > > > > report kept, so I would like to repeat my wish that there is a
> > > > > > > > version of report '2' kept available.
> > > > > > > >
> > > > > > > > Anyway, I realize that maintaining these website statistics is
> > > > > > > > a bit of a sideshow for you guys and not a core part of what
> > > > > > > > you're doing, so I just want to say that I do truly appreciate
> > > > > > > > the effort to try to have something at all available.
> > > > > > > >
> > > > > > > > Sincerely,
> > > > > > > > Christian Schaller
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > > > > > > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > > > > > > > Sent: Monday, June 22, 2015 10:41:40 AM
> > > > > > > > > Subject: RE: Os stats
> > > > > > > > >
> > > > > > > > > Hi Christian,
> > > > > > > > >
> > > > > > > > > I started a job to catch up on the last 3 months; it will take
> > > > > > > > > 4-5 days.
> > > > > > > > >
> > > > > > > > > FYI these reports are almost end-of-life. Expect a complete
> > > > > > > > > overhaul of Wikimedia traffic and core metrics reporting based
> > > > > > > > > on bigger iron and new paradigms (e.g. hadoop) in 2015 Q3/Q4.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Erik
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > > > > > Sent: Tuesday, June 16, 2015 16:46
> > > > > > > > > To: ezachte(a)wikimedia.org
> > > > > > > > > Subject: Os stats
> > > > > > > > >
> > > > > > > > > Hi Erik,
> > > > > > > > > Been checking out the stats on
> > > > > > > > > https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm.
> > > > > > > > > Are you planning on updating that page again soon?
> > > > > > > > > We are using your numbers as one of the datapoints for
> > > > > > > > > estimating how Fedora Linux is doing, so I hope you plan on
> > > > > > > > > pulling new numbers from time to time.
> > > > > > > > >
> > > > > > > > > Christian
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Analytics mailing list
> > > > > > > Analytics(a)lists.wikimedia.org
> > > > > > > https://lists.wikimedia.org/mailman/listinfo/analytics
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
> Hmm, does not make sense to me that the traffic caused by our users would
> be that small,
Overall? I disagree; I think it does. Wikipedia (our main source of traffic
across all Wikimedia projects) is quickly moving to mobile, so mobile OSes
make up the bulk of the requests, desktop is the minority and, within that
minority, Linux is a minority.
Just looked at December 2016 overall pageviews for desktop and mobile
coming from "users" (not self-identified-bots) and for that month about 20%
of pageviews are on iOS, 25% are on Android and Fedora is 0.027%. This data
counts all projects for the whole world; Fedora probably represents a
larger chunk of US desktop-only traffic.
I think we are going to be adding a bit more info to our browser reports
with desktop-only data but still, Fedora traffic is probably not going to
display.
> Anyway, I will install the analytics stuff myself on a local machine and
> do some testing, to see if I
> can see a reason for things to fail to register properly.
If you end up committing any fix to ua-parser, please let us know.
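As a rough illustration of why a desktop-only view raises Fedora's share
without necessarily lifting it over the reporting cut-off, here is a short
Python sketch. All counts are hypothetical and chosen only so that the overall
share matches the 0.027% figure above; the 0.05% threshold is the one quoted
later in this thread, and `share_percent` is a helper made up for this example.

```python
THRESHOLD_PERCENT = 0.05  # reporting cut-off quoted in this thread

# Hypothetical pageview counts, chosen only to make the arithmetic visible.
pageviews = {
    "all_sites": 1_000_000,
    "desktop_site": 600_000,
    "fedora": 270,  # assumed here to be desktop-only traffic
}


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


overall = share_percent(pageviews["fedora"], pageviews["all_sites"])
desktop_only = share_percent(pageviews["fedora"], pageviews["desktop_site"])

for label, share in (("overall", overall), ("desktop only", desktop_only)):
    status = "shown" if share >= THRESHOLD_PERCENT else "hidden"
    print(f"{label}: {share:.4f}% -> {status}")
```

With these made-up numbers the share rises from 0.027% to 0.045% when mobile
traffic is excluded, which is still under the cut-off.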
On Thu, Mar 16, 2017 at 9:28 AM, Christian Schaller <cschalle(a)redhat.com>
wrote:
> Hmm, does not make sense to me that the traffic caused by our users would
> be that small,
> and there is no version string for Fedora in the user agent, it is just:
> Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like
> Gecko) Chrome/56.0.2924.87 Safari/537.36
>
> Anyway, I will install the analytics stuff myself on a local machine and
> do some testing, to see if I
> can see a reason for things to fail to register properly. Thanks for the
> quick and helpful answers so far.
>
> Christian
>
>
>
> ----- Original Message -----
> > From: "Nuria Ruiz" <nuria(a)wikimedia.org>
> > To: "A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics."
> > <analytics(a)lists.wikimedia.org>
> > Cc: "Christian Schaller" <cschalle(a)redhat.com>, "Tomas Popela" <
> tpopela(a)redhat.com>
> > Sent: Thursday, March 16, 2017 12:12:28 PM
> > Subject: Re: [Analytics] Os stats
> >
> > Small correction, threshold of browser reporting is 0.05%:
> > https://github.com/wikimedia/analytics-refinery/blob/master/oozie/browser/general/coordinator.properties#L62
> > Even for our traffic, reporting below that number is really not that
> > meaningful. Now, because of the way that grouping happens, if 'Fedora 23'
> > and 'Fedora 24' (imaginary versions) each have 0.025% of traffic, neither
> > will get reported. This is something we would like to improve and we have
> > a ticket for it here: https://phabricator.wikimedia.org/T131127 (feel
> > free to chime in)
> >
> > Now, even with big traffic like ours there is a threshold below which
> > reporting data is not meaningful, as numbers in some instances oscillate
> > a lot and that means that there is more noise than signal. We will try to
> > get a specific "desktop" tab (so only requests to the desktop site are
> > counted) but even then, Fedora traffic might be too small to display.
> >
> > On Thu, Mar 16, 2017 at 6:09 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
> > wrote:
> >
> > > The threshold is actually at 0.1%, though you are right that this is
> > > fairly arbitrary. We have sanitizing data on our goals for next
> > > quarter, and that's when we'll take a more mathematical approach to the
> > > problem.
> > >
> > > Original Message
> > > From: Christian Schaller
> > > Sent: Thursday, March 16, 2017 08:44
> > > To: Dan Andreescu
> > > Cc: A mailing list for the Analytics Team at WMF and everybody who has
> an
> > > interest in Wikipedia and analytics.; Tomas Popela
> > > Subject: Re: [Analytics] Os stats
> > >
> > > Been thinking a bit about this and while I do appreciate the privacy
> > > concerns I would assume that even if you set the threshold to 0.5% the
> > > amount of traffic on Wikipedia would still be great enough for that to
> > > not be a real privacy risk? It is just that wikimedia is one of the few
> > > open sources with a huge traffic base for this kind of information and
> > > we would love to use it as a neutral way to track our own userbase
> > > growth in comparison with the wider market. So we know from our
> > > internal statistics that we more than doubled our userbase over the
> > > last year, but having a resource like wikimedia would allow us to see
> > > how those numbers play out in the bigger picture. So any chance of
> > > convincing you to lower the threshold to 0.5% to hopefully allow us to
> > > start using the statistics already now?
> > >
> > > Sincerely,
> > > Christian F.K. Schaller
> > > Manager for Fedora & Red Hat Desktop efforts
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Dan Andreescu" <dandreescu(a)wikimedia.org>
> > > > To: "A mailing list for the Analytics Team at WMF and everybody who
> has
> > > an interest in Wikipedia and analytics."
> > > > <analytics(a)lists.wikimedia.org>
> > > > Cc: "Christian Schaller" <cschalle(a)redhat.com>, "Tomas Popela" <
> > > tpopela(a)redhat.com>
> > > > Sent: Tuesday, March 14, 2017 2:10:38 PM
> > > > Subject: Re: [Analytics] Os stats
> > > >
> > > > Christian,
> > > >
> > > > I wanted to make sure our code is working well so I took a look. We
> > > > use UA Parser, a regex-based community-maintained user agent
> > > > identifier. It correctly identified Fedora as the OS in all of the
> > > > strings I found like '%Fedora%' for the hour of raw webrequests I
> > > > looked at. However, fewer than 0.1% of requests were identified as
> > > > Fedora. We cut off reporting statistics when numbers get that low for
> > > > privacy reasons. But everything is detected correctly, so if Fedora's
> > > > share of requests increases, it will show up on the charts.
> > > >
> > > > Hope this helps.
> > > >
> > > > On Tue, Mar 14, 2017 at 1:51 PM, Erik Zachte <ezachte(a)wikimedia.org>
> > > > wrote:
> > > >
> > > > > Hi Christian,
> > > > >
> > > > > I'm forwarding your question to the WMF Analytics Team who authored
> > > > > this report.
> > > > >
> > > > > Cheers,
> > > > > Erik
> > > > >
> > > > > -----Original Message-----
> > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > Sent: Monday, March 13, 2017 16:07
> > > > > To: Erik Zachte
> > > > > Cc: Tomas Popela
> > > > > Subject: Re: Os stats
> > > > >
> > > > > Hi Erik,
> > > > > Thanks for getting the new OS stats up on:
> > > > > https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os/os-family-timeseries
> > > > >
> > > > > That said as far as we can tell the detection of Fedora does not
> > > > > work at all currently and we can not figure out why. Ubuntu, which
> > > > > is detected, uses the following user agent:
> > > > > Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> > > > >
> > > > > While Fedora, which isn't detected, uses this user agent:
> > > > > Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> > > > >
> > > > > Would you be so kind as to let us know what the wikimedia analytics
> > > > > engine uses to try to identify Fedora systems? We can tweak our user
> > > > > agents quite easily if that is easier than updating the analytics
> > > > > engine's way of detecting Fedora.
> > > > >
> > > > > Sincerely,
> > > > > Christian F.K. Schaller
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > > > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > > > > Sent: Tuesday, October 6, 2015 11:28:55 AM
> > > > > > Subject: RE: Os stats
> > > > > >
> > > > > > Hi Christian,
> > > > > >
> > > > > > Sorry, since my previous response we put the reports on hold, as
> > > > > > there are issues with reliability now that we migrated to https
> > > > > > almost fully.
> > > > > >
> > > > > > Can you please add your signature to
> > > > > > https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2
> > > > > > I can do it for you, but I don't know: can I add your full name or
> > > > > > do you have a Wikipedia nick name that you prefer to use?
> > > > > >
> > > > > > We are working on migration of the reports. More here:
> > > > > > https://phabricator.wikimedia.org/T114379
> > > > > >
> > > > > > Cheers,
> > > > > > Erik
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > > Sent: Tuesday, October 06, 2015 16:16
> > > > > > To: Erik Zachte
> > > > > > Subject: Re: Os stats
> > > > > >
> > > > > > Hi Erik,
> > > > > > Just checking what the current plans are for the OS statistics on
> > > > > > the wikimedia site. As I mentioned in my first email to you, we
> > > > > > would love to use these numbers as a way to estimate how we are
> > > > > > doing with Fedora Linux as they are one of the few sources for such
> > > > > > statistics where we can be fairly sure the data is not biased one
> > > > > > way or the other (due to the huge number of people using wikipedia).
> > > > > > Of course with the old stats being discontinued I am now waiting
> > > > > > for the new data to be made available to start building my usage
> > > > > > trend statistics :)
> > > > > >
> > > > > > So on the page it says to let us know if we want a specific report
> > > > > > kept, so I would like to repeat my wish that there is a version of
> > > > > > report '2' kept available.
> > > > > >
> > > > > > Anyway, I realize that maintaining these website statistics is a bit
> > > > > > of a sideshow for you guys and not a core part of what you're doing,
> > > > > > so I just want to say that I do truly appreciate the effort to try
> > > > > > to have something at all available.
> > > > > >
> > > > > > Sincerely,
> > > > > > Christian Schaller
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > > > > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > > > > > Sent: Monday, June 22, 2015 10:41:40 AM
> > > > > > > Subject: RE: Os stats
> > > > > > >
> > > > > > > Hi Christian,
> > > > > > >
> > > > > > > I started a job to catch up for the last 3 months; it will take
> > > > > > > 4-5 days.
> > > > > > >
> > > > > > > FYI these reports are almost end-of-life. Expect a complete
> > > > > > > overhaul of Wikimedia traffic and core metrics reporting based on
> > > > > > > bigger iron and new paradigms (e.g. Hadoop) in 2015 Q3/Q4.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Erik
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > > > > Sent: Tuesday, June 16, 2015 16:46
> > > > > > > To: ezachte(a)wikimedia.org
> > > > > > > Subject: Os stats
> > > > > > >
> > > > > > > Hi Erik,
> > > > > > > Been checking out the stats on
> > > > > > > https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm.
> > > > > > > Are you planning on updating that page again soon?
> > > > > > > We are using your numbers as one of the datapoints for estimating
> > > > > > > how Fedora Linux is doing, so I hope you plan on pulling new
> > > > > > > numbers from time to time.
> > > > > > >
> > > > > > > Christian
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Analytics mailing list
> > > > > Analytics(a)lists.wikimedia.org
> > > > > https://lists.wikimedia.org/mailman/listinfo/analytics
> > > > >
> > > >
> > >
> > >
> >
>
The threshold is actually at 0.1%, though you are right that this is fairly arbitrary. Sanitizing data is among our goals for next quarter, and that's when we'll take a more mathematical approach to the problem.
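To make the cutoff concrete, here is a minimal sketch of how a 0.1% share threshold could be applied to per-OS request counts. The function name `suppress_small_shares`, the request counts, and the choice to fold suppressed families into an "Other" bucket are all assumptions made for illustration; this is not the actual pipeline code.

```python
from collections import Counter

THRESHOLD = 0.001  # 0.1% of all requests, the cutoff mentioned above


def suppress_small_shares(counts, threshold=THRESHOLD):
    """Return {os_family: share}, hiding families whose share is below the threshold.

    Folding hidden families into "Other" is an assumption of this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {}
    hidden = 0
    for family, n in counts.items():
        if n / total < threshold:
            hidden += n  # too small to report on its own
        else:
            shares[family] = n / total
    if hidden:
        shares["Other"] = shares.get("Other", 0.0) + hidden / total
    return shares


# Made-up counts for one hour of requests (hypothetical numbers).
hourly_counts = Counter(
    {"Windows": 600_000, "Android": 250_000, "Ubuntu": 149_500, "Fedora": 500}
)
reported = suppress_small_shares(hourly_counts)
```

With these invented numbers Fedora accounts for 0.05% of requests, so it is not reported as its own row even though it was detected correctly.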
Original Message
From: Christian Schaller
Sent: Thursday, March 16, 2017 08:44
To: Dan Andreescu
Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.; Tomas Popela
Subject: Re: [Analytics] Os stats
Been thinking a bit about this and while I do appreciate the privacy concerns I would assume that
even if you set the threshold to 0.5% the amount of traffic on Wikipedia would still be great enough
for that to not be a real privacy risk? It is just that wikimedia is one of the few open sources with
a huge traffic base for this kind of information and we would love to use it as a neutral way to track
our own userbase growth in comparison with the wider market. So we know from our internal statistics that we
more than doubled our userbase over the last year, but having a resource like wikimedia would allow us to see
how those numbers play out in the bigger picture. So any chance of convincing you to lower the threshold
to 0.5% to hopefully allow us to start using the statistics already now?
Sincerely,
Christian F.K. Schaller
Manager for Fedora & Red Hat Desktop efforts
----- Original Message -----
> From: "Dan Andreescu" <dandreescu(a)wikimedia.org>
> To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>
> Cc: "Christian Schaller" <cschalle(a)redhat.com>, "Tomas Popela" <tpopela(a)redhat.com>
> Sent: Tuesday, March 14, 2017 2:10:38 PM
> Subject: Re: [Analytics] Os stats
>
> Christian,
>
> I wanted to make sure our code is working well so I took a look. We use UA
> Parser, a regex-based community-maintained user agent identifier. It
> correctly identified Fedora as the OS in all of the strings I found like
> '%Fedora%' for the hour of raw webrequests I looked at. However, fewer
> than 0.1% of requests were identified as Fedora. We cut off
> reporting statistics when numbers get that low for privacy reasons. But
> everything is detected correctly, so if Fedora's share of requests
> increases, it will show up on the charts.
>
> Hope this helps.
>
> On Tue, Mar 14, 2017 at 1:51 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
>
> > Hi Christian,
> >
> > I'm forwarding your question to the WMF Analytics Team who authored this
> > report.
> >
> > Cheers,
> > Erik
> >
> > -----Original Message-----
> > From: Christian Schaller [mailto:cschalle@redhat.com]
> > Sent: Monday, March 13, 2017 16:07
> > To: Erik Zachte
> > Cc: Tomas Popela
> > Subject: Re: Os stats
> >
> > Hi Erik,
> > Thanks for getting the new OS stats up on:
> > https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os/os-family-timeseries
> >
> > That said, as far as we can tell the detection of Fedora does not work at
> > all currently and we cannot figure out why. Ubuntu, which is detected, uses
> > the following user agent:
> > Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> >
> > While Fedora, which isn't detected, uses this user agent:
> > Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
> >
> > Would you be so kind as to let us know what the Wikimedia analytics engine
> > uses to try to identify Fedora systems? We can tweak our user agents quite
> > easily if that is easier than updating the analytics engine's way of
> > detecting Fedora.
> >
> > Sincerely,
> > Christian F.K. Schaller
> >
> >
> > ----- Original Message -----
> > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > Sent: Tuesday, October 6, 2015 11:28:55 AM
> > > Subject: RE: Os stats
> > >
> > > Hi Christian,
> > >
> > > Sorry, since my previous response we put the reports on hold, as there
> > > are issues with reliability now that we migrated to HTTPS almost fully.
> > >
> > > Can you please add your signature to
> > > https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2
> > > I can do it for you, but I don't know: can I add your
> > > full name or do you have a Wikipedia nickname that you prefer to use?
> > >
> > > We are working on migration of the reports. More here:
> > > https://phabricator.wikimedia.org/T114379
> > >
> > > Cheers,
> > > Erik
> > >
> > > -----Original Message-----
> > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > Sent: Tuesday, October 06, 2015 16:16
> > > To: Erik Zachte
> > > Subject: Re: Os stats
> > >
> > > Hi Erik,
> > > Just checking what the current plans are for the OS statistics on the
> > > wikimedia site. As I mentioned in my first email to you, we would love
> > > to use these numbers as a way to estimate how we are doing with Fedora
> > > Linux as they are one of the few sources for such statistics where we
> > > can be fairly sure the data is not biased one way or the other (due to
> > > the huge number of people using wikipedia). Of course with the old
> > > stats being discontinued I am now waiting for the new data to be made
> > > available to start building my usage trend statistics :)
> > >
> > > So on the page it says to let us know if we want a specific report
> > > kept, so I would like to repeat my wish that there is a version of
> > > report '2' kept available.
> > >
> > > Anyway, I realize that maintaining these website statistics is a bit
> > > of a sideshow for you guys and not a core part of what you're doing, so
> > > I just want to say that I do truly appreciate the effort to try to
> > > have something at all available.
> > >
> > > Sincerely,
> > > Christian Schaller
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > > > To: "Christian Schaller" <cschalle(a)redhat.com>
> > > > Sent: Monday, June 22, 2015 10:41:40 AM
> > > > Subject: RE: Os stats
> > > >
> > > > Hi Christian,
> > > >
> > > > I started a job to catch up for the last 3 months; it will take 4-5 days.
> > > >
> > > > FYI these reports are almost end-of-life. Expect a complete overhaul
> > > > of Wikimedia traffic and core metrics reporting based on bigger iron
> > > > and new paradigms (e.g. Hadoop) in 2015 Q3/Q4.
> > > >
> > > > Cheers,
> > > > Erik
> > > >
> > > > -----Original Message-----
> > > > From: Christian Schaller [mailto:cschalle@redhat.com]
> > > > Sent: Tuesday, June 16, 2015 16:46
> > > > To: ezachte(a)wikimedia.org
> > > > Subject: Os stats
> > > >
> > > > Hi Erik,
> > > > Been checking out the stats on
> > > > https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm.
> > > > Are you planning on updating that page again soon?
> > > > We are using your numbers as one of the datapoints for estimating
> > > > how Fedora Linux is doing, so I hope you plan on pulling new numbers
> > > > from time to time.
> > > >
> > > > Christian
> > > >
> > > >
> > >
> > >
> >
> >
> >
>
Hi Christian,
I'm forwarding your question to the WMF Analytics Team who authored this report.
Cheers,
Erik
-----Original Message-----
From: Christian Schaller [mailto:cschalle@redhat.com]
Sent: Monday, March 13, 2017 16:07
To: Erik Zachte
Cc: Tomas Popela
Subject: Re: Os stats
Hi Erik,
Thanks for getting the new OS stats up on:
https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os/os-fam…
That said, as far as we can tell the detection of Fedora does not work at all currently and we cannot figure out why. Ubuntu, which is detected, uses the following user agent:
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
While Fedora, which isn't detected, uses this user agent:
Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Would you be so kind as to let us know what the Wikimedia analytics engine uses to try to identify Fedora systems? We can tweak our user agents quite easily if that is easier than updating the analytics engine's way of detecting Fedora.
Sincerely,
Christian F.K. Schaller
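For illustration, a regex-based check of the general kind a user agent parser performs might look like the sketch below. The pattern and the `detect_linux_distro` helper are hypothetical and far simpler than the rules in UA Parser; the sketch only shows that the two user agents quoted above differ in their platform token and that a single pattern can pick out both.

```python
import re

# Hypothetical pattern: captures the distribution token between "X11;" and "Linux".
DISTRO_RE = re.compile(r"\(X11; (?P<distro>[A-Za-z]+); Linux")


def detect_linux_distro(user_agent):
    """Return the distribution token from an X11 user agent, or None if absent."""
    match = DISTRO_RE.search(user_agent)
    return match.group("distro") if match else None


ubuntu_ua = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
fedora_ua = "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
# A user agent with no distribution token (made up for this example).
generic_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

for ua in (ubuntu_ua, fedora_ua, generic_ua):
    print(detect_linux_distro(ua))
```

Under this simplified pattern both the Ubuntu and the Fedora string yield a distribution name, while a user agent without the token yields `None`.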
----- Original Message -----
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "Christian Schaller" <cschalle(a)redhat.com>
> Sent: Tuesday, October 6, 2015 11:28:55 AM
> Subject: RE: Os stats
>
> Hi Christian,
>
> Sorry, since my previous response we put the reports on hold, as there
> are issues with reliability now that we migrated to HTTPS almost fully.
>
> Can you please add your signature to
> https://www.mediawiki.org/wiki/Analytics/Wikistats/TrafficReports/Future_per_report_B2
> I can do it for you, but I don't know: can I add your
> full name or do you have a Wikipedia nickname that you prefer to use?
>
> We are working on migration of the reports. More here:
> https://phabricator.wikimedia.org/T114379
>
> Cheers,
> Erik
>
> -----Original Message-----
> From: Christian Schaller [mailto:cschalle@redhat.com]
> Sent: Tuesday, October 06, 2015 16:16
> To: Erik Zachte
> Subject: Re: Os stats
>
> Hi Erik,
> Just checking what the current plans are for the OS statistics on the
> wikimedia site. As I mentioned in my first email to you, we would love
> to use these numbers as a way to estimate how we are doing with Fedora
> Linux as they are one of the few sources for such statistics where we
> can be fairly sure the data is not biased one way or the other (due to
> the huge number of people using wikipedia). Of course with the old
> stats being discontinued I am now waiting for the new data to be made
> available to start building my usage trend statistics :)
>
> So on the page it says to let us know if we want a specific report
> kept, so I would like to repeat my wish that there is a version of
> report '2' kept available.
>
> Anyway, I realize that maintaining these website statistics is a bit
> of a sideshow for you guys and not a core part of what you're doing, so
> I just want to say that I do truly appreciate the effort to try to
> have something at all available.
>
> Sincerely,
> Christian Schaller
>
>
>
> ----- Original Message -----
> > From: "Erik Zachte" <ezachte(a)wikimedia.org>
> > To: "Christian Schaller" <cschalle(a)redhat.com>
> > Sent: Monday, June 22, 2015 10:41:40 AM
> > Subject: RE: Os stats
> >
> > Hi Christian,
> >
> > I started a job to catch up for the last 3 months; it will take 4-5 days.
> >
> > FYI these reports are almost end-of-life. Expect a complete overhaul
> > of Wikimedia traffic and core metrics reporting based on bigger iron
> > and new paradigms (e.g. Hadoop) in 2015 Q3/Q4.
> >
> > Cheers,
> > Erik
> >
> > -----Original Message-----
> > From: Christian Schaller [mailto:cschalle@redhat.com]
> > Sent: Tuesday, June 16, 2015 16:46
> > To: ezachte(a)wikimedia.org
> > Subject: Os stats
> >
> > Hi Erik,
> > Been checking out the stats on
> > https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm.
> > Are you planning on updating that page again soon?
> > We are using your numbers as one of the datapoints for estimating
> > how Fedora Linux is doing, so I hope you plan on pulling new numbers
> > from time to time.
> >
> > Christian
> >
> >
>
>