Hi,
in the week from 2014-11-03 to 2014-11-09, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* More research around making xmldumps available in the Analytics cluster
* 'research' database user
* Altering data type of time_firstbyte, and adding Range header to webrequest table
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
* Per-wiki CSVs with daily aggregates of webstatscollector numbers
(details below)
Have fun,
Christian
* More research around making xmldumps available in the Analytics cluster
In order to make the xmldumps easily accessible from within the
cluster, we did more research around WikiHadoop as InputFormat and
Avro as serialization format. The proof of concept allowed us to
stream the xmldumps through WikiHadoop and write them into Avro
files. This approach chunks the xmldumps into records containing the
text (and metadata) for a revision and its parent revision. Those
records could be consumed directly from the cluster's processing
platforms.
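As a sketch of what consuming those records could look like, the Avro
files could be mapped into a Hive table along these lines (the field
names, the location, and STORED AS AVRO being available on our Hive
version are all assumptions):

  CREATE EXTERNAL TABLE xmldumps_revisions (
    page_id     BIGINT,
    rev_id      BIGINT,
    rev_text    STRING,
    parent_id   BIGINT,
    parent_text STRING
  )
  STORED AS AVRO
  LOCATION '/wmf/data/xmldumps/avro';  -- path is an assumption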
* 'research' database user
The code needed to make password changes of the 'research' database
user easier got merged, so future password changes should be more
frictionless. And the password was finally changed :-)
* Altering data type of time_firstbyte, and adding Range header to webrequest table
To be able to (at least mostly) disambiguate “seeking within a video
file” from “starting to watch a video file” in the logs, we needed to
add the Range header to the webrequest table.
Additionally, the data type of the time_firstbyte column needed to be
changed.
While both migrations should just work according to Hive's
documentation, testing them beforehand in labs showed that they
“sometimes” screw up the table. So we prepared scripts to resurrect
the table in case the migration blew it up in the Analytics cluster.
We migrated the table.
The table exploded (Bug 73095).
And the prepared scripts helped to rebuild it within a few minutes.
Now, webrequest has the needed Range header, a more granular data
type for time_firstbyte, and all of its partitions re-added.
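For reference, the two migrations boil down to statements of roughly
this shape (a sketch; the exact types are assumptions, e.g. that
time_firstbyte went from FLOAT to DOUBLE):

  ALTER TABLE webrequest ADD COLUMNS (`range` STRING);
  ALTER TABLE webrequest
    CHANGE time_firstbyte time_firstbyte DOUBLE;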
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
File logs of EventLogging data on stat1002 and stat1003 are now
automatically cleaned up after 90 days, as required by the data
retention guidelines.
* Per-wiki CSVs with daily aggregates of webstatscollector numbers
In order to ease the upcoming plotting of webstatscollector data in
Dashiki, we wrote code to automatically aggregate webstatscollector's
hourly projectcounts files into per-wiki CSVs with daily numbers. And
we backfilled with data back to 2008. The code and data are still
under code review.
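Purely as an illustration of the aggregation step (the actual code
works directly on the flat projectcounts files; the table and column
names below are made up):

  SELECT
    project,
    CONCAT(year, '-', month, '-', day) AS date,
    SUM(view_count) AS daily_views
  FROM hourly_projectcounts
  GROUP BY project, year, month, day;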
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
(Moving this discussion to analytics@ and localization-team@ based on Nuria’s suggestion below.)
Hi Leila,
The output I posted in the message is the only output I am seeing. I do not see the URL-encoded section or the validation section. I think there may be something wrong with my testing setup.
Niklas Laxström has checked what is happening with our event logging in beta and he confirmed that we are sending events and the events are valid. The issue seems to be that we are logging events to the beta event logging db, while what we checked earlier was the production event logging db.
Can you (or anyone who is available) check the event logging db in beta to see if the table has been created and has data? The schema name again is ContentTranslation. If you don’t find anything, let us know and we will do some more investigation.
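Assuming the usual EventLogging convention of one table per schema revision in the log database (and taking the revision number from the sample event quoted below in this thread), a check along these lines should tell:

  SHOW TABLES IN log LIKE 'ContentTranslation%';
  SELECT COUNT(*) FROM log.ContentTranslation_7146627;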
If there is data in the beta db, the next step would be to follow Dan’s instructions to get a dashboard set up on limn1. I believe that most of Dan’s instructions need to be handled by someone on the analytics team, but let me know if there is anything I can help with.
Thanks again for your help!
Joel
Joel Sahleen, Software Engineer
Language Engineering
Wikimedia Foundation
jsahleen(a)wikimedia.org
On Nov 11, 2014, at 11:47 PM, Leila Zia <leila(a)wikimedia.org> wrote:
> Hi Joel,
>
> When you log events, the output will be the URL-encoded JSON sent by the browser, the event record (similar to what you pasted in your email), and whether the event validates against the schema. For the sample output you pasted earlier, or another sample output, can you let us know if the validation section shows Valid?
>
> Leila
>
> On Mon, Nov 10, 2014 at 3:24 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> Joel,
>
> For questions like these, going forward you can contact analytics@ as you will be getting a more prompt response. Both Dan and Leila are OOTO the next couple of days.
>
> >There are configuration options for the dev server that need to be added. Do similar options need to be added when not using the dev server?
> No, there is no need.
>
> You would need sampling rates to determine at which rate you are logging, that is, if you are not logging all events.
>
> Thanks,
>
> Nuria
>
> On Mon, Nov 10, 2014 at 2:39 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
> Adding Nuria as she can probably help
>
> On Monday, November 10, 2014, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
> Hi Leila,
>
> I have tested our EventLogging code and it seems to be working fine with the event logging dev server. I can see the events coming through and they are valid. Here is some sample output:
>
> {"wiki": "wiki", "uuid": "e9dde14cf18552269ae81a7897f45d0c", "webHost": "localhost", "timestamp": 1415651367, "clientValidated": true, "recvFrom": "1.0.0.127.in-addr.arpa", "seqId": 2, "clientIp": "80f7683f3565e3d365740a1c8d1771ba95caaaaa", "schema": "ContentTranslation", "event": {"action": "create-translated-page", "targetLanguage": "ca", "token": "Tester", "version": 1, "contentLanguage": "es"}, "revision": 7146627}
>
> Are there additional configuration options we need to add to get EL working, aside from just requiring the main extension file? There are configuration options for the dev server that need to be added. Do similar options need to be added when not using the dev server?
>
> Any help on this would be much appreciated.
>
> Thanks,
>
> Joel
>
> On Nov 7, 2014, at 3:52 PM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>
>> No problem, Dan. Enjoy your vacation!
>>
>> I will read through the document at the link you sent. I still need to fix our event logging code so it may be a couple days before we are ready anyway. If I have any questions I will contact Leila or Nuria.
>>
>> Thanks,
>>
>> Joel
>>
>> Joel Sahleen, Software Engineer
>> Language Engineering
>> Wikimedia Foundation
>> jsahleen(a)wikimedia.org
>>
>>
>>
>>
>> On Nov 7, 2014, at 3:10 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
>>
>>> Joel, re: visualization,
>>>
>>> I'm going on vacation tomorrow and will be back on November 19th. If that's not too late, I can set up a limn instance then. If it's too late, that's ok, I wrote up the steps needed. Someone with access to the limn1.eqiad.wmflabs instance can perform them: https://wikitech.wikimedia.org/wiki/Analytics/Dashboards
>>>
>>> If you have the data or are generating the data in some other way, then you don't need half of that setup, you just need the part that sets up the limn dashboard which is only an hour or so of work. Sorry I'm running out the door and can't take care of that for you.
>>>
>>> Dan
>>>
>>> On Fri, Nov 7, 2014 at 7:37 AM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>>> Thank you for the information, Pau. Very helpful. As you say, this does not change our current plans or hold us up in any way. I just wasn't clear about the relationship between the "high priorities" and "other metrics" sections. Knowing these came from different people at different times clarifies things a lot.
>>> Joel
>>>
>>> On Nov 7, 2014, at 3:44 AM, Pau Giner <pginer(a)wikimedia.org> wrote:
>>>
>>>> @Pau, @Amir There is a section called High priorities for product management on the Content translation analytics page. Did these priorities come from outside the team or does this just represent our own internal view of the high priorities?
>>>>
>>>> Here is the story of that page as I'm aware of it:
>>>>
>>>> In September 2013, I was in a meeting with the analytics team in SF presenting an initial proposal for metrics. In that meeting, Dario recommended creating a hierarchy of metrics based on the project goals. I created such an image and a description for those metrics (the image is on top of our analytics page, and the metrics are described in what is now the "Other metrics for created articles" section).
>>>>
>>>> In a meeting between Amir and Howie, they captured the most important metrics from the product perspective in the "High priorities for product management" section. If I recall correctly, as an outcome of later meetings between Howie and Amir, Howie was happy focusing on articles published as a single (initial?) metric for success. Amir can provide more details since I was not in those meetings.
>>>>
>>>> In short: The analytics page has pieces contributed by different people over the last year, and although there are many ideas to organise and detail, measuring the number of published articles seems to be the solid candidate to get started with, so we can learn from the value we get from it and polish the rest of our goal-to-signal process for detecting better metrics.
>>>>
>>>>
>>>> Pau
>>>>
>>>> On Fri, Nov 7, 2014 at 1:57 AM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>>>> Hi All,
>>>>
>>>> I have been reviewing our requirements for Content translation analytics and I have a few questions/requests. I am sending them to the language team list and Leila and Dan in the hopes of getting some more clarity. I will add the same content to the Trello card.
>>>>
>>>> In the weekly team meeting earlier today we agreed that the first metric we want to collect data for is the number of articles created in each language over time. This is something Amir has already set up our current Event Logging to track. Now that Kartik has enabled EL in beta, that part should be done. Since we are only just turning it on, there will be very little data until people create more articles using CX. However, we should be set up to collect any new data that comes in.
>>>>
>>>> @Leila, can you verify that the db table now exists for the ContentTranslation schema? If it doesn’t, can you point us to the right people we need to work with to troubleshoot the issue? Also, you mentioned in our meeting that personal data may soon be purged after 90 days as part of a new privacy policy. Could you explain that a bit more or point us to more information? If this is the case, it may affect some of the metrics we would like to collect in the future.
>>>>
>>>> @Dan, what do we need to do next in order to set up a very simple visualization that would show the number of articles created per week by language? Pau has an image of what he would like on the Trello card. You mentioned something about being able to host a dashboard for us on one of the Limn servers you already have set up.
>>>>
>>>> @Santhosh, I believe you said earlier that you have a script you use to export the data for the ULS analytics. If so, can you please share it in case we need a similar script for CX, so I don’t have to write a new one from scratch?
>>>>
>>>> @Pau, @Amir There is a section called High priorities for product management on the Content translation analytics page. Did these priorities come from outside the team or does this just represent our own internal view of the high priorities? If the latter, have these priorities been reviewed by anyone outside the team? I think we are safe to proceed with our current plan, but it would be good to have product sign off on things more generally.
>>>>
>>>> Thanks,
>>>>
>>>> Joel
>>>>
>>>> Joel Sahleen, Software Engineer
>>>> Language Engineering
>>>> Wikimedia Foundation
>>>> jsahleen(a)wikimedia.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Pau Giner
>>>> Interaction Designer
>>>> Wikimedia Foundation
I am trying to migrate Limn graphs to our own handling. Currently Zero
graphs are generated as Limn dashboards, and after I applied this filter
(taken from the HQL for counting article pageviews), I got a rough match
(about 10% discrepancy) between our graphs and Limn. Yet, one partner has
a discrepancy of 10 times, and I would like to see where that mismatch
comes from. I looked at https://github.com/wikimedia/analytics-wp-zero
but it seems there is other code that's missing from that repo. Any
suggestions are welcome. Thanks!
WHERE
webrequest_source IN ('text', 'mobile')
AND year=${year}
AND month=${month}
AND day=${day}
AND x_analytics LIKE '%zero=%'
AND SUBSTR(uri_path, 1, 6) = '/wiki/'
AND (
(
SUBSTR(ip, 1, 9) != '10.128.0.'
AND SUBSTR(ip, 1, 11) NOT IN (
'208.80.152.',
'208.80.153.',
'208.80.154.',
'208.80.155.',
'91.198.174.'
)
) OR x_forwarded_for != '-'
)
AND SUBSTR(uri_path, 1, 31) != '/wiki/Special:CentralAutoLogin/'
AND http_status NOT IN ( '301', '302', '303' )
AND uri_host RLIKE
'^[A-Za-z0-9-]+(\\.(zero|m))?\\.[a-z]*\\.org$'
AND NOT (SPLIT(TRANSLATE(SUBSTR(uri_path, 7), ' ', '_'),
'#')[0] RLIKE '^[Uu]ndefined$')
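A sketch of how the filter above gets wrapped into a per-partner count
(the table name and the zero= value extraction from x_analytics are
assumptions on my side):

SELECT
    regexp_extract(x_analytics, 'zero=([^;]*)', 1) AS carrier,
    COUNT(*) AS pageviews
FROM wmf_raw.webrequest
WHERE
    ...  -- the full filter quoted above goes here
GROUP BY
    regexp_extract(x_analytics, 'zero=([^;]*)', 1);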
Is it being used for anything at all? Puppet has been failing for a
while because it's trying to run the kraken role, which no longer
exists. Can it be deleted?
--
Yuvi Panda T
http://yuvi.in/blog
Streaming for this event will be starting shortly at
https://www.youtube.com/watch?v=-FQ-TtTCdJo
On Thu, Nov 13, 2014 at 10:51 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
> Hey folks,
>
> This month we're holding a special edition of the Research and Data
> showcase
> <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase>.
> We've invited Dr. Yan Chen, Professor at the UMich iSchool, to present her
> work studying community dynamics on Kiva (a micro-lending platform) and
> what her results might imply for Wikimedia's sites. To take advantage of
> her travel schedule, we'll be holding the event on *Friday, November 14
> at 11:30 PST (UTC-8)*, rather than the usual 3rd Wednesday. The event
> will be live streamed and recorded as usual. You can join the conversation
> via IRC on freenode.net in the #wikimedia-research channel.
>
> We look forward to seeing you there,
>
> -Aaron
>
> *Does Team Competition Increase Pro-Social Lending? Evidence from
> Online Microfinance.* By Yan Chen <http://yanchen.people.si.umich.edu/>
>
> In the first half of the talk, I will present our empirical analysis
> of the effects of team competition on pro-social lending activity on
> Kiva.org, the first microlending website to match lenders with
> entrepreneurs in developing countries. Using naturally occurring field
> data, we find that lenders who join teams contribute 1.2 more loans
> per month than those who do not. Furthermore, teams differ in activity
> levels. To investigate this heterogeneity, we run a field experiment
> by posting forum messages. Compared to the control, we find that
> lenders from inactive teams make significantly more loans when exposed
> to a goal-setting message and that team coordination increases the
> magnitude of this effect.
>
> In the second part of the talk, I will discuss a randomized field
> experiment we ran in May 2014, in which we recommended teams to
> lenders on Kiva. We find that lenders are more likely to join teams in
> their local area. However, after joining teams, those who join popular
> teams (on the leaderboard) are more active in lending.
>
Hi,
a test to move the m2 database cluster behind a proxy towards high
availability (ah, the irony!) failed for EventLogging, and left
EventLogging unable to write events to the database between 00:59 and
01:21.
The data for that period is not lost, but is available in backup
files, waiting to be injected back into the database.
Just wanted to let you know what happened, in case you notice drops in
graphs / dashboards during that period.
Sorry for the inconvenience,
Christian
P.S.: Documentation is at the (currently still rather empty) Incident
Report:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hey folks,
This month we're holding a special edition of the Research and Data showcase
<https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase>.
We've invited Dr. Yan Chen, Professor at the UMich iSchool, to present her
work studying community dynamics on Kiva (a micro-lending platform) and
what her results might imply for Wikimedia's sites. To take advantage of
her travel schedule, we'll be holding the event on *Friday, November 14
at 11:30 PST (UTC-8)*, rather than the usual 3rd Wednesday. The event
will be live streamed and recorded as usual. You can join the conversation
via IRC on freenode.net in the #wikimedia-research channel.
We look forward to seeing you there,
-Aaron
*Does Team Competition Increase Pro-Social Lending? Evidence from
Online Microfinance.* By Yan Chen <http://yanchen.people.si.umich.edu/>

In the first half of the talk, I will present our empirical analysis
of the effects of team competition on pro-social lending activity on
Kiva.org, the first microlending website to match lenders with
entrepreneurs in developing countries. Using naturally occurring field
data, we find that lenders who join teams contribute 1.2 more loans
per month than those who do not. Furthermore, teams differ in activity
levels. To investigate this heterogeneity, we run a field experiment
by posting forum messages. Compared to the control, we find that
lenders from inactive teams make significantly more loans when exposed
to a goal-setting message and that team coordination increases the
magnitude of this effect.

In the second part of the talk, I will discuss a randomized field
experiment we ran in May 2014, in which we recommended teams to
lenders on Kiva. We find that lenders are more likely to join teams in
their local area. However, after joining teams, those who join popular
teams (on the leaderboard) are more active in lending.
Hi,
in the week from 2014-10-27 to 2014-11-02, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Hive UDF to parse user agents with ua_parser
* More kafkatee issues
* Database replication getting stuck on 'Duplicate entry'
* Ganglia's Views broke
* Fixing sync of “aggregate-datasets” rsync
* Turning down logstash logging
* 'research' database user
(details below)
Have fun,
Christian
* Hive UDF to parse user agents with ua_parser
A Hive UDF to parse User-Agent strings with ua_parser was merged and
deployed to the Analytics cluster. People with Hive access can now use
this UDF to automatically extract browser, OS, and device
information.
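A sketch of how it can be used (the jar path and the class name are
assumptions; check the refinery documentation for the exact ones):

  ADD JAR /path/to/refinery-hive.jar;
  CREATE TEMPORARY FUNCTION ua_parse
    AS 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
  SELECT ua_parse(user_agent)  -- browser, OS, and device information
  FROM wmf_raw.webrequest
  LIMIT 10;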
* More kafkatee issues
After the previous week's deployment of the new kafkatee build, we
took a closer look at the generated files. While no partitions have
been dropped so far, it turned out that kafkatee is losing lines when
other processes have heavy disk activity.
Even then, the kafkatee output files are still better than what
udp2log can produce, but we're investigating whether the kafkatee
output is good enough for users that need to stream data.
(For non-streaming needs, it currently looks like Hive would be the
more reliable choice.)
* Database replication getting stuck on 'Duplicate entry'
This week we had two more replication lag issues. Of October's five
lag issues, the last three were cases of replication stopping on a
'Duplicate entry' error. Since this seems to be an emerging pattern,
it has been called out with Ops, and while they are aware of it, there
is currently no fix.
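For reference, the usual manual way to get a slave past such an error
looks roughly like this (standard MySQL commands; whether Ops uses
exactly this procedure is an assumption):

  SHOW SLAVE STATUS\G  -- Last_SQL_Error shows the 'Duplicate entry'
  STOP SLAVE;
  SET GLOBAL sql_slave_skip_counter = 1;  -- skip the offending statement
  START SLAVE;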
* Ganglia's Views broke
Ganglia allows custom predefined dashboards (see Ganglia's “Views”
tab), which we use to watch Kafka's and varnishkafka's key metrics.
Some puppet refactorings seem to have broken the existing Ganglia
dashboards. As we appear to be one of the few teams using Ganglia
dashboards regularly, we fixed puppet's Ganglia View setup.
* Fixing sync of “aggregate-datasets” rsync
Some weeks back, work was started to have stat1002's
“aggregate-datasets” directory automatically publish its content to
the website at
http://datasets.wikimedia.org
. Now the final tweaks have been put into place, and automatic
publishing works as expected.
* Turning down logstash logging
It turned out that the combination of the Analytics cluster and other
new log producers accounts for more traffic than the current logstash
setup can handle nicely. So the log level from the Analytics cluster
got turned down until logstash itself gets scaled up.
* 'research' database user
Many researchers and other WMFers are using the 'research' credentials
to access the analytics databases, and the time came to switch those
credentials to a new password. Since the password was not properly
puppetized, discussions were started on how disruptive a change would
be and how best to make it. Puppetization work around it also
started.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
"How I store 200k messages/sec... Using Kafka & librdkafka" :D
Infrastructure for Data Streams
Tadas Vilkeliskis – Nov 10
Three identical queries from the 'research_prod' user have just passed
one month of execution time on s1-analytics-slave:
select count(*)
from staging.ourvision r
where exists (
select *
from staging.ourvision r1
inner join
staging.ourvision r2
on r2.sha1 = r1.sha1
where r1.page_id = r.page_id
and r2.page_id = r.page_id
and r1.timestamp between r.timestamp and DATE_ADD(r.timestamp,
INTERVAL 1 HOUR)
and r2.timestamp between r.timestamp and DATE_SUB(r.timestamp ,
INTERVAL 1 HOUR)
and r1.sha1!= r.sha1
);
I haven't checked to see if the queries are just that amazingly slow,
or if they're part of a larger ongoing transaction. In any case, three
month-long transactions are pushing the resource limits of the slave
and will soon result in either mass replication lag or some other
interesting lockup that may in turn take days to roll back :-)
Can we kill these? Can we optimize and/or redesign the jobs? Happy to
help...
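For the killing part, the standard MySQL commands would be (a sketch;
<id> is a placeholder):

  SHOW FULL PROCESSLIST;  -- find the Id of each runaway query
  KILL <id>;

And as a starting point for a redesign, one possible rewrite as plain
joins, assuming the intent is to count pages where a base revision has
a later revision and an earlier revision sharing a sha1 within an hour
on either side (the intent is an assumption, and note this counts
distinct pages rather than rows, so it is not an exact drop-in):

  SELECT COUNT(DISTINCT r.page_id)
  FROM staging.ourvision r
  JOIN staging.ourvision r1
    ON  r1.page_id = r.page_id
    AND r1.sha1 != r.sha1
    AND r1.timestamp BETWEEN r.timestamp
                         AND DATE_ADD(r.timestamp, INTERVAL 1 HOUR)
  JOIN staging.ourvision r2
    ON  r2.page_id = r.page_id
    AND r2.sha1 = r1.sha1
    AND r2.timestamp BETWEEN DATE_SUB(r.timestamp, INTERVAL 1 HOUR)
                         AND r.timestamp;
  -- An index covering (page_id, timestamp) would likely be needed
  -- for this to finish in reasonable time (assumption).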