Hi all!
For a while now, we’ve been hosting some public datasets at http://stat1001.wikimedia.org/public-datasets. We wanted to dissociate the domain these datasets are hosted at from the actual server name, so we did! The same data is now available at http://datasets.wikimedia.org. Redirects from stat1001.wikimedia.org are in place.
Let us know if you have any trouble.
Thanks!
-Ao
Hi,
The analytics dev team has committed to the following user stories for the
sprint starting today and ending September 2.
Bug ID  Component    Summary                                                                Points
69297   Wikimetrics  Story: EEVS user does not see reports for projects without databases   3
68351   EEVS         Story: AnalyticsEng has website for EEVS                               34
67806   EEVS         Story: EEVSUser loads static site in accordance to Pau's design        13
That’s 50 points in 3 Stories
You can see the sprint here:
http://sb.wmflabs.org/t/analytics-developers/2014-08-21/
Note:
Bug 68507 (replication lag may affect recurrent reports) is carried over
from the previous sprint and will be completed shortly.
Cheers,
Kevin Leduc
Hi,
In the week of 2014-08-25 to 2014-08-31, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Analytics cluster feeding more logs into logstash
* More buffer for kafka brokers
* Life support for webstatscollector on udp2log
* Webstatscollector and kafka
* Webstatscollector counting https requests from ulsfo twice
(details below)
Have fun,
Christian
* Analytics cluster feeding more logs into logstash
The analytics cluster previously fed only the worker nodes' logs into
logstash; now it feeds the namenode logs in as well.
* More buffer for kafka brokers
During partition leader re-elections, kafka brokers sometimes drop
a few log lines. Since the buffers could not hold messages for as long
as a re-election might take, the buffer size was increased. This
should let the brokers ride out a partition leader re-election without
dropping messages.
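As a purely illustrative sketch of the principle (buffer for longer
than a re-election can take, and retry in the meantime), here is what
such a setting looks like in a librdkafka-based Python producer; the
broker list, topic, and values are assumptions, not our actual
configuration:

  from confluent_kafka import Producer

  # Illustrative only: buffer client-side for longer than a leader
  # re-election can take, and retry sends that fail in the meantime.
  # Broker list, topic, and values are made up for this sketch.
  producer = Producer({
      'bootstrap.servers': 'broker1:9092,broker2:9092',
      'queue.buffering.max.ms': 300000,        # hold messages up to 5 minutes
      'queue.buffering.max.messages': 1000000,
      'message.send.max.retries': 10,          # ride out the re-election
      'retry.backoff.ms': 500,
  })

  producer.produce('webrequest', value=b'<one log line>')
  producer.flush()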
* Life support for webstatscollector on udp2log
The production webstatscollector (the software that produces the
hourly pageview files used, for example, by stats.wikimedia.org and
stats.grok.se) that consumes from udp2log started to produce faulty
files. A no-longer-needed service on the host that runs part of
webstatscollector was hogging resources, so it was stopped to free
them up. Strangely enough, those additional resources made
webstatscollector misbehave even more: the disks could no longer
handle the load. After switching the service to write to a RAM disk,
the host could handle the write load again. This switch not only
brought webstatscollector back to life, it also decreased packet loss
on the collector by a bit more than an order of magnitude.
* Webstatscollector and kafka
Last week we reported that we had spun up a webstatscollector instance
that consumes from kafka instead of udp2log, and that the setup caused
some issues at first. We have now monitored the “webstatscollector on
kafka” setup for a week, and it produced the data extremely reliably.
So with this webstatscollector on kafka, we have a good baseline to
compare against when trying to scale webstatscollector up to Hadoop.
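For readers unfamiliar with the setup, this minimal sketch shows the
idea of consuming the request stream from kafka rather than udp2log;
the topic, group id, brokers, and field layout are assumptions, and
the production collector is C code, not this Python:

  from collections import Counter
  from confluent_kafka import Consumer

  # Minimal sketch of "webstatscollector on kafka": read request log
  # lines from a kafka topic instead of udp2log and feed them to the
  # same counting logic. All names below are hypothetical.
  consumer = Consumer({
      'bootstrap.servers': 'broker1:9092',
      'group.id': 'webstatscollector-sketch',
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['webrequest'])

  counts = Counter()
  try:
      while True:
          msg = consumer.poll(1.0)
          if msg is None or msg.error():
              continue
          fields = msg.value().decode('utf-8', 'replace').split(' ')
          if len(fields) > 8:          # crude guard against malformed lines
              counts[fields[8]] += 1   # assume field 8 is the request URL
  finally:
      consumer.close()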
* Webstatscollector counting https requests from ulsfo twice
While working on establishing the “webstatscollector on kafka”
baseline, we discovered that the udp2log webstatscollector counts
https requests from ulsfo twice. The corresponding fix was merged the
same day, but due to “no deploys on Fridays” the deploy did not happen
last week. (It has since been deployed, and the numbers look good.)
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Christian Aistleitner
Kefermarkterstraße 6a/3         Email: christian(a)quelltextlich.at
4293 Gutau, Austria             Phone: +43 7946 / 20 5 81
                                Fax:   +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Companies' registry: 360296y in Linz
---------------------------------------------------------------
Hi,
There has been another issue around webstatscollector, the software
that produces the raw numbers for stats.wikimedia.org and
stats.grok.se.
Due to a deployment regression, requests to Special:CentralAutoLogin/*
have been counted since 2014-07-07, although they should not have
been.
While this does /not/ impact the per-page numbers [1], it results in
overreporting of the per-wiki numbers. The size of the impact strongly
depends on the wiki (e.g. jawiki: ~0.5%, ruwiki: ~2%).
The regression has since been fixed; the corresponding bug is
https://bugzilla.wikimedia.org/show_bug.cgi?id=70295
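To make the shape of the fix concrete, here is a minimal, purely
illustrative sketch of such a filter; webstatscollector's actual
filter is C code, and the URL pattern here is an assumption:

  import re

  # Sketch of the filter idea: drop Special:CentralAutoLogin/* requests
  # before they reach the per-wiki counters. The real fix lives in
  # webstatscollector's C filter; this is illustrative only.
  CENTRAL_AUTO_LOGIN = re.compile(r'/wiki/Special:CentralAutoLogin/')

  def should_count(url):
      """Return True if a request should contribute to view counts."""
      return CENTRAL_AUTO_LOGIN.search(url) is None

  assert should_count('https://ja.wikipedia.org/wiki/Main_Page')
  assert not should_count(
      'https://ja.wikipedia.org/wiki/Special:CentralAutoLogin/start')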
Best regards,
Christian
[1] So pages like
http://stats.grok.se/en/latest30/Main_Page
are not affected by this bug.
Hello!
I've been working for the last few days on
https://github.com/Ironholds/WPDMZ, which currently generates raw data
on 'number of non-bot edits per country', and I'd like to run some
stats / make some graphs based on it. Since I'd like all my
'research' to be completely repeatable, I'd love it if we could make
the 'raw data' (edits per country) publicly available on labsdb. I
have most of the code written for it, *but* it needs anonymization.
The biggest de-anonymization threat involves identifying which editors
come from which countries. It can be executed in the following case:
an editor is the only person editing from a country in a project where
that country has low edit volume; by a process of elimination and by
counting edits from a public source (like recentchanges), the
individual editor can then be connected to a particular country.
I propose the following anonymization scheme:
1. No data for projects with fewer than a threshold of total
*individual editors* in the time period for which the data is
released.
2. For countries that have less than a threshold % of 'individual
editors' in the time period, we simply lump them in as 'other'.
This removes most of the de-anonymization attacks I can think of (a
rough sketch of the logic is below). Thoughts? I can easily write up
the code to generate these on a monthly basis and puppetize it to make
the data publicly available. I think not just me but lots of external
researchers would benefit from such data.
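To make the proposal concrete, here is a rough sketch of the two
rules; the threshold values and the input data shape are placeholders
I made up, not part of the proposal:

  # Rough sketch of the proposed two-rule anonymization, applied to a
  # mapping like {project: {country: individual_editor_count}}.
  # The thresholds are placeholders, not agreed-upon values.
  MIN_EDITORS_PER_PROJECT = 50  # rule 1: suppress small projects entirely
  MIN_COUNTRY_SHARE = 0.01      # rule 2: lump countries under 1% into 'other'

  def anonymize(editors_by_project_country):
      result = {}
      for project, by_country in editors_by_project_country.items():
          total = sum(by_country.values())
          if total < MIN_EDITORS_PER_PROJECT:
              continue                  # rule 1: release no data at all
          anonymized = {}
          for country, editors in by_country.items():
              if editors / total < MIN_COUNTRY_SHARE:
                  # rule 2: fold low-share countries into 'other'
                  anonymized['other'] = anonymized.get('other', 0) + editors
              else:
                  anonymized[country] = editors
          result[project] = anonymized
      return result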
Thanks!
--
Yuvi Panda
http://yuvi.in/blog