Analytics February 2013

analytics@lists.wikimedia.org

33 participants
26 discussions

by Magnus Manske

Hi all, as you might know, I have a few GLAM-related tools on the toolserver. Some are updated once a month, some can be used live, but all are in high demand by GLAM institutions. Now, the monthly updated stats have always been slow to run, but recently they almost ground to a halt. The on-demand tools have stalled completely. All these tools get their data from stats.grok.se, which works well but is not really high-speed; my on-demand tools have apparently been shut out recently because too many people were using them, DDOSing the server :-( I know you are working on page view numbers, and from what I gather it's up and running internally already. My requirements are simple: I have a list of pages on many Wikimedia projects; I need per-page view counts for these pages for a specific month. Now, I know that there is no public API yet, but is there any way I can get to the data, at least for the monthly stats? Cheers, Magnus

10 years, 10 months

Skin and active editor correlation

by Matthew Flaschen

There's discussion at https://bugzilla.wikimedia.org/show_bug.cgi?id=44448 about how skin usage correlates with who's an active editor. It would be great to know what percentage of active editors (5+ edits in the main namespace) use each skin on English Wikipedia, perhaps for the last three months. Matt Flaschen

11 years

Fundraising wants to model user behaviour

by Matthew Walker

All, Fundraising is proposing an experiment to model user behavior on our properties. I've written an RfC on exactly what I'm proposing here [1]. I would love any comments, concerns, methodology changes, and additional considerations you might have. [1] http://meta.wikimedia.org/wiki/User_site_behavior_collection Thanks ~Matt Walker Wikimedia Foundation Fundraising Technology Team

11 years, 1 month

RFC: Tab as field delimiter in logging format of cache servers

by Diederik van Liere

Apologies for crossposting.
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as field delimiter on the varnish/squid/nginx servers. We would like to do this on February 1st. The reason for this change is that we need to have a consistent number of fields in each webrequest log line. Right now, some fields contain spaces, which requires a lot of post-processing cleanup and slows down the generation of reports.
What is affected and maintained by Analytics:
* udp-filter: already has support for the tab character
* webstatscollector: we compiled a new version of filter to add support for the tab character
* wikistats: we will fix the scripts on an ongoing basis
* udp2log: we have a patch ready for inserting sequence numbers separated by tab
In particular, I would like to have feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a tab, that need to be fixed, and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all preps are finished)
Best, Diederik
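To illustrate the field-count problem described in this message, here is a minimal sketch. The log line, its field order, and the helper name split_fields are hypothetical and are not the real webrequest format; the only point is that a field containing spaces (here a user-agent string) breaks a space-delimited split but not a tab-delimited one.

```python
# Hypothetical log fields; the last one (a user agent) contains spaces.
# This is NOT the real webrequest log layout, only an illustration.
FIELDS = [
    "cp1001.example",
    "42",
    "2013-02-01T00:00:00",
    "GET",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def split_fields(line, delimiter):
    """Split one log line on the given delimiter."""
    return line.split(delimiter)


# The same record written once with spaces and once with tabs.
space_line = " ".join(FIELDS)
tab_line = "\t".join(FIELDS)

# Space-delimited: the user agent is split into extra pieces,
# so the number of fields no longer matches the record.
space_count = len(split_fields(space_line, " "))

# Tab-delimited: every record yields the same number of fields.
tab_count = len(split_fields(tab_line, "\t"))

print("fields written:", len(FIELDS))
print("fields read with space delimiter:", space_count)
print("fields read with tab delimiter:", tab_count)
```

With a tab delimiter the parsed fields round-trip exactly, which is the "consistent number of fields" property the message asks for. A tab inside a field would still break this, so the sketch assumes field values never contain tabs.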

11 years, 1 month

FW: Breakdown of page views by subject?

by Navino Evans

Hi, I'm trying to find some page view stats for Wikipedia articles broken down by subject matter (like 'history', or 'science'). So I would like to be able to find out what fraction of total Wikipedia page views are for articles belonging to a particular category like 'History' or "The Arts". If it helps, an example of the kind of information I was hoping for can be found in this video of Jimmy Wales https://www.youtube.com/watch?v=IhumTKbmdFs (chart is shown at time 12:40). I would be really grateful for any suggestions on getting this kind of data or similar. Many thanks Navino

11 years, 1 month

ApacheCon 2013 talk on Kafka Replication

by David Schoonover

Of interest to some, Jun Rao gave a talk at ApacheCon about Kafka[1] replication[2] (scheduled to land in 0.8 in March). I've pulled out some bits perhaps of interest.
Updated stats about LinkedIn's experience with Kafka:
- Writes: >10B messages/day (>2TB compressed data)
- Reads: >50B messages/day (>1PB compressed data)
- Typical failover time after a broker failure: <10ms
Slides 14, 18-20 talk about its replication model for eventual consistency, interesting as it intentionally makes tradeoffs to take advantage of intra-datacenter latency being an order of magnitude(ish) better than that between DCs connected by the open internet. In exchange for some extra chatter, they tolerate 2f failures among 2f+1 replicas. Clever, and clearly it works for them. (See slides 21-22 for unhelpful diagrams, 27-31 for interesting performance numbers, excepting slide 28's totally inexplicable durability column using highly scientific measures like "some data loss" vs "a few data loss". What.)
Pretty neat stuff, and it's great to see a built-in solution for cross-DC replication.
[1] http://kafka.apache.org/ -- now out of incubator!
[2] http://www.slideshare.net/junrao/kafka-replication-apachecon2013
-- David Schoonover dsc(a)wikimedia.org
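The "2f failures among 2f+1 replicas" figure can be checked with a little arithmetic. The sketch below is only one reading of the message: it assumes the replication model needs a single surviving in-sync replica to keep data available, and compares that with a generic majority-quorum scheme. The function names are hypothetical and do not correspond to any Kafka API.

```python
def tolerated_with_one_survivor(replicas):
    """Failures tolerated if one surviving in-sync replica is enough.

    Assumption for illustration: every replica is fully in sync, so
    any single survivor holds all committed data.
    """
    return replicas - 1


def tolerated_with_majority_quorum(replicas):
    """Failures tolerated if a strict majority must stay alive."""
    return (replicas - 1) // 2


for f in (1, 2, 3):
    replicas = 2 * f + 1
    print(
        "replicas=%d one-survivor tolerates %d, majority quorum tolerates %d"
        % (
            replicas,
            tolerated_with_one_survivor(replicas),
            tolerated_with_majority_quorum(replicas),
        )
    )
```

Under these assumptions, 2f+1 replicas survive 2f failures in the one-survivor model but only f failures under a majority quorum, which matches the tradeoff the message describes (more chatter to keep replicas in sync, in exchange for higher failure tolerance).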

11 years, 1 month

Team Meeting Notes

by David Schoonover

Heya all, just a reminder that all our meeting notes are public! The index of notes can be found at:
- https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings
Relatedly, we've started taking notes at our daily stand-ups and appended here (for the rest of 2013, anyway):
- https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/2013_Scrum
We're not going to spam everybody and email out the notes, so watch that page if you want the nitty-gritty of what we're up to. (If you want notifications about *all* meeting notes, you could also watch the index.)
Finally, we've started weekly review meetings for improving our team processes. The notes from today's meeting are here:
- https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/2013_02_2…
Cheers,
-- David Schoonover dsc(a)wikimedia.org

11 years, 1 month

Announcement: EventLogging workshop - 3/7

by Dario Taraborelli

We're organizing a half-day workshop for engineers, analysts, PMs and other parties interested in learning how to use EventLogging. EventLogging [1] is a MediaWiki extension developed by the E3 team that allows the collection of data on how users interact with our site. It's been largely adopted in Product/Mobile/Feature engineering to run A/B tests and to evaluate experimental features but can be used more generally to identify usability problems and to collect data to inform feature design.
Whether you are already planning to use EventLogging for an existing project or you are just curious to learn how it works, the session will cover a typical workflow:
1) turning an idea into a data model
2) instrumenting MediaWiki to log events
3) accessing and QA'ing log data
4) performing simple log data analysis
The workshop [2] will be hosted at the Wikimedia Foundation (Collab space, 6th floor) on March 7 between 1.30pm-5pm. The whole E3 engineer line-up will be in the office to provide hands-on demos and tutorials. If you are interested in attending, please sign up on the workshop page. The session will be recorded but we're not currently planning to stream it.
Dario
[1] https://www.mediawiki.org/wiki/EventLogging
[2] https://www.mediawiki.org/wiki/EventLogging/Workshop

11 years, 2 months

Quick note for anyone who edits EventLogging schemas

by Steven Walling

Ori enabled the CodeEditor extension on the Schema namespace today. See: https://meta.wikimedia.org/wiki/Meta:Babel#CodeEditor_extension_deployment_…: for reference. -- Steven Walling https://wikimediafoundation.org/

11 years, 2 months

On handling data & analytics bug/feature requests

by Diederik van Liere

Hi everyone, We are glad to see that more people are finding their way to this mailing list, and that's really cool! Often, a thread will involve a request for a new dataset, a bugfix or a new feature. I am trying to keep track of all these requests as best I can, but what would really help is if you could file a request in Bugzilla under Product Analytics and then use either the General component (if you don't know exactly where it should go) or the appropriate component. If a component is missing, please let me know and I will get it added. Thanks for your cooperation and keep those requests coming! Best, Diederik

11 years, 2 months
