Hi,
it seems ops received a request to add the negotiated cipher suite to
the cache logs for https requests.
Would it hurt any of our tools or would we expect breakage if for
example
$ssl_cipher
from
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#variables
got appended as field #17 to the format currently described at
https://wikitech.wikimedia.org/wiki/Cache_log_format
?
Do we know of other reasons to veto such a change?
(If I do not hear of any problems by 2014-01-22, I'll let ops know that
appending $ssl_cipher is fine for us.)
Best regards,
Christian
P.S.:
* Webstatscollector ignores additional fields, hence should be safe.
* Wikipedia zero ignores additional fields, hence should be safe.
* The mobile jobs that we moved out of Hadoop some time ago (and which
I guess are currently unused) ignore additional fields, hence should
be safe.
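To illustrate why the tools above are unaffected, here is a minimal sketch (with a hypothetical field count; the real format is the one documented on the wikitech page above) of whitespace-split parsing that only reads the columns it knows about, so an appended $ssl_cipher column cannot break it:

```python
def parse_cache_log_line(line, known_fields=16):
    """Split a cache log line on whitespace and keep only the first
    `known_fields` columns; any appended columns (e.g. a 17th
    $ssl_cipher field) are silently ignored."""
    parts = line.split()
    return parts[:known_fields]

# A line with 16 fields and the same line with a cipher suite appended
# parse to the same result:
old = " ".join("f%d" % i for i in range(16))
new = old + " ECDHE-RSA-AES128-GCM-SHA256"
assert parse_cache_log_line(old) == parse_cache_log_line(new)
```

Tools that index fields from the end of the line, rather than the start, would of course still break.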
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Christian Aistleitner             Companies' registry: 360296y in Linz
Gruendbergstrasze 65a             Email: christian(a)quelltextlich.at
4040 Linz, Austria                Phone: +43 732 / 26 95 63
                                  Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
OpenPGP key transition from 0xEF78CCDE to 0x13C1072F:
http://quelltextlich.at/openpgp-transition-0xEF78CCDE-to-0x13C1072F.txt
So I had been replying via email to the bugzilla-daemon, for *months*, and
thinking that it would just update bugzilla. It does *not*. So
1. sorry for my apparent silence on the 8 bugs affected, 56030, 42259,
58416, 58633, 59846, 58450, 60095, 58208
2. keep this in mind in case you, too, are tempted to talk to soul-less
uncompassionate machines
http://istrategylabs.com/2014/01/3-million-teens-leave-facebook-in-3-years-…
True, people have different motivations (!) for using Wikipedia and FB.
But if other sites are strongly affected by changing usage patterns
among different age groups, maybe it's worth investigating whether
something similar is happening to Wikimedia sites, too? Do we have data
on page views/user activity by age group?
Cheers,
Andrew
According to the stats, the last 12 months have seen a decline in overall
page views by about 9-10%.
As far as I can tell these numbers do not include access to Wikipedia and
sister projects through the API. And if I understand correctly, this means
that e.g. access through the Wikipedia app is ignored.
Considering that mobile access through the browser is about 15-20% of the
overall access (if I read the graphs correctly), is it possible that the
current page view numbers are underreported?
If a reasonable portion of mobile users use the app instead of the browser,
then these numbers should somehow be included in the page view statistics,
or wrong conclusions might be drawn.
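As a back-of-the-envelope sketch with made-up numbers (the real app-versus-browser split is exactly what we don't know): if browser-based mobile traffic is ~17.5% of counted page views (the midpoint of the 15-20% estimate above) and, hypothetically, one in four mobile readers uses the app instead of the browser, the uncounted app traffic would already be noticeable:

```python
counted_total = 100.0            # counted page views (arbitrary units)
mobile_browser_share = 0.175     # midpoint of the 15-20% estimate above
app_to_browser_ratio = 1 / 3.0   # hypothetical: 1 app reader per 3 browser readers

mobile_browser = counted_total * mobile_browser_share
uncounted_app = mobile_browser * app_to_browser_ratio
undercount_pct = 100 * uncounted_app / (counted_total + uncounted_app)
print("estimated undercount: %.1f%%" % undercount_pct)
```

Under these assumed numbers roughly one in twenty page views would be missing from the statistics; the point is only that the undercount scales directly with the unknown app-usage ratio.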
Any ideas?
Denny
Just saw that my tool was mentioned on this list.
I have a running commons category intersection available as a gadget
on commons now.
https://commons.wikimedia.org/w/index.php?title=Help:FastCCI&withJS=MediaWi…
It is less of an analysis tool, and more of a user interface enhancement.
Right now the intersection is only used to dig for Featured pictures,
Quality images, and Valued images. Results are delivered quite fast
through a websocket connection, which allows the streaming of progress
updates.
Cheers,
Daniel
Hello Analytics people!
I have a specific analytics question about how the category tree on
commons is used. To get started I have drafted a schema at
https://meta.wikimedia.org/wiki/Schema:CommonsCategoryTreeUse
The description on the talk page of that schema is copied below.
Thanks for considering it!
Best,
Daniel - [[User:Dschwen]]
Question
How are anonymous users using the commons category tree to find
images, compared to logged-in users? Is the category tree being used
to discover images?
The proposed schema should emit events on page views
and on category link clicks. The event data should contain the login
status (logged in/not logged in) and the current namespace number.
Analysis
The following analyses would be performed on the dataset:
* Category page visitation frequency compared to image page visitation
frequency for logged-in and logged-out users. How much relative "time"
is each group spending in the category namespace? This could indicate
whether categories are a significant path for the discovery of images
(as opposed to direct jumps to image pages from internal/external
search).
* Category link click rates in the category and image namespaces. These
metrics (again for logged-in and logged-out users) would indicate
whether the category tree is actively browsed (rather than stumbled
upon).
* Category link clicks in the image namespace are an indicator for the
effectiveness of categories to find similar content.
* Category link clicks in the category namespace are an indicator for
browsing the category tree to find specific content.
Rationale
The motivation for this study is to find out the significance of the
category tree in content discovery on Wikimedia Commons. This directly
impacts decisions about gadget default deployment, such as for the
FastCCI gadget, which would benefit anonymous users (if the category
tree is a significant funnel for content discovery). The schema is
designed to collect a minimal amount of data in a maximally anonymized
way.
The data to be logged should be considered inexpensive (standard
identifiers isAnon and pageNS in the schema). I have no clue how the
link click action will be logged, but determining the namespace from
the link target should be rather trivial (using mw.Title, for example).
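As an illustrative sketch only: in production one would presumably use mw.Title to resolve the namespace of a link target, but the idea can be shown with a hypothetical standalone helper that maps a title prefix to the namespace number the schema's pageNS field would record (6 = File, 14 = Category on Commons):

```javascript
// Hypothetical helper, not the mw.Title API: map a page title's
// prefix to its namespace number (6 = File, 14 = Category; 0 = main).
function namespaceOf(title) {
  const prefixes = { "File:": 6, "Category:": 14 };
  for (const [prefix, ns] of Object.entries(prefixes)) {
    if (title.startsWith(prefix)) return ns;
  }
  return 0; // main namespace
}
```

A click handler on category links could then call something like namespaceOf on the link target's title and emit it alongside the login status.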
[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia
1) easily;
2) reproducibly;
3) in a way that does not depend on parsing artifacts.
Presently, when people want to do this, they either do their own parsing of the dumps, use the SQL data, or download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came when trying to add the category links. I didn't realize (well, it's not very well documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information, and it is a pity that so much huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get, in effect, all people pages (this is a bit more complicated--there are many false positives--but after a couple of fixes it worked quite well).
Moreover, one continuously has this feeling of walking on eggshells: a small change in bliki, a small change in the XML format, and everything might stop working in such a subtle manner that you realize it only after a long time.
I was wondering if Wikimedia would be interested in distributing in compressed form the Wikipedia graph. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is actually quite standard in distributing large (well, actually much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Web Crawl http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantage of a binary compressed form is reduced network utilization, instantaneous availability of the information, etc.
Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content link, etc. It is immediate, using WebGraph, to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.
In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We should provide graphs, and a bidirectional node<->title map. All such information would use about 300M of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.
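The scheme above can be sketched on toy data (made-up titles, not a real dump): assign contiguous node IDs by lexicographic title order, keep a bidirectional node<->title map, and superpose two edge sets (e.g. content links and category links) into a single graph, as a WebGraph union would:

```python
titles = ["Category:People", "Alan Turing", "Ada Lovelace"]

# Contiguous ID space induced by the lexicographic order of titles.
sorted_titles = sorted(titles)
title_to_id = {t: i for i, t in enumerate(sorted_titles)}
id_to_title = sorted_titles  # the inverse map is just list indexing

# Two graphs over the same ID space, as sets of (source, target) pairs.
content_links = {(title_to_id["Ada Lovelace"], title_to_id["Alan Turing"])}
category_links = {
    (title_to_id["Ada Lovelace"], title_to_id["Category:People"]),
    (title_to_id["Alan Turing"], title_to_id["Category:People"]),
}

# The union is a superposition usable transparently as one graph.
union = content_links | category_links
```

People could then associate any external per-page data to nodes by looking the title up in title_to_id, which is the point of fixing the ID space once.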
But this last part is just rambling. :)
Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.
Ciao,
seba
Hi all,
The Foundation has released the first draft of the data retention
guidelines.
https://meta.wikimedia.org/wiki/Data_retention_guidelines
We'd like to solicit your feedback on these guidelines. You can use the
talk page on the above document.
We really appreciate your help in finding the right balance between our
values, privacy, and the ability to improve the site and make data
available to the community.
Thanks,
-Toby