The folks on Meta are considering whether to enable WikiLove, and they were
hoping to find some data about it. There is a research project on
Meta about WikiLove (https://meta.wikimedia.org/wiki/Research:WikiLove),
but it seems to have been "in progress" since 2011. Could someone in
Analytics update that page to indicate that it is no longer in progress (or
finish whatever piece was still ongoing)?
It would also be great if someone from Analytics could respond to the
questions and comments about the research data at
https://meta.wikimedia.org/wiki/WikiLove#Support_for_another_discussion_abo…
.
Thanks!
Hello,
I am an economist working on a research project analyzing contribution
behavior on Wikipedia, and am interested in computing the fraction of
individuals who use the site and make contributions. I have found the data
on daily contributions, and am now looking for data on the number of
individuals who use Wikipedia. I believe I have found the data I am looking
for here:
https://stats.wikimedia.org/EN/TablesUsageVisits.htm
Unfortunately, the data contained in that file only covers the period from
August 2002 through October 2004. Does a similar database exist for later
time periods? Any information you can provide would be greatly
appreciated. Thank you for your time, and I look forward to hearing from you.
Sincerely,
Nathan Marwell
Detailed technical report on an undergraduate student project at Virginia
Tech (work in progress) to import the entire English Wikipedia history dump
into the university's Hadoop cluster and index it using Apache Solr, to
"allow researchers and developers at Virginia Tech to benchmark
configurations and big data analytics software":
Steven Stulga, "English Wikipedia on Hadoop Cluster"
https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
IIRC this has rarely or never been attempted due to the large size of the
dataset - 10TB uncompressed. And it looks like the author here encountered
an out of memory error that he wasn't able to solve before the end of
term...
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
--
Sent from Gmail Mobile
Hi all!
For years now, y’all have been accessing the Analytics Hadoop Cluster using
stat1002. This works just fine, but others use stat1002 for number
crunching outside of Hadoop as well. At times stat1002 can get pretty
overloaded, which can make accessing Hadoop via this one box a little
annoying.
But fret no longer! stat1004 is here! stat1004 can now be accessed by
anyone in the analytics-privatedata-users and analytics-users groups. If
you previously had access to stat1002 AND used it to talk to Hive and
Hadoop, you can now also do this from stat1004. If you already have a
Hadoop account, you don’t have to do anything new to get access.
stat1002 will remain usable as is. If you are looking for a more
dedicated place from which to interact with Hadoop services, use stat1004
instead.
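For reference, a typical session might look like the following (hostname and
group setup as described on wikitech; the exact FQDN and HDFS path here are
illustrative assumptions, not authoritative):

```shell
# Log in to the new Hadoop client host (via the bastion, per wikitech docs).
ssh stat1004.eqiad.wmnet

# Once there, the usual Hadoop/Hive tools work as they did on stat1002:
hive -e 'SHOW DATABASES;'     # query Hive
hdfs dfs -ls /wmf/data        # browse HDFS (path is illustrative)
```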
I’ve updated the wikitech documentation accordingly. Let us know if you
have any questions!
-Andrew
Hi all,
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
data for 11–31 December. Nothing very insightful, but I don't recall
seeing it done before, so it might be of interest!
http://www.generalist.org.uk/blog/2016/most-popular-videos-on-wikipedia/
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
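For anyone wanting to reproduce this kind of tally, the aggregation itself is
simple once you have per-file play counts; a minimal sketch (the input format
here is a guess for illustration, not Bartosz's actual schema):

```python
from collections import defaultdict

def top_media(rows, n=3):
    # rows: (file_name, play_count) pairs, e.g. one row per file per day.
    # Returns the n most-played files with their aggregate totals.
    totals = defaultdict(int)
    for name, plays in rows:
        totals[name] += plays
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]

# Example: two days of plays for one file, one day for another.
rows = [("a.webm", 10), ("b.ogv", 5), ("a.webm", 7)]
print(top_media(rows, 2))  # [('a.webm', 17), ('b.ogv', 5)]
```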
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
Hi all!
We just noticed a problem with the (old) version of the kafka-python client
we are using to produce EventLogging events to Kafka: it doesn’t handle
creation of new topics now that we’ve upgraded the Kafka cluster to 0.9.
This means that until we fix it, events produced to new schemas will not be
saved. We will fix this ASAP (hopefully by tomorrow), but in the meantime,
don’t make new schemas! :) I will update again once we have this fixed.
Sorry for the trouble!
-Andrew
Thanks; Nielsen data can indeed be very useful, I asked about it earlier
because I'd love to have it again for Italy.
https://meta.wikimedia.org/w/index.php?title=Talk:ComScore/Announcement&old…
Nemo
Tilman Bayer, 11/05/2016 19:23:
> New study (US only) by the Knight Foundation:
> https://medium.com/mobile-first-news-how-people-use-smartphones-to ,
> summarized here:
> http://www.theatlantic.com/technology/archive/2016/05/people-love-wikipedia…
>
> "People spent more time on Wikipedia’s mobile site than any other news
> or information site in Knight’s analysis, about 13 minutes per month
> for the average visitor. CNN wasn’t too far behind, at 9 minutes 45
> seconds per month. BuzzFeed clocked in third at 9 minutes 21 seconds
> per month. (BuzzFeed, however, slays both CNN and Wikipedia in time
> spent with the sites’ apps, compared with mobile websites. BuzzFeed
> users devote more than 2 hours per month to its apps, compared with
> about 46 minutes among CNN app users and 31 minutes among Wikipedia
> app loyalists.)
>
> Another way to look at Wikipedia’s influence: Wikipedia reaches almost
> one-third of the total mobile population each month, according to
> Knight’s analysis, which used data from the audience-tracking firm
> Nielsen."
>
>
Forwarding because this may be of interest to Analytics subscribers as well.
Pine
---------- Forwarded message ----------
From: "Thomas Steiner" <tomac(a)google.com>
Date: May 2, 2016 01:18
Subject: [Wiki-research-l] [ANN] Wikipedia Tools for Google Spreadsheets
To: "Thomas Steiner" <tomac(a)google.com>
Cc: "public-lod(a)w3.org" <public-lod(a)w3.org>, "Semantic Web" <
semantic-web(a)w3.org>, "Discussion list for the Wikidata project." <
wikidata(a)lists.wikimedia.org>, "Research into Wikimedia content and
communities" <wiki-research-l(a)lists.wikimedia.org>
Esteemed Wikipedia, Wikidata, Linked Data, and Semantic Web communities[*],
===
tl;dr: Released a Google Spreadsheets add-on called Wikipedia Tools
[1] that makes working with data from Wikipedia and Wikidata a breeze.
===
I am happy to release a Google Spreadsheets add-on called Wikipedia
Tools [1]. This add-on allows you to work with data from Wikipedia and
Wikidata from within a spreadsheet context using custom formulas. Let
me motivate the tools with a short example:
You may have heard of Volkswagen's #DieselGate scandal. Is this still
a problem for Volkswagen—and if so, where? Google Trends to the
rescue? Maybe [2]. But what about global impact? How do people in
Korea, an important Volkswagen export market [citation needed😉],
refer to the scandal? Turns out they call it 폭스바겐 배기가스 조작 (among
probably other options).
With a custom function from Wikipedia Tools, we can safely "translate"
an English Wikipedia article (English being a language that, for the
sake of this example, we assume we command well enough) into many other
languages (that we do not necessarily command):
=WIKITRANSLATE("en:Volkswagen_emissions_scandal")
bg Афера на Фолксваген
cs Dieselgate
de VW-Abgasskandal
[…]
zh 福斯集團汽車舞弊事件
Then, using Wikipedia page views as one reasonable popularity indicator
(among others), for each of these language results, for example
for Korean, we can get =WIKIPAGEVIEWS("ko:폭스바겐 배기가스 조작") for the last
n days and plot the results [3] (in practice, you would probably
still normalize by the size and/or total views of the particular
Wikipedia[**]).
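Outside of Sheets, the interlanguage lookup that WIKITRANSLATE performs can
be approximated with the MediaWiki langlinks API; a minimal sketch (my own
approximation, not the add-on's actual implementation):

```python
import urllib.parse

API = "https://{lang}.wikipedia.org/w/api.php"

def langlinks_url(lang, title):
    # Build a MediaWiki API query for the interlanguage links of one article.
    params = {
        "action": "query",
        "format": "json",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
    }
    return API.format(lang=lang) + "?" + urllib.parse.urlencode(params)

def parse_langlinks(response):
    # Extract {language_code: title} from the API's JSON response.
    out = {}
    for page in response["query"]["pages"].values():
        for ll in page.get("langlinks", []):
            out[ll["lang"]] = ll["*"]
    return out
```

Fetching langlinks_url("en", "Volkswagen_emissions_scandal") and feeding the
JSON to parse_langlinks yields the same language→title mapping as above.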
There are a lot more custom functions implemented than I could cover
in this short example. I have put together a slide deck [4] and paper
[5] that go into more detail if you are interested, a demo with all
functions is available at [6]. The add-on also has a built-in manual
(in Google Sheets, click Add-ons→Wikipedia Tools→Show documentation)
and its underlying code is open-source [7].
Please let me know in case of any open question, feature request, or
bug. Thanks!
Cheers,
Tom
--
[1] http://bit.ly/wikipedia-tools-add-on
[2]
http://www.google.com/trends/explore?hl=en-US&q=volkswagen+emissions+scanda…
[3]
https://docs.google.com/spreadsheets/d/1PyFq59iEeLWpPQrWDUyU8mlmQrb4GDv2QEl…
[4] bit.ly/wikipedia-tools-slides
[5] bit.ly/wikipedia-tools-paper (PDF)
[6]
https://docs.google.com/spreadsheets/d/1sVduZul787O-bRzuy0UKpRl7bkouxwaIOsx…
[7] https://github.com/tomayac/wikipedia-tools-for-google-spreadsheets/
[*] Cross-posted on purpose
(http://ruben.verborgh.org/blog/2014/01/31/apologies-for-cross-posting/),
please choose your reply options accordingly.
[**] This is a simple example for illustrative purposes, I do _not_
claim it is an accurate popularity prediction, nor do I mean to bash
Volkswagen.
--
Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
https://twitter.com/tomayac)
Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
Registration office and registration number: Hamburg, HRB 86891
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.29 (GNU/Linux)
iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
hTtPs://xKcd.cOm/1181/
-----END PGP SIGNATURE-----
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi!
I have been trying to gauge the speed/efficiency of a database I have set up.
In order to test it, I have filled it with a lot of Wikipedia articles from
a specific category (for example, history). The database handles multi-word
queries and returns the articles that best match the query. For example, if
I search for "history in Italy in the past 100 years", then the
best-matching articles should pop up.
I was wondering if anyone has advice on how to form sample test queries that
model realistic situations. I don't think it would be fair to use random
phrases (such as "banana the string"), so I wanted to model queries based on
my data to test both performance and correctness of the output. Does anyone
have any advice? Is this done at Wikipedia, and if so, how?
I have looked here (
http://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia…)
but the data has been unavailable for a while.
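Concretely, the kind of thing I have in mind is sampling word n-grams from
the indexed corpus itself, so test queries follow the vocabulary and phrase
statistics of my data; a minimal sketch:

```python
import random
import re

def sample_queries(documents, n_queries=5, min_len=2, max_len=5, seed=0):
    # Draw random word n-grams from the corpus so queries use real
    # vocabulary and word order rather than nonsense phrases.
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        doc = rng.choice(documents)
        words = re.findall(r"\w+", doc.lower())
        if len(words) < min_len:
            continue  # skip documents too short to sample from
        n = rng.randint(min_len, min(max_len, len(words)))
        start = rng.randint(0, len(words) - n)
        queries.append(" ".join(words[start:start + n]))
    return queries
```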
Cheers,
Hello everybody,
does anybody know a way to get the number of registered accounts for
specific dates, without any criterion on the number of edits, just the
registered accounts, preferably from publicly available data?
In case you are wondering why we want to know: We are currently gathering
statistics about the development of editor numbers and this is one of our
metrics.
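For context, once we have raw registration timestamps (e.g. the
user_registration field, which can be exported via a public Quarry query),
what we would do with them is simple bucketing by date; a minimal sketch
assuming MediaWiki's timestamp format:

```python
from collections import Counter

def accounts_per_date(timestamps):
    # timestamps: MediaWiki-format strings like "20160501123456"
    # (YYYYMMDDHHMMSS); returns a Counter keyed by "YYYY-MM-DD".
    counts = Counter()
    for ts in timestamps:
        counts[f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"] += 1
    return counts
```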
Best
Verena
--
Verena Lindner
Project Manager, Know-how
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make that happen!
http://spenden.wikimedia.de/
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
(Society for the Promotion of Free Knowledge). Registered in the register of
associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B.
Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax
number 27/029/42207.