Analytics March 2021

analytics@lists.wikimedia.org

14 participants
13 discussions

Growth in reader engagement since 2016?
by Tilman Bayer 30 Apr '21

30 Apr '21

In her recent announcement of her upcoming departure as the Wikimedia Foundation's CEO, Katherine highlighted a growth in "reader engagement" by 30% during her tenure (i.e. since 2016).[1] A WMF board member since reported in somewhat more detail that this refers to "~1 billion interactions up 32% in six years".[2] Are the underlying numbers published somewhere? Regards, Tilman [1] https://twitter.com/krmaher/status/1357390962410987520 [2] https://twitter.com/raju/status/1371100758343614471 PS: As some may be aware, a widely read German blogger linked to Katherine's tweet while singling out the "reader engagement" bit for some outspoken criticism. Just to clarify, that's not why I'm asking (in fact I disagree with most of that criticism).

3 4

[events] Wiki Workshop 2021 Announcement and Call for Papers
by Leila Zia 09 Apr '21

09 Apr '21

Hi everyone, We are delighted to announce that Wiki Workshop 2021 will be held virtually in April 2021 and as part of the Web Conference 2021 [1]. The exact day is to be finalized and we know it will be between April 19-23. In the past years, Wiki Workshop has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei. Last year, we had more than 120 participants in the workshop and we are particularly excited about this year's as we will celebrate the 20th birthday of Wikipedia. We encourage contributions by all researchers who study the Wikimedia projects. We specifically encourage 1-2 page submissions of preliminary research. You will have the option to publish your work as part of the proceedings of The Web Conference 2021. You can read more about the call for papers and the workshop at http://wikiworkshop.org/2021/#call. Please note that the deadline for the submissions to be considered for proceedings is January 29. All other submissions should be received by March 1. If you have questions about the workshop, please let us know on this list or at wikiworkshop(a)googlegroups.com. Looking forward to seeing many of you in this year's edition. Best, Miriam Redi, Wikimedia Foundation Bob West, EPFL Leila Zia, Wikimedia Foundation [1] https://www2021.thewebconf.org/

1 2

[Wikimedia Research Showcase] March 17: Curiosity
by Janna Layton 17 Mar '21

17 Mar '21

In this showcase, Prof. Danielle Bassett will present recent work studying individual and collective curiosity as network building processes using Wikipedia. Date/Time: March 17, 16:30 UTC (9:30am PT/12:30pm ET/17:30pm CET) Youtube: https://www.youtube.com/watch?v=jw2s_Y4J2tI Speaker: Danielle Bassett (University of Pennsylvania) Title: The curious human Abstract: The human mind is curious. It is strange, remarkable, and mystifying; it is eager, probing, questioning. Despite its pervasiveness and its relevance for our well-being, scientific studies of human curiosity that bridge both the organ of curiosity and the object of curiosity remain in their infancy. In this talk, I will integrate historical, philosophical, and psychological perspectives with techniques from applied mathematics and statistical physics to study individual and collective curiosity. In the former, I will evaluate how humans walk on the knowledge network of Wikipedia during unconstrained browsing. In doing so, we will capture idiosyncratic forms of curiosity that span multiple millennia, cultures, languages, and timescales. In the latter, I will consider the fruition of collective curiosity in the building of scientific knowledge as encoded in Wikipedia. Throughout, I will make a case for the position that individual and collective curiosity are both network building processes, providing a connective counterpoint to the common acquisitional account of curiosity in humans. Related papers: Hunters, busybodies, and the knowledge network building associated with curiosity. https://doi.org/10.31234/osf.io/undy4 The network structure of scientific revolutions. http://arxiv.org/abs/2010.08381 https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2021 -- Janna Layton (she/her) Administrative Associate - Product & Technology Wikimedia Foundation <https://wikimediafoundation.org/>

2 3

About Varnish NTP server time accuracy
by Ho Chung 16 Mar '21

16 Mar '21

Hello Have anyone know the Varnish ntp server clock accuracy can it be used as a forensic clock after evaluation? Do you use some third-party unapproved NTP clocks for your time? Thanks 👍

2 2

Invitation for Wikimedia Research Office hours March 16, 2021
by Martin Gerlach 16 Mar '21

16 Mar '21

Hi all, Join the Research Team at the Wikimedia Foundation [1] for their monthly Office hours on 2021-03-16 at 16:00-17:00 UTC (9am PT/5pm CET). To participate, join the video-call via this link [2]. There is no set agenda - feel free to add your item to the list of topics in the etherpad [3] (You can do this after you join the meeting, too.), otherwise you are welcome to also just hang out. More detailed information (e.g. about how to attend) can be found here [4]. Through these office hours, we aim to make ourselves more available to answer some of the research related questions that you as Wikimedia volunteer editors, organizers, affiliates, staff, and researchers face in your projects and initiatives. Some example cases we hope to be able to support you in: - You have a specific research related question that you suspect you should be able to answer with the publicly available data and you don’t know how to find an answer for it, or you just need some more help with it. For example, how can I compute the ratio of anonymous to registered editors in my wiki? - You run into repetitive or very manual work as part of your Wikimedia contributions and you wish to find out if there are ways to use machines to improve your workflows. These types of conversations can sometimes be harder to find an answer for during an office hour, however, discussing them can help us understand your challenges better and we may find ways to work with each other to support you in addressing it in the future. - You want to learn what the Research team at the Wikimedia Foundation does and how we can potentially support you. Specifically for affiliates: if you are interested in building relationships with the academic institutions in your country, we would love to talk with you and learn more. We have a series of programs that aim to expand the network of Wikimedia researchers globally and we would love to collaborate with those of you interested more closely in this space. - You want to talk with us about one of our existing programs [5]. Hope to see many of you, Martin (WMF Research Team) [1] https://research.wikimedia.org/team.html [2] https://meet.jit.si/WMF-Research-Office-Hours [3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours [4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours [5] https://research.wikimedia.org/projects.html -- Martin Gerlach Research Scientist Wikimedia Foundation

1 1

Pageview-complete entries labeled as "-"
by Ogier Maitre 16 Mar '21

16 Mar '21

Hello everybody, We are currently working on a wikipedia visualisation tool (which is presented here: http://www.wikimaps.io/). We use several pageview statistics to generate time series for each page from 2008 to 2020. (we use pagecounts, pageviews and pageview_complete). This last format is great for our work compared to previous format, and we use it for our data from 2016 to 2020. (Thank to the analytics team for that). We aggregate redirections as one page, identified by the page_id (as it is done in the pageview_complete files). But when we compare with the wikimedia API, we have some small differences. I think this problem comes from the fact that wikimedia API (and pageviews.toolforge.org) uses page_title to get the time series, and I saw that pageview_complete files contain entries where the page_title is missing (replaced by a "-"). As we are using page_id to do the aggregation whenever it is possible, we aggregate these "-" entries, but pageviews.toolforge.org probably does not. For example for the page Barack_Obama in French, and the file `pageviews-20200112-user.bz2`, I get several relevant entries. fr.wikipedia - 167398 mobile-web 1 B1 fr.wikipedia Barack 167398 mobile-web 1 X1 fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1 fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32 fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1 fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49 fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2 fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1 fr.wikipedia Obama 167398 mobile-web 2 R1V1 fr.wikipedia Obama_Barack 167398 desktop 3 N1P2 fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2 fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1 fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1 That is 12 entries that use the page_id, and one that does not. I have two questions about that result. What kind of query can cause theses "-" entries ? Why the entry "Barack_Obama mobile-app" appears two times ? Sorry for the long introduction and thank you for your time. Regards, Ogier

3 5

About readership timestamp
by Ho Chung 15 Mar '21

15 Mar '21

Hello In this page did you know when any readership visit any Chinese web page , Eg. https://zh.wikipedia.org/wiki/MP3 the timestamp is use UTC ? https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest

2 1

Fwd: About: refine_webrequest.hql
by Joseph Allemandou 15 Mar '21

15 Mar '21

Forwarding to the analytics list for reference. ---------- Forwarded message --------- From: Ho Chung <chungho4865(a)gmail.com> Date: Mon, Mar 15, 2021 at 11:45 AM Subject: Re: [Analytics] About: refine_webrequest.hql To: Joseph Allemandou <jallemandou(a)wikimedia.org> Hello Thanks for your reply Because i was research your Analytics team public discuss history and wikiteah about web request time stamp https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest https://phabricator.wikimedia.org/T212529 I have been in doubt at that time, you're used java technology, but your HIVE version did not support java before October 2018. The wmf.webrequest file is located in HIVE. When collecting the privacy data of readership , whether the time stamp used the reader's computer system clock instead of the Wikipedia computer server clock when reading and browsing the page Now I am more clear. On the public discussion page of your analysis team, said that all the time is utc by Ottomata It’s just that you technicians don’t want to unify the expression of the time stamp format, but in fact all of them use UTC 在 2021年3月15日週一 16:14，Joseph Allemandou <jallemandou(a)wikimedia.org> 寫道： > Hi, > the `dt` field is the time in UTC (no timezone specified) at which the > request ends being processed by Varnish. > Cheers > Joseph > > On Mon, Mar 15, 2021 at 8:36 AM Luca Toscano <ltoscano(a)wikimedia.org> > wrote: > >> +A mailing list for the Analytics Team at WMF and everybody who has an >> interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org> >> >> Hi! >> >> I added the Analytics mailing list in Cc so other people can chime in, >> this is the canonical way to follow up with us and the community, please >> avoid direct email if possible :) >> >> Thanks! >> >> Luca >> >> >> >> On Sat, Mar 13, 2021 at 10:57 PM Ho Chung <chungho4865(a)gmail.com> wrote: >> >>> Hello >>> >>> I have some problem request , about refine_webrequest.hql >>> >>> >>> In this file timestamp is use utc ? >>> >>> This file is it connect wmf_raw.webrequest and wmf.webrequest ? >>> >>> Because i can't read the code have add Z / +/- zone time >>> >>> >>> >>> -- Hack to get a correct timestamp because of hive inconsistent >>> conversion >>> >>> CAST(unix_timestamp(dt, "yyyy-MM-dd'T'HH:mm:ss") * 1.0 as timestamp) as >>> ts, >>> >>> >>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques… >>> >>> I emailed wiki legal request 3 month they not sure , can you clearly ask >>> me . >>> >>> If not use utc, is use your server clock or , my computer clock? >>> >>> >>> _______________________________________________ >> Analytics mailing list >> Analytics(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/analytics >> > > > -- > Joseph Allemandou (joal) (he / him) > Staff Data Engineer > Wikimedia Foundation > -- Joseph Allemandou (joal) (he / him) Staff Data Engineer Wikimedia Foundation

1 0

Re: [Analytics] About: refine_webrequest.hql
by Luca Toscano 15 Mar '21

15 Mar '21

+A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org> Hi! I added the Analytics mailing list in Cc so other people can chime in, this is the canonical way to follow up with us and the community, please avoid direct email if possible :) Thanks! Luca On Sat, Mar 13, 2021 at 10:57 PM Ho Chung <chungho4865(a)gmail.com> wrote: > Hello > > I have some problem request , about refine_webrequest.hql > > > In this file timestamp is use utc ? > > This file is it connect wmf_raw.webrequest and wmf.webrequest ? > > Because i can't read the code have add Z / +/- zone time > > > > -- Hack to get a correct timestamp because of hive inconsistent conversion > > CAST(unix_timestamp(dt, "yyyy-MM-dd'T'HH:mm:ss") * 1.0 as timestamp) as > ts, > > > https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webreques… > > I emailed wiki legal request 3 month they not sure , can you clearly ask > me . > > If not use utc, is use your server clock or , my computer clock? > > >

2 1

Re: [Analytics] [Xmldatadumps-l] Proposed Project Grant for browsing and downloading Wikimedia datasets - Balchivist 2.0
by Hydriz Scholz 15 Mar '21

15 Mar '21

Thank you for your question. The datasets are intended to be retained forever, as researchers may want access to historical data. If any removal is necessary for compliance with local and international laws, it will be primarily handled by the Internet Archive, as they are the ones storing the data. Warmest regards, Hydriz Scholz On Mon, 15 Mar 2021 at 13:59, colin johnston <colinj(a)gt86car.org.uk> wrote: > > Are you going to implement retention times for data sets and removal of data info under gdpr order when asked ? > > Sent from my iPod > > > On 15 Mar 2021, at 01:59, Hydriz Scholz <hydriz(a)jorked.com> wrote: > > > > Dear All, > > > > I am User:Hydriz on Wikimedia wikis and I am working on a grant > > proposal to facilitate browsing and downloading of Wikimedia datasets > > (including the database dumps as well as other datasets). It is a > > proposed rewrite of the existing system which focused primarily on > > archiving the datasets to the Internet Archive. [1] > > > > My proposal aims to modernize the software used for automatically > > archiving datasets to the Internet Archive. More importantly, it aims > > to put researchers and downloaders first, by providing both a > > human-readable and a machine-readable interface for browsing and > > downloading datasets, whether present or historical. I also intend to > > integrate a "watchlist" feature that can automatically notify users > > when new datasets are available. > > > > Please do express your support for this proposal and help make this > > project a reality. Thank you! > > > > Warmest regards. > > Hydriz Scholz > > > > [1]: https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0 > > > > _______________________________________________ > > Xmldatadumps-l mailing list > > Xmldatadumps-l(a)lists.wikimedia.org > > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l -- Hydriz Scholz

1 0

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics March 2021