In her recent announcement of her upcoming departure as the Wikimedia
Foundation's CEO, Katherine highlighted 30% growth in "reader
engagement" during her tenure (i.e. since 2016). A WMF
board member since reported in somewhat more detail that this refers
to "~1 billion interactions up 32% in six years".
Are the underlying numbers published somewhere?
PS: As some may be aware, a widely read German blogger linked to
Katherine's tweet while singling out the "reader engagement" bit for
some outspoken criticism. Just to clarify, that's not why I'm asking
(in fact I disagree with most of that criticism).
We are delighted to announce that Wiki Workshop 2021 will be held
virtually in April 2021, as part of the Web Conference 2021.
The exact day is to be finalized and we know it will be between April
In the past years, Wiki Workshop has traveled to Oxford, Montreal,
Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei.
Last year, we had more than 120 participants in the workshop, and we
are particularly excited about this year's edition, as we will celebrate
the 20th birthday of Wikipedia.
We encourage contributions by all researchers who study the Wikimedia
projects. We specifically encourage 1-2 page submissions of
preliminary research. You will have the option to publish your work as
part of the proceedings of The Web Conference 2021.
You can read more about the call for papers and the workshop at
http://wikiworkshop.org/2021/#call. Please note that the deadline for
the submissions to be considered for proceedings is January 29. All
other submissions should be received by March 1.
If you have questions about the workshop, please let us know on this
list or at wikiworkshop(a)googlegroups.com.
Looking forward to seeing many of you in this year's edition.
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation
In this showcase, Prof. Danielle Bassett will present recent work studying
individual and collective curiosity as network building processes using
Date/Time: March 17, 16:30 UTC (9:30am PT / 12:30pm ET / 17:30 CET)
Speaker: Danielle Bassett (University of Pennsylvania)
Title: The curious human
Abstract: The human mind is curious. It is strange, remarkable, and
mystifying; it is eager, probing, questioning. Despite its pervasiveness
and its relevance for our well-being, scientific studies of human curiosity
that bridge both the organ of curiosity and the object of curiosity remain
in their infancy. In this talk, I will integrate historical, philosophical,
and psychological perspectives with techniques from applied mathematics and
statistical physics to study individual and collective curiosity. In the
former, I will evaluate how humans walk on the knowledge network of
Wikipedia during unconstrained browsing. In doing so, we will capture
idiosyncratic forms of curiosity that span multiple millennia, cultures,
languages, and timescales. In the latter, I will consider the fruition of
collective curiosity in the building of scientific knowledge as encoded in
Wikipedia. Throughout, I will make a case for the position that individual
and collective curiosity are both network building processes, providing a
connective counterpoint to the common acquisitional account of curiosity in
Hunters, busybodies, and the knowledge network building associated with
The network structure of scientific revolutions.
Janna Layton (she/her)
Administrative Associate - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Join the Research Team at the Wikimedia Foundation for their monthly
office hours on 2021-03-16 at 16:00-17:00 UTC (9am PT/5pm CET).
To participate, join the video call via this link. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
(you can do this after you join the meeting, too); otherwise, you are
welcome to just hang out. More detailed information (e.g. about how to
attend) can be found here.
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives. Some example cases we hope to be able to
support you in:
You have a specific research-related question that you suspect could be
answered with publicly available data, but you don't know how to find the
answer, or you just need some more help with it. For example: how can I
compute the ratio of anonymous to registered editors on my wiki?
You run into repetitive or very manual work as part of your Wikimedia
contributions and wish to find out whether machines could help improve
your workflows. These conversations can sometimes be harder to resolve
within an office hour; however, discussing them helps us understand your
challenges better, and we may find ways to work with each other to
address them in the future.
You want to learn what the Research team at the Wikimedia Foundation
does and how we can potentially support you. Specifically for affiliates:
if you are interested in building relationships with the academic
institutions in your country, we would love to talk with you and learn
more. We have a series of programs that aim to expand the network of
Wikimedia researchers globally, and we would love to collaborate more
closely with those of you interested in this space.
You want to talk with us about one of our existing programs.
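The first example question above can be sketched in code. This is an illustrative sketch, assuming the public Wikimedia AQS "editors" endpoint; the path segments (`all-page-types`, `all-activity-levels`, etc.) follow the AQS documentation and may need adjusting:

```python
# Sketch: ratio of anonymous to registered editors for a wiki, using the
# public Wikimedia AQS editors endpoint (assumed path layout).
import json
from urllib.request import urlopen

AQS = "https://wikimedia.org/api/rest_v1/metrics/editors/aggregate"

def editors_url(project, editor_type, start, end):
    """URL for monthly counts of one editor type ('anonymous' or 'user')."""
    return (f"{AQS}/{project}/{editor_type}/all-page-types/"
            f"all-activity-levels/monthly/{start}/{end}")

def total_editors(payload):
    """Sum the per-month 'editors' counts in an AQS JSON response."""
    return sum(point["editors"]
               for item in payload["items"]
               for point in item["results"])

def anon_to_registered_ratio(anon_total, registered_total):
    """Ratio of anonymous to registered editors over the same period."""
    return anon_total / registered_total if registered_total else float("nan")
```

One would fetch each payload with something like `json.load(urlopen(editors_url("fr.wikipedia", "anonymous", "20200101", "20201231")))`, then divide the two totals; treat the exact response shape as an assumption to verify against the API docs.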
Hope to see many of you,
Martin (WMF Research Team)
We are currently working on a Wikipedia visualisation tool (presented here: http://www.wikimaps.io/). We use several pageview statistics to generate time series for each page from 2008 to 2020 (pagecounts, pageviews, and pageview_complete). This last format is great for our work compared to the previous ones, and we use it for our data from 2016 to 2020. (Thanks to the Analytics team for that.)
We aggregate redirections as one page, identified by the page_id (as it is done in the pageview_complete files).
But when we compare with the Wikimedia API, we see some small differences.
I think this problem comes from the fact that the Wikimedia API (and pageviews.toolforge.org) uses page_title to get the time series, and I saw that pageview_complete files contain entries where the page_title is missing (replaced by a "-"). Since we use page_id for the aggregation whenever possible, we aggregate these "-" entries, but pageviews.toolforge.org probably does not.
For example, for the page Barack_Obama in French and the file `pageviews-20200112-user.bz2`, I get several relevant entries:
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
That is 12 entries that use the page_id, and one that does not.
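The aggregation we do can be sketched as follows (a minimal illustration, assuming the column layout visible in the rows above: domain, page_title, page_id, access method, daily count, hourly string, with the page_id column absent from the last row):

```python
# Sketch: sum daily counts by page_id when the row has one, falling back to
# page_title for rows that omit the page_id column.
from collections import defaultdict

def aggregate(rows):
    totals = defaultdict(int)
    for row in rows:
        fields = row.split()
        if len(fields) == 6:  # domain title page_id access daily_count hourly
            domain, _title, page_id, _access, count, _hourly = fields
            key = (domain, page_id)
        else:                 # domain title access daily_count hourly
            domain, title, _access, count, _hourly = fields
            key = (domain, title)
        totals[key] += int(count)
    return dict(totals)
```

On the rows above, this yields one total keyed by page_id 167398 (including the "-" row) plus a separate total for the page_id-less "Barack_Obama mobile-app" row, which is exactly the discrepancy we observe.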
I have two questions about that result.
What kind of query can cause these "-" entries?
Why does the entry "Barack_Obama mobile-app" appear twice?
Sorry for the long introduction and thank you for your time.
Forwarding to the analytics list for reference.
---------- Forwarded message ---------
From: Ho Chung <chungho4865(a)gmail.com>
Date: Mon, Mar 15, 2021 at 11:45 AM
Subject: Re: [Analytics] About: refine_webrequest.hql
To: Joseph Allemandou <jallemandou(a)wikimedia.org>
Thanks for your reply
I had been researching your Analytics team's public discussion history and
the wikitech pages about the webrequest timestamp.
I was in doubt at the time: you use Java technology, but your
Hive version did not support it before October 2018.
The wmf.webrequest file is located in HIVE.
When collecting readership data, I wondered whether the timestamp
used the reader's computer system clock or the Wikipedia
server clock when reading and browsing a page.
Now I am more clear: on your Analytics team's public discussion page,
Ottomata said that all the times are UTC.
It's just that you technicians don't unify the expression of the
timestamp format, but in fact all of them use UTC.
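For reference, `dt` values carry no zone suffix; an illustrative sketch (not from the thread, with an assumed example timestamp) of interpreting such a zone-less string as UTC, which is what the quoted Hive expression effectively does:

```python
# Parse a zone-less ISO-8601 `dt` string and treat it as UTC.
from datetime import datetime, timezone

def dt_to_epoch(dt_str):
    """e.g. dt_to_epoch('2021-03-13T22:57:00') -> seconds since the epoch."""
    naive = datetime.strptime(dt_str, "%Y-%m-%dT%H:%M:%S")
    return naive.replace(tzinfo=timezone.utc).timestamp()
```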
On Mon, Mar 15, 2021 at 16:14, Joseph Allemandou <jallemandou(a)wikimedia.org> wrote:
> the `dt` field is the time in UTC (no timezone specified) at which the
> request ends being processed by Varnish.
> On Mon, Mar 15, 2021 at 8:36 AM Luca Toscano <ltoscano(a)wikimedia.org>
>> +A mailing list for the Analytics Team at WMF and everybody who has an
>> interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>
>> I added the Analytics mailing list in Cc so other people can chime in,
>> this is the canonical way to follow up with us and the community, please
>> avoid direct email if possible :)
>> On Sat, Mar 13, 2021 at 10:57 PM Ho Chung <chungho4865(a)gmail.com> wrote:
>>> I have a question about refine_webrequest.hql.
>>> In this file, is the timestamp in UTC?
>>> Does this file connect wmf_raw.webrequest and wmf.webrequest?
>>> I ask because I can't find code that adds a Z or +/- timezone:
>>> -- Hack to get a correct timestamp because of hive inconsistent
>>> CAST(unix_timestamp(dt, "yyyy-MM-dd'T'HH:mm:ss") * 1.0 as timestamp) as
>>> I emailed wiki legal three months ago and they are not sure; can you
>>> clarify this for me?
>>> If not UTC, does it use your server clock or my computer clock?
> Joseph Allemandou (joal) (he / him)
> Staff Data Engineer
> Wikimedia Foundation
Thank you for your question.
The datasets are intended to be retained forever, as researchers may
want access to historical data. If any removal is necessary for
compliance with local and international laws, it will be primarily
handled by the Internet Archive, as they are the ones storing the data.
On Mon, 15 Mar 2021 at 13:59, colin johnston <colinj(a)gt86car.org.uk> wrote:
> Are you going to implement retention times for the datasets, and removal of data under GDPR orders when asked?
> Sent from my iPod
> > On 15 Mar 2021, at 01:59, Hydriz Scholz <hydriz(a)jorked.com> wrote:
> > Dear All,
> > I am User:Hydriz on Wikimedia wikis and I am working on a grant
> > proposal to facilitate browsing and downloading of Wikimedia datasets
> > (including the database dumps as well as other datasets). It is a
> > proposed rewrite of the existing system which focused primarily on
> > archiving the datasets to the Internet Archive. 
> > My proposal aims to modernize the software used for automatically
> > archiving datasets to the Internet Archive. More importantly, it aims
> > to put researchers and downloaders first, by providing both a
> > human-readable and a machine-readable interface for browsing and
> > downloading datasets, whether present or historical. I also intend to
> > integrate a "watchlist" feature that can automatically notify users
> > when new datasets are available.
> > Please do express your support for this proposal and help make this
> > project a reality. Thank you!
> > Warmest regards.
> > Hydriz Scholz
> > https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0
> > _______________________________________________
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l