Hi everyone,
We are delighted to announce that Wiki Workshop 2020 will be held in
Taipei on April 20 or 21, 2020 (the exact date will be finalized soon) as
part of the Web Conference 2020 [1]. In past years, Wiki Workshop
has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San
Francisco.
You can read more about the call for papers and the workshop at
http://wikiworkshop.org/2020/#call. Please note that the deadline for
submissions to be considered for the proceedings is January 17. All
other submissions should be received by February 21.
If you have questions about the workshop, please let us know on this
list or at wikiworkshop(a)googlegroups.com.
Looking forward to seeing you in Taipei.
Best,
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation
[1] https://www2020.thewebconf.org/
Hello everyone - apologies for cross-posting! *TL;DR*: We would like your
feedback on our Metrics Kit project. Please have a look and comment on
Meta-Wiki:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The Wikimedia Foundation's Trust and Safety team, in collaboration with the
Community Health Initiative, is working on a Metrics Kit designed to
measure the relative "health"[1] of various communities that make up the
Wikimedia movement:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The ultimate outcome will be a public suite of statistics and data looking
at various aspects of Wikimedia project communities. This could be used both
by community members, to make decisions about their community's direction,
and by Wikimedia Foundation staff, to point anti-harassment tool development
in the right direction.
We have a set of metrics we are thinking about including in the kit,
ranging from the ratio of active users to active administrators and
administrator confidence levels to off-wiki factors such as freedom to
participate. It's ambitious, and our methods of collecting such data will
vary.
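As an illustration of the first example, the active-users-to-administrators
ratio for a single wiki can be pulled from the MediaWiki Action API's
siteinfo statistics, roughly as sketched below in Python. Which counts and
definitions the kit will actually use is still open, so treat the field
choices here purely as an illustration.

    import requests

    # Sketch: ratio of active users to administrators for one wiki, using
    # the Action API's siteinfo statistics. Which counts the Metrics Kit
    # will actually use is still an open question.
    API = "https://en.wikipedia.org/w/api.php"

    resp = requests.get(API, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }, headers={"User-Agent": "metrics-kit-sketch/0.1 (example)"})
    stats = resp.json()["query"]["statistics"]

    active_users = stats["activeusers"]  # users active in the last 30 days
    admins = stats["admins"]             # users in the sysop group
    print(f"Active users per administrator: {active_users / admins:.1f}")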
Right now, we'd like to know:
* Which metrics make sense to collect? Which don't? What are we missing?
* Where would such a tool ideally be hosted? Where would you normally look
for statistics like these?
* We are aware of the overlap in scope between this and Wikistats <
https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
coexist?
Your opinions will help to guide this project going forward. We'll be
reaching out at different stages of the project, so if you'd like to receive
direct messages as it progresses, please feel free to indicate your
interest by signing up on the consultation page.
Looking forward to reading your thoughts.
best,
Joe
P.S.: Please feel free to CC me in conversations that might happen on this
list!
[1] What do we mean by "health"? There is no standard definition of what
makes a Wikimedia community "healthy", but there are many indicators that
highlight where a wiki is doing well and where it could improve. This
project aims to provide a variety of useful data points to inform
community decisions that would benefit from objective data.
--
*Joe Sutherland* (he/him or they/them)
Trust and Safety Specialist
Wikimedia Foundation
joesutherland.rocks
Hi everybody!
I created the following doc:
https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_No…
It contains two FAQs:
- How do I ensure that there is enough space on disk before storing big
datasets/files?
- How do I check the space used by my files/data on stat/notebook hosts?
Please read them and let me know if anything is unclear or missing. We
have plenty of space on stat100X hosts, but we tend to cluster on single
machines like stat1007 for some reason, ending up fighting for resources.
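In case it's useful, here is a small Python equivalent of those two checks
(just a sketch; the wikitech page above is the authoritative reference, and
plain df/du on the host do the same job):

    import shutil
    from pathlib import Path

    # Quick checks before writing a big dataset on a stat/notebook host.
    home = Path.home()

    # Free space on the filesystem holding your home directory.
    usage = shutil.disk_usage(home)
    print(f"Free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")

    # Rough size of what you already store under your home directory
    # (this walks every file, so it can take a while on large trees).
    my_bytes = sum(f.stat().st_size for f in home.rglob("*") if f.is_file())
    print(f"My files: {my_bytes / 2**30:.1f} GiB")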
On a related note, we are going to work on unifying stat/notebook puppet
configs in https://phabricator.wikimedia.org/T243934, so eventually all
Analytics clients will be exactly the same.
Thanks!
Luca (on behalf of the Analytics team)
Hey there!
I was running SQL queries via PySpark (using the wmfdata package
<https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on
SWAP when one of my queries failed with "java.lang.OutOfMemoryError: Java
heap space".
After that, when I tried to call the spark.sql function again (via
wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
call methods on a stopped SparkContext."
When I tried to create a new Spark session using
SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
or directly), it returned a SparkSession object properly, but calling the
object's sql function still gave the "stopped SparkContext" error.
Any idea what's going on? I assume restarting the notebook kernel would
take care of the problem, but it seems like there has to be a better way to
recover.
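One thing I haven't tried yet is explicitly stopping the stale session
before rebuilding it, roughly as below; I'm not sure it reliably clears the
stopped context on the PySpark version we run, hence the question:

    from pyspark.sql import SparkSession

    # After the OutOfMemoryError the underlying SparkContext is stopped, but
    # builder.getOrCreate() can still hand back the old, dead session.
    # Explicitly stopping it first should clear the cached singleton so that
    # the builder creates a new one (behavior may vary by PySpark version).
    stale = SparkSession.builder.getOrCreate()
    stale.stop()

    spark = (
        SparkSession.builder
        .appName("recovered-session")
        # Bumping driver memory is an assumption about what caused the OOM.
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )
    spark.sql("SELECT 1").show()  # sanity check that the new context works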
Thank you!
Hello colleagues,
I'm forwarding this announcement to additional email lists.
Most public WMF meetings that are livestreamed on YouTube remain
available for replay after the meeting, and I'm guessing that this one
will be as well.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Srishti Sethi <ssethi(a)wikimedia.org>
Date: Mon, Feb 24, 2020 at 8:59 PM
Subject: Re: [Wikitech-l] [Wikimedia Technical Talks] Data and
Decision Science at Wikimedia with Kate Zimmerman, 26 February 2020 @
6PM UTC
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello folks,
Just a reminder that this talk will take place Wednesday 26 February 2020
at 6 PM UTC.
Hope to see you there!
Cheers,
Srishti
*Srishti Sethi*
Developer Advocate
Wikimedia Foundation <https://wikimediafoundation.org/>
On Tue, Feb 18, 2020 at 2:57 PM Sarah R <srodlund(a)wikimedia.org> wrote:
> Hello Everyone,
>
> It's time for Wikimedia Tech Talks 2020 Episode 1! This talk will take
> place on *26 February 2020 at 6 PM UTC*.
>
> This month's talk will be in an interview format. You are invited to send
> questions ahead of time by replying to this email, or you can ask during the
> Q&A section of the live talk through IRC or the YouTube livestream.
>
> Title: Data and Decision Science at Wikimedia
>
> Speaker: Kate Zimmerman, Head of Product Analytics at Wikimedia
>
> Summary:
>
> How do teams at the Foundation use data to inform decisions?
>
> Sarah R. Rodlund talks with Kate Zimmerman, Head of Product Analytics at
> Wikimedia, about what sorts of data her team uses and how insights from
> their analysis have shaped product decisions.
>
> Kate Zimmerman holds an MS in Psychology & Behavioral Decision Research
> from Carnegie Mellon University and has over 15 years of experience in
> quantitative and experimental methods. Before joining Wikimedia, she built
> data teams from scratch at ModCloth and SmugMug, evolving their data
> capabilities from basic reports to strategic analysis, automated
> dashboards, and advanced modeling.
>
> The link to the YouTube livestream can be found here:
> https://www.youtube.com/watch?v=J-CRsiwYM9w
>
> During the live talk, you are invited to join the discussion on IRC at
> #wikimedia-office
>
> You can watch past Tech Talks here:
> https://www.mediawiki.org/wiki/Tech_talks
>
> If you are interested in giving your own tech talk, you can learn more
> here:
>
> https://www.mediawiki.org/wiki/Project:Calendar/How_to_schedule_an_event#Te…
>
> Note: This is a public talk. Feel free to distribute through appropriate
> email and social channels!
>
> Many kindnesses,
>
> Sarah R. Rodlund
> Technical Writer, Developer Advocacy
> srodlund(a)wikimedia.org
Hi all,
join us for our monthly Analytics/Research Office hours on 2020-02-26 at
17:00-18:00 (UTC). Bring all your research questions and ideas to discuss
projects, data, analysis, etc.
To participate, please join the IRC channel: #wikimedia-research [1].
More detailed information can be found here [2] or on the etherpad [3] if
you would like to add items to the agenda or check notes from previous meetings.
Best,
Martin
[1] irc://chat.freenode.net:6667/wikimedia-research
[2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
If your team uses mw.user.sessionId() for instrumentation, a recent change
to MediaWiki could impact your numbers.
The new patch <https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/572011/>
changes
the way that session IDs work, bringing their behavior closer to other
platforms that many of us are familiar with.
The value returned from mw.user.sessionId() will now:
- be the same in different tabs of the same browser process
- be the same in different windows of the same browser process
- be forgotten once the browser process ends
Since 2017, values returned from mw.user.sessionId() have only been
constant within the same browser tab, and only lasted until the tab was
closed. This had gone unnoticed until recently. See T223931
<https://phabricator.wikimedia.org/T223931> for more details. This patch
restores pre-2017 behavior.
If you have any questions about the change, or if you notice any
irregularities in your data or instrumentation, reach out or tag jlinehan,
mpopov, or the Better Use of Data topic on Phabricator.
-Jason
Hi all,
The next Research Showcase will be live-streamed on Wednesday, February 19,
at 9:30 AM PST/17:30 UTC. We’ll have presentations from Jeffrey V.
Nickerson on human/machine collaboration on Wikipedia, and Lucie-Aimée
Kaffee on human/machine collaboration on Wikidata. A question-and-answer
session will follow.
YouTube stream: https://www.youtube.com/watch?v=fj0z20PuGIk
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Autonomous tools and the design of work
By Jeffrey V. Nickerson, Stevens Institute of Technology
Bots and other software tools that exhibit autonomy can appear in an
organization to be more like employees than commodities. As a result,
humans delegate to machines. Sometimes the machines turn and delegate part
of the work back to humans. This talk will discuss how the design of human
work is changing, drawing on a recent study of editors and bots in
Wikipedia, as well as a study of game and chip designers. The Wikipedia bot
ecosystem, and how bots evolve, will be discussed. Humans are working
together with machines in complex configurations; this puts constraints on
not only the machines but also the humans. Both software and human skills
change as a result. Paper
<https://dl.acm.org/doi/pdf/10.1145/3359317?download=true>
When Humans and Machines Collaborate: Cross-lingual Label Editing in
Wikidata
By Lucie-Aimée Kaffee, University of Southampton
The quality and maintainability of any knowledge graph are strongly
influenced by the way it is created. In the case of Wikidata, the knowledge
graph is created and maintained by a hybrid approach of human editing
supported by automated tools. We analyse the editing of natural language
data, i.e. labels. Labels are the entry point for humans to understand the
information, and therefore need to be carefully maintained. Wikidata is a
good example of a hybrid multilingual knowledge graph, as it has a large
and active community of humans and bots working together, covering over 300
languages. In this work, we analyse the different editor groups and how
they interact with the different language data to understand the provenance
of the current label data. This presentation is based on the paper “When
Humans and Machines Collaborate: Cross-lingual Label Editing in Wikidata”,
published at OpenSym 2019 in collaboration with Kemele M. Endris and Elena
Simperl. Paper
<https://opensym.org/wp-content/uploads/2019/08/os19-paper-A16-kaffee.pdf>
--
Janna Layton (she, her)
Administrative Assistant - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi Giovanni,
The pagelinks table is great for temporal snapshots: you know about links
between pages at the time of the query. Parsing the wikitext is needed to
provide a historical view of the links :)
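To make the trade-off concrete, a current snapshot is a single query against
the replicas, roughly along these lines in Python (the pagelinks and page
columns are the standard MediaWiki schema, but the replica host name here is
an assumption; adjust it to wherever you normally connect from):

    import os
    import pymysql

    # Sketch: current outgoing links for one article, from the live
    # pagelinks table. This is a snapshot as of query time only.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
        database="enwiki_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
    )

    query = """
        SELECT pl_namespace, pl_title
        FROM pagelinks
        JOIN page ON pl_from = page_id
        WHERE page_namespace = 0
          AND page_title = 'Wikipedia'
    """
    with conn.cursor() as cur:
        cur.execute(query)
        links = cur.fetchall()
    print(f"{len(links)} current outgoing links")

The historical view, by contrast, needs the wikitext of every revision
parsed, which is why it is not available yet.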
Cheers
Joseph
On Tue, Feb 18, 2020 at 12:22 AM Giovanni Luca Ciampaglia <glc3(a)mail.usf.edu>
wrote:
> Thank you Joseph; great to hear there is interest in building such a
> dataset. You say that the link information would need to be parsed from
> wikitext, which is complicated; would the pagelinks table help as an
> alternative source of data?
>
> *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> Assistant Professor
> Computer Science and Engineering
> <https://www.usf.edu/engineering/cse/> ∙ University
> of South Florida <https://www.usf.edu/>
>
> *Due to Florida’s broad open records law, email to or from university
> employees is public record, available to the public and the media upon
> request.*
>
>
> On Thu, Feb 13, 2020 at 9:27 AM Joseph Allemandou <
> jallemandou(a)wikimedia.org>
> wrote:
>
> > Hi Giovanni,
> > Thank you for your message :)
> > You are correct: there is no information on page-to-page links as of
> > today, and also no information on, for instance, historical values of
> > revisions being redirects.
> > We agree that such information is extremely valuable, and
> > we intend to extract it at some point.
> > The reason it has not yet been done is that those pieces
> > of information are only available by parsing the wikitext of every
> > revision, which is not only resource-intensive but also complicated
> > technically (templates, version changes, etc.).
> > You can be sure we will send another announcement when we release that
> > data :)
> > Best,
> >
> > On Tue, Feb 11, 2020 at 10:30 PM Giovanni Luca Ciampaglia <
> > glc3(a)mail.usf.edu>
> > wrote:
> >
> > > Hi Joseph,
> > >
> > > Thanks a lot for creating and sharing such a valuable resource. I went
> > > through the schema and from what I understand there is no information
> > about
> > > page-to-page links, correct? Are there any resources that would provide
> > > such historical data?
> > >
> > > Best,
> > >
> > > *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> > > Assistant Professor
> > > Computer Science and Engineering
> > > <https://www.usf.edu/engineering/cse/> ∙ University
> > > of South Florida <https://www.usf.edu/>
> > >
> > > *Due to Florida’s broad open records law, email to or from university
> > > employees is public record, available to the public and the media upon
> > > request.*
> > >
> > >
> > > On Mon, Feb 10, 2020 at 11:28 AM Joseph Allemandou <
> > > jallemandou(a)wikimedia.org> wrote:
> > >
> > > > Hi Analytics People,
> > > >
> > > > The Wikimedia Analytics Team is pleased to announce the release of
> the
> > > most
> > > > complete dataset we have to date to analyze content and contributors
> > > > metadata: Mediawiki History [1] [2].
> > > >
> > > > Data is in TSV format, released monthly around the 3rd of the month
> > > > usually, and every new release contains the full history of metadata.
> > > >
> > > > The dataset contains an enhanced [3] and historified [4] version of
> > > > user, page, and revision metadata, and serves as the base for the
> > > > Wikistats API on edits, users, and pages [5] [6].
> > > >
> > > > We hope you will have as much fun playing with the data as we have
> > > building
> > > > it, and we're eager to hear from you [7], whether for issues, ideas
> or
> > > > usage of the data.
> > > >
> > > > Analytically yours,
> > > >
> > > > --
> > > > Joseph Allemandou (joal) (he / him)
> > > > Sr Data Engineer
> > > > Wikimedia Foundation
> > > >
> > > > [1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
> > > > [2]
> > > >
> > > >
> > >
> >
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
> > > > [3] Many pre-computed fields are present in the dataset, from
> > edit-counts
> > > > by user and page to reverts and reverted information, as well as time
> > > > between events.
> > > > [4] Historical usernames and page titles (as well as user groups and
> > > > blocks), as accurate as possible, are available in addition to current
> > > > values, and are provided in a denormalized way for every event in the
> > > > dataset.
> > > > [5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
> > > > [6] https://wikimedia.org/api/rest_v1/
> > > > [7]
> > > >
> > > >
> > >
> >
> https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20Hi…
> > --
> > Joseph Allemandou (joal) (he / him)
> > Sr Data Engineer
> > Wikimedia Foundation
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
Hi Analytics People,
The Wikimedia Analytics Team is pleased to announce the release of the most
complete dataset we have to date to analyze content and contributors
metadata: Mediawiki History [1] [2].
Data is in TSV format, released monthly (usually around the 3rd of the
month), and every new release contains the full history of metadata.
The dataset contains an enhanced [3] and historified [4] version of user,
page, and revision metadata, and serves as the base for the Wikistats API
on edits, users, and pages [5] [6].
We hope you will have as much fun playing with the data as we had building
it, and we're eager to hear from you [7], whether about issues, ideas, or
uses of the data.
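If you want to peek at a dump file right away, a minimal pandas sketch
looks roughly like this (the file name is a placeholder, and you should
check the readme [1] and the field documentation [2] for the exact file
layout and column order before relying on it):

    import csv
    import pandas as pd

    # Sketch: sample one slice of the Mediawiki History dump.
    # The file name is a placeholder; header=None assumes no header row,
    # and column names should be assigned from the schema docs as a first step.
    df = pd.read_csv(
        "2020-01.enwiki.2019.tsv.bz2",  # placeholder file name
        sep="\t",
        header=None,
        quoting=csv.QUOTE_NONE,  # assume unquoted fields; check the readme
        nrows=100_000,           # sample only; the full files are large
    )
    print(df.shape)
    print(df.head())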
Analytically yours,
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
[1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[2]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
[3] Many pre-computed fields are present in the dataset, from edit-counts
by user and page to reverts and reverted information, as well as time
between events.
[4] Historical usernames and page titles (as well as user groups and
blocks), as accurate as possible, are available in addition to current
values, and are provided in a denormalized way for every event in the
dataset.
[5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[6] https://wikimedia.org/api/rest_v1/
[7]
https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20Hi…