Hello,
This is just a quick message to let you know that we made some changes
today to the monitoring configuration of many of the Data Platform
Engineering servers. This may affect you if you participate in Ops Week
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week> for Data
Engineering and friends.
By default, all notification alerts from Icinga and Prometheus will now
go to data-platform-alerts(a)wikimedia.org
<https://groups.google.com/a/wikimedia.org/g/data-platform-alerts>
instead of data-engineering-alerts(a)lists.wikimedia.org
<https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.w…>.
We are working to make sure that we can route any alert emails
(and IRC pings) to the most appropriate team, principally so that we
don't overload the person who is on Ops Week with a lot of messages that
would be more appropriately routed to Data Platform SREs.
Any scheduled tasks related to data pipelines and services critical for
data processing are still going to be sent to the
data-engineering-alerts(a)lists.wikimedia.org
<https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.w…>
list, so that's Airflow jobs, Refine tasks, Gobblin, Sqoop,
Varnishkafka, Eventlogging etc.
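As a rough illustration of the split described above, the routing can be thought of as a lookup with a default. This is only a sketch: the function name, the lower-case source labels, and the idea of keying on a single string are assumptions made for the example. The real routing lives in the Icinga and Prometheus configuration and is not shown here.

```python
# Hypothetical sketch of the routing described in this message.
# It is not the real Icinga/Prometheus configuration.

# Default destination for notification alerts from Icinga and Prometheus.
DEFAULT_LIST = "data-platform-alerts"

# Destination for alerts about data pipelines and data processing services.
PIPELINE_LIST = "data-engineering-alerts"

# Sources named in this message as staying on the data-engineering list.
# The lower-case labels are assumptions made for this example.
PIPELINE_SOURCES = {
    "airflow",
    "refine",
    "gobblin",
    "sqoop",
    "varnishkafka",
    "eventlogging",
}


def route_alert(source: str) -> str:
    """Return the name of the list that should receive an alert."""
    if source.strip().lower() in PIPELINE_SOURCES:
        return PIPELINE_LIST
    return DEFAULT_LIST


print(route_alert("Airflow"))
print(route_alert("disk-space"))
```

Anything not explicitly listed as a pipeline source falls through to the default list, which mirrors the "by default" wording of the announcement.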
We haven't made any changes to the monitoring/notification settings of
the Search and Query Services servers (Elasticsearch/WDQS/WCQS etc) nor
have we made any changes to the Dumps servers. This mainly affects the
analytics systems
<https://wikitech.wikimedia.org/wiki/Analytics/Systems> and the rest of
the Data Engineering team's infrastructure.
Please do let us know if you have any queries or concerns about this
change, or if anything doesn't look right to you.
You can reach us on Slack at #data-engineering-collab or
#data-platform-sre, on IRC at #wikimedia-analytics or
#wikimedia-data-platform, or by email at
data-platform-engineering(a)wikimedia.org.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
Please update to *wmfdata version 2.3.0* and/or *update your
conda-analytics* environments.
I have just pushed a new version of our conda-analytics
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>
environment to production, and I encourage you to start using it as soon
as possible. The only change from the previous version is an
important bump of the wmfdata-python
<https://github.com/wikimedia/wmfdata-python> library to version 2.3.0,
which allows wmfdata to talk to presto using a DNS alias, instead of a
hard-coded hostname.
If you create a new clone of conda-analytics
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Creating…>
you will automatically get this new version of wmfdata, but if that's
inconvenient you can always update the version within your existing
environments. The instructions for doing that are here
<https://github.com/wikimedia/wmfdata-python?tab=readme-ov-file#installation…>.
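If you are unsure which version an existing environment has, a quick check along these lines may help. It is a minimal sketch, assuming the package is installed under the distribution name "wmfdata" and uses three-part numeric release numbers. The helper names are made up for this example and are not part of wmfdata.

```python
# Sketch: check whether the wmfdata in the current environment is new enough.
# Assumption: the package is installed under the distribution name "wmfdata".
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

MINIMUM = "2.3.0"


def parse(release: str) -> tuple:
    """Turn a release string such as '2.3.0' into (2, 3, 0).

    Only the first three numeric components are used, so suffixes
    such as '.dev1' are ignored.
    """
    return tuple(int(part) for part in re.findall(r"\d+", release)[:3])


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """Compare two release strings numerically, component by component."""
    return parse(installed) >= parse(minimum)


def installed_wmfdata() -> Optional[str]:
    """Return the installed wmfdata version, or None if it is not installed."""
    try:
        return version("wmfdata")
    except PackageNotFoundError:
        return None


found = installed_wmfdata()
if found is None:
    print("wmfdata is not installed in this environment")
elif meets_minimum(found):
    print(f"wmfdata {found} is recent enough")
else:
    print(f"wmfdata {found} is older than {MINIMUM}; please update")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "2.10.0" below "2.3.0".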
Once you have all had enough time to update your environments, we will
be able to make a change to the presto configuration that will break
presto support for older versions of wmfdata.
If you have any questions or concerns about this change, or if you
notice anything peculiar with conda-analytics, please don't hesitate to
let us know and we will look into it.
Kind regards,
Ben
Hello,
We need to carry out some scheduled maintenance on the web server behind
the following services:
* analytics.wikimedia.org/published
<https://analytics.wikimedia.org/published/>
* stats.wikimedia.org <https://stats.wikimedia.org>
This means that we need to schedule a period of downtime for these
services, of up to around 30 minutes.
I'd like to schedule this for next Tuesday morning, the 6th of February,
starting at 10:30 UTC.
Please do let me know if this will be inconvenient for you at all and I
will postpone the upgrade.
Kind regards,
Ben
*TL;DR* - Please test https://superset-next.wikimedia.org and let us
know of any problems. Thanks.
Hello,
This message is specifically addressed to any users of
superset.wikimedia.org <https://superset.wikimedia.org>.
We in the Data Platform SRE team would be grateful for your assistance
in the acceptance testing phase of an upgrade to Superset, please. Our
production Superset instance is currently running version *1.5.3*, but
superset-next.wikimedia.org <https://superset-next.wikimedia.org> has
now been upgraded to version *3.1.0* and is ready for testing. Its
database was copied from the production instance yesterday, so it is
relatively fresh.
If you could spend a little time reviewing whether your dashboards,
charts, datasets, and SQL queries etc. work properly, that would be
really helpful. There are lots of changes between the 1.5 and 3.1
releases, so please feel free to read through the following release
notes, where the highlights are listed.
* release-notes-2-0
<https://github.com/apache/superset/blob/master/RELEASING/release-notes-2-0/…>
* release-notes-3-1
<https://github.com/apache/superset/blob/master/RELEASING/release-notes-3-1/…>
One particular point of note in the latest upgrade is that a viz
migrations
<https://github.com/apache/superset/blob/master/RELEASING/release-notes-3-1/…>
CLI tool has been added, which can help migrate legacy (Area, Bubble,
Line, and Sunburst) chart types to the newer ECharts based versions.
Please let us know if this tool would be of interest to you and we can
look at running it on your behalf.
Once we assess feedback from users of superset-next, we will be able to
schedule a date for the upgrade of the production instance. All things
being well, we would hope to do this upgrade within *a week or two*.
Feel free to share any feedback or queries about this upgrade in the
#data-engineering-collab Slack channel, or the #wikimedia-analytics IRC
channel, or any of the mailing lists where you read this, or simply by
reply if you prefer.
Kind regards,
Ben
Hi all,
Wiki Workshop 2024 (now in its 11th edition) will take place as a
standalone virtual event on June 20, 2024. For more information, see the
workshop website: https://wikiworkshop.org/2024/
The call for papers is now open:
https://wikiworkshop.org/2024/call-for-papers
The call is for extended abstracts (2 pages) of ongoing or completed work.
The deadline is April 22. The submissions are non-archival, which means you
can submit work that is already published as well!
If you have questions about the workshop, please let us know on this list
or at wikiworkshop(a)googlegroups.com.
On behalf of the organizing committee,
Pablo Aragón, Wikimedia Foundation
Pablo Beytía, Catholic University of Chile
Martin Gerlach, Wikimedia Foundation
Kinneret Gordon, Wikimedia Foundation
Robert West, EPFL
Leila Zia, Wikimedia Foundation
----
We invite contributions to the research track of the 11th edition of Wiki
Workshop, which will take place virtually on June 20, 2024 (tentatively
12:00-19:00 UTC) as a standalone event.
The Wiki Workshop is the largest Wikimedia research event of the year,
aimed at bringing together researchers who study all aspects of Wikimedia
projects (including, but not limited to, Wikipedia, Wikidata, Wikimedia
Commons, Wikisource, and Wiktionary) as well as Wikimedia developers,
affiliate organizations, and volunteer editors. Co-organized by the
Wikimedia Foundation’s Research team and members of the Wikimedia research
community, the workshop provides a direct pathway for exchanging ideas
between the organizations that serve Wikimedia projects and the researchers
actively studying them.
Building on the successful experiences of organizing Wiki Workshop in 2015
<https://wikiworkshop.org/2015/>, 2016 <https://wikiworkshop.org/2016/>,
2017 <https://wikiworkshop.org/2017/>, 2018 <https://wikiworkshop.org/2018/>,
2019 <https://wikiworkshop.org/2019/>, 2020 <https://wikiworkshop.org/2020/>,
2021 <https://wikiworkshop.org/2021/>, 2022 <https://wikiworkshop.org/2022/>,
2023 <https://wikiworkshop.org/2023/>, and based on feedback from authors
and participants over the years, this year’s research track is organized as
follows:
- Submissions are non-archival, meaning we welcome ongoing, completed, and
  already published work.
- We accept submissions in the form of 2-page extended abstracts.
- Authors of accepted abstracts will be invited to present their research
  in a pre-recorded oral presentation with dedicated time for live Q&A on
  June 20, 2024.
- Accepted abstracts will be shared on the website prior to the event.
Topics include, but are not limited to:
- new technologies and initiatives to grow content, quality, equity,
  diversity, and participation across Wikimedia projects;
- use of bots, algorithms, and crowdsourcing strategies to curate, source,
  or verify content and structured data;
- bias in content and gaps of knowledge on Wikimedia projects;
- relation between Wikimedia projects and the broader (open) knowledge
  ecosystem;
- exploration of what constitutes a source and how/if the incorporation of
  other kinds of sources is possible (e.g., oral histories, video);
- detection of low-quality, promotional, or fake content (misinformation
  or disinformation), as well as fake accounts (e.g., sock puppets);
- questions related to community health (e.g., sentiment analysis,
  harassment detection, tools that could increase harmony);
- motivations, engagement models, incentives, and needs of editors,
  readers, and/or developers of Wikimedia projects;
- innovative uses of Wikipedia and other Wikimedia projects for AI and NLP
  applications and vice versa;
- consensus-finding and conflict resolution on editorial issues;
- dynamics of content reuse across projects and the impact of policies and
  community norms on reuse;
- privacy, security, and trust;
- collaborative content creation;
- innovative uses of Wikimedia projects' content and consumption patterns
  as sensors for real-world events, culture, etc.;
- open-source research code, datasets, and tools to support research on
  Wikimedia contents and communities;
- connections between Wikimedia projects and the Semantic Web;
- strategies for how to incorporate Wikimedia projects into media literacy
  interventions.
Important dates and timeline:
- Submission deadline: April 22, 2024 (23:59 AoE
  <https://en.wikipedia.org/wiki/Anywhere_on_Earth>)
- Author notification: May 27, 2024
- Final version due: June 10, 2024 (23:59 AoE)
- Workshop date: June 20, 2024
Submission instructions:
https://wikiworkshop.org/2024/call-for-papers#submission
--
Martin Gerlach (he/him) | Senior Research Scientist | Wikimedia Foundation
Hi all,
I'm new here and working on environmental science using digital footprints. I want to use Wikipedia search analytics for my research. I'm looking for a way to retrieve location data for users searching for specific pages, something like: for January-December 2023, query = 'Electric cars', location.
Is there a way to retrieve such data? What permissions do I need, and how do I get them?
Hi all,
The next Research Showcase will be live-streamed on Wednesday, January 17,
at 9:30 AM PST / 17:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1705512600>. The theme for this showcase is
*Connecting Action with Policy*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=UUuC6Q1SIoM. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
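If you would rather work out your local start time yourself, the Unix timestamp in the zonestamp link above can be converted directly. This is a small sketch using only the Python standard library. The two zone names in the loop are arbitrary examples, not part of the announcement, so substitute your own IANA zone name.

```python
# Sketch: convert the showcase start time into other time zones.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Unix timestamp taken from the zonestamp link in this message.
SHOWCASE_START = 1705512600

start_utc = datetime.fromtimestamp(SHOWCASE_START, tz=timezone.utc)
print(start_utc.strftime("%A %d %B %Y, %H:%M %Z"))
# prints "Wednesday 17 January 2024, 17:30 UTC"

# Example zones only; replace them with your own IANA zone name.
for zone in ("America/Los_Angeles", "Europe/Berlin"):
    try:
        local = start_utc.astimezone(ZoneInfo(zone))
    except ZoneInfoNotFoundError:
        # Raised when the system has no time zone database installed.
        print(f"{zone}: time zone data not available")
        continue
    print(zone, local.strftime("%A %d %B %Y, %H:%M %Z"))
```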
This month's presentations:
Presenting the report "Unreliable Guidelines"
By *Amber Berson and Monika Sengul-Jones*
The goal behind the report Unreliable Guidelines: Reliable Sources
and Marginalized Communities in French, English and Spanish Wikipedias was
to understand the effects of the set of reliable source guidelines and
rules on the participation of and the content about marginalized
communities on three Wikipedias. Two years following the release of their
report, researchers Berson and Sengul-Jones reflect on the impact of their
research as well as the actionable next steps.
Why Should This Article Be Deleted? Transparent Stance Detection in
Multilingual Wikipedia Editor Discussions
By *Lucie-Aimée Kaffee and Arnav Arora*
The moderation of content
on online platforms is usually non-transparent. On Wikipedia, however, this
discussion is carried out publicly and the editors are encouraged to use
the content moderation policies as explanations for making moderation
decisions. However, currently only a few comments explicitly mention those
policies. To aid in this process of understanding how content is moderated,
we construct a novel multilingual dataset of Wikipedia editor discussions
along with their reasoning in three languages. We demonstrate that stance
and corresponding reason (policy) can be predicted jointly with a high
degree of accuracy, adding transparency to the decision-making process.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
If you use wmfdata-python <https://github.com/wikimedia/wmfdata-python>,
please *upgrade it soon* to version 2.2.0 in order to allow its presto
support to keep working.
We have just deployed a new version of our conda-analytics
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>
environment, which has this new version of wmfdata installed by default,
so you can use conda-analytics-clone to make a new, custom environment
for yourself. Alternatively, you can update it within your existing
environments with:
pip install --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release
This upgrade is necessary because we are in the process of improving our
Presto <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto>
services, for which we need to change the TLS certificates that secure
our client connections to the Presto cluster. Versions of wmfdata-python
prior to v2.2.0 were hard-coded
<https://github.com/wikimedia/wmfdata-python/commit/b7b5df4651c880ad6fc0980c…>
to use our Puppet based Certificate Authority.
We will implement a change
<https://gerrit.wikimedia.org/r/c/operations/puppet/+/709713> to the
Presto configuration to switch the certificates *around mid-January
2024*, at which point any versions of wmfdata-python prior to 2.2.0 will
cease to connect to Presto and will return an error. I will send further
updates nearer the time, with more precise dates.
Please do let me know if you have any queries or concerns about this change.
Kind regards,
Ben
*Scheduled downtime for Hadoop - Monday Jan 15th - 10:00 until 12:00 UTC*
Hello,
We need to perform some maintenance on our primary Hadoop cluster, which
will require a period of *downtime*. This work is scheduled for *Monday
Jan 15th - 10:00 until 12:00 UTC* - which is a US holiday for WMF and
also Wikipedia Day <https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Day>.
This 2 hour maintenance window has been chosen in the hope of minimising
disruption for you, whilst the cluster and the various tools that depend
upon it, such as Superset and JupyterLab, are largely unavailable.
The work being undertaken is a replacement of the Hadoop nameserver
hosts <https://phabricator.wikimedia.org/T332573> which, unfortunately,
requires a full cluster restart. We will be disabling ingestion to HDFS,
pausing Airflow DAGs on all instances, and stopping production data
processing pipelines, prior to the work, then re-enabling them all
afterwards. We are not expecting any gaps in data, once the pipelines
have caught up again.
If you have any queries or concerns about this work, or the time or date
is particularly inconvenient for you, please don't hesitate to let us
know, so that we can look to reschedule.
Kind regards,
Ben
Hello,
I am planning to move the MariaDB database at:
*staging-db-analytics.eqiad.wmnet* to a new host tomorrow, as part of
T351924 <https://phabricator.wikimedia.org/T351924>.
If you have never used the command: *analytics-mysql staging* from a
stat host, or accessed the *mysql_staging* database from
superset.wikimedia.org <https://superset.wikimedia.org>, then this work
is unlikely to affect you. If you do use this database, please note that
it will be unavailable for a period of up to 1 hour, tomorrow from
around 11:00 UTC, whilst it is moved to a new host.
As ever, please let me know if this work will cause you any
inconvenience and I will look to reschedule it.
Kind regards,
Ben