Hi all,
The next Research Showcase will be live-streamed next Wednesday, July 24,
at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1721838600>. The theme for this showcase is
*Machine Translation on Wikipedia*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/live/O7AqvHgqUVk. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide
By *Kai Zhu, Bocconi University*
Machine translation technologies have
the potential to bridge knowledge gaps across languages, promoting more
inclusive access to information regardless of native languages. This study
examines the impact of integrating Google Translate into Wikipedia's
Content Translation system in January 2019. Employing a natural experiment
design and difference-in-differences strategy, we analyze how this
translation technology shock influenced the dynamics of content production
and accessibility on Wikipedia across over a hundred languages. We find
that this technology integration leads to a 149% increase in content
production through translation, driven by existing editors becoming more
productive as well as an expansion of the editor base. Moreover, we observe
that machine translation enhances the propagation of biographical and
geographical information, helping to close these knowledge gaps in the
multilingual context. However, our findings also underscore the need for
continued efforts to mitigate the preexisting systemic barriers. Our study
contributes to our knowledge on the evolving role of artificial
intelligence in shaping knowledge dissemination through enhanced language
translation capabilities.
Implications of Using Inorganic Content in Arabic Wikipedia Editions
By *Saied Alshahrani and Jeanna Matthews, Clarkson University*
Wikipedia articles (content pages) are among the most widely
utilized training corpora for NLP tasks and systems, yet these articles are
not always created, generated, or even edited organically by native
speakers; some are automatically created, generated, or translated using
Wikipedia bots or off-the-shelf translation tools like Google Translate
without human revision or supervision. We first analyzed the three Arabic
Wikipedia editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic
(ARY), and found that these Arabic Wikipedia editions suffer from a few
serious issues, like large-scale automatic creations and translations from
English to Arabic, all without human involvement, generating articles that
lack not only linguistic richness and diversity but also cultural richness
and meaningful representation of the Arabic language and its native
speakers. We then studied the performance
implications of using such inorganic, unrepresentative articles to train
NLP tasks or systems, where we intrinsically evaluated the performance of
two main NLP upstream tasks, namely word representation and language
modeling, using word analogy and fill-mask evaluations. We found that most
of the models trained on organic, representative content outperformed, or
at worst performed on par with, the models trained on inorganic content
generated by bots or translated using templates, demonstrating that
training on unrepresentative content not only harms the representation of
native speakers but also degrades the performance of NLP tasks and
systems. We recommend avoiding the
automatically created, generated, or translated articles on Wikipedia when
the task is a representation-based task, like measuring opinions,
sentiments, or perspectives of native speakers, and also suggest that when
registered users employ automated creation or translation, their
contributions should be marked differently than “registered user” for
better transparency; perhaps “registered user (automation-assisted)”.
Best,
Kinneret
Hi everyone,
A new documentation website for the Wikimedia Analytics API (AQS) is now
available at [0].
The Wikimedia Analytics API (or AQS - Analytics Query Service) provides
analytics data, such as page views, unique devices, and editor counts,
for Wikipedia and other Wikimedia free-knowledge projects. For
example, you can use the API to:
- Get the number of devices that visited Wikipedia in a given month
- List the most-viewed or most-edited pages on Wikipedia
- Compare the number of editors that edited Wikipedia by country
The new website [0] replaces documentation previously available on
Wikitech (example: [1]) and in the REST API reference [2]. It
includes:
- explanations of core concepts used in Wikimedia analytics
- extensive examples
- tutorials with code written in Python
- a full API reference with sandboxes, allowing you to test API
requests directly in your browser
- links to other resources
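As a rough illustration of how such a request looks, here is a minimal
Python sketch for the most-viewed-pages endpoint of the pageviews API. The
helper function name is my own; see the API reference [0] for the
authoritative endpoint documentation.

```python
# Sketch: build a URL for the Wikimedia Analytics (AQS) pageviews API.
# The endpoint layout follows the public REST API at
# wikimedia.org/api/rest_v1; the helper name is illustrative only.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def top_pages_url(project: str, year: int, month: int, day: int,
                  access: str = "all-access") -> str:
    """Build the URL listing the most-viewed pages for one day."""
    return (f"{BASE}/top/{quote(project)}/{access}/"
            f"{year:04d}/{month:02d}/{day:02d}")

url = top_pages_url("en.wikipedia", 2024, 5, 1)
print(url)
# https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2024/05/01

# To fetch the JSON response, e.g.:
#   import urllib.request, json
#   data = json.load(urllib.request.urlopen(url))
```

The sandboxes on the new site let you try the same request directly in
your browser.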
To share feedback or report an issue with the new site, create a task
in Phabricator [3].
To request changes directly in the code repository [4], create a merge
request in GitLab.
On behalf of the Technical Documentation team and the Data Products team,
Kamil Bach
[0]: https://doc.wikimedia.org/analytics-api
[1]: https://wikitech.wikimedia.org/wiki/Data_Platform/AQS/Pageviews
[2]: https://wikimedia.org/api/rest_v1/#/
[3]: https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=data…
[4]: https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/analytics-api
--
Kamil Bach
Technical Writer
Wikimedia Foundation
Hi there,
I have a question about the right way to *extract the accurate time (or
revision ID) of when an article becomes a featured article (FA) or a good
article (GA)*.
The first and most straightforward method I tried was to extract the first
time that a *{{featured article}}* tag (or a {{good article}} one) is found
in the article revision text.
However, in some cases, this yields weird results, such as an article being
an FA on the same day it was created.
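Concretely, that first method boils down to something like this sketch
(the regex is a simplification I use for illustration, and the function
name is my own):

```python
import re
from typing import Iterable, Optional, Tuple

# Matches {{featured article}} / {{good article}} tags, tolerating case
# and extra whitespace; real template usage has more variants.
FA_GA = re.compile(r"\{\{\s*(featured|good)\s+article\s*\}\}", re.IGNORECASE)

def first_tagged_revision(
    revisions: Iterable[Tuple[str, str]]
) -> Optional[str]:
    """Return the timestamp of the first revision whose wikitext
    contains an FA/GA template, or None if none does.
    `revisions` is (timestamp, wikitext) pairs in chronological order."""
    for timestamp, text in revisions:
        if FA_GA.search(text):
            return timestamp
    return None

history = [
    ("2004-01-01T00:00:00Z", "Early stub text."),
    ("2006-07-15T12:00:00Z", "Expanded article. {{Featured article}}"),
]
print(first_tagged_revision(history))  # 2006-07-15T12:00:00Z
```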
Another method I tried (without much success) was to extract this
information from the article talk page under the "Article History" section.
However, not all pages have this section, and I'm not sure how reliable
this information is (do all editors add the promotion date, or perhaps the
nomination date?).
So, I wonder if anyone can advise on the best way to extract this
information in the most accurate way. Isn't there any database table that
holds such information?
Thank you all in advance!
Abraham
--
Best,
Abraham
---------
Abraham I.
Postdoc Researcher
University of Michigan | School of Information
pronouns: he/him
abraham.com <https://www.avrahami-israeli.com/>
Good morning,
If you do not use our Archiva
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva>
artifact repository service, you may ignore this message.
Apologies for the short notice. This is just to let you know that I will
be performing some maintenance work on our Archiva server today, which
will result in some brief periods of downtime for the service. One
part of this work is a disk storage change operation and the other is
an O/S upgrade. I will try to keep the downtime of the service to a minimum.
Apologies if this instability causes you any inconvenience. Please do
feel free to let me know if this impacts your work and I will try to
help you find a workaround.
Kind regards,
Ben
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] surfaced whereby the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Due to the bug, a nontrivial number of <text> nodes representing article
text appear empty, like so:
<text bytes="123456789" />
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated as it is a compute intensive operation.
More progress will be reported at T365155 and new dumps will eventually
show up on dumps.wikimedia.org.
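For anyone who wants to check whether an already-downloaded dump is
affected, a minimal Python sketch along these lines may help (namespace
handling is simplified, and the function name is illustrative, not part
of any official tooling):

```python
# Sketch: scan a dump stream for <text> nodes that claim a byte count
# but carry no content - the signature of the bug described above.
import xml.etree.ElementTree as ET
from io import BytesIO
from typing import IO, List

def empty_text_nodes(stream: IO[bytes]) -> List[str]:
    """Return the claimed byte counts of <text> nodes that have a
    bytes attribute but no actual text content."""
    affected = []
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Dump files use a mediawiki export namespace; match on the
        # local tag name so the sketch works with or without one.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            if elem.get("bytes") and not (elem.text or "").strip():
                affected.append(elem.get("bytes"))
        elem.clear()
    return affected

sample = b"""<pages>
  <text bytes="123456789" />
  <text bytes="42">Actual revision text.</text>
</pages>"""
print(empty_text_nodes(BytesIO(sample)))  # ['123456789']
```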
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
Thank you for your patience while this issue is resolved.
-Adam
[1]
https://phabricator.wikimedia.org/T365501
[2]
https://www.mediawiki.org/xml/export-0.10.xsd
and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.
[3]
https://phabricator.wikimedia.org/T365155#9851025
and
https://phabricator.wikimedia.org/T365155#9851160
[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions
Hello,
I would like to upgrade stat1008 from buster to bullseye this Thursday
at approximately 09:15 UTC.
The upgrade is expected to take up to an hour, during which time
stat1008 will be unavailable for use. Work in your home directories will
be left untouched, so the impact should be low, especially if you are
using conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>.
If this maintenance window is likely to cause an issue for you, please
do let me know and I can look to reschedule the work. We will also be
available after the upgrade, in case you experience difficulties with
the upgraded operating system.
After the upgrade, stat1008 will have new SSH host fingerprints, so I
will update the SSH_Fingerprints/stat1008.eqiad.wmnet page
<https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wm…>
and provide some more guidance to help you reconnect.
Kind regards,
Ben
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase will be live-streamed tomorrow, Wednesday, May
15, at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1715790600>. The theme for this showcase is
*Reader to Editor Pipeline*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=G-8CbpcwGV8. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
Journey Transitions
By *Mike Raish and Daisy Chen*
What kinds of events do
readers and editors identify as separating the stages of their relationship
with Wikipedia, and which of these kinds of events might the Wikimedia
Foundation possibly support through design interventions? In the Journey
Transitions qualitative research project, the WMF Design Research team
interviewed readers and editors in Arabic, Spanish, and English in order to
answer these questions and provide guidance to WMF Product teams making
strategic decisions. A series of semi-structured interviews revealed that
readers and editors describe their relationships with Wikipedia in
different ways, with readers describing a static and transactional
relationship, and that even many experienced editors express confusion
about core functions of the Wikimedia ecosystem, such as the role of Talk
pages. This presentation will describe the Journey Transitions research, as
well as present its implications for the sponsoring Product teams in order
to shed light on the way that qualitative research is used to inform
strategic decisions in the Wikimedia Foundation.
Increasing participation in peer production communities with the Growth
features
By *Morten Warncke-Wang and Kirsten Stoller*
For peer production
communities to be sustainable, they must attract and retain new
contributors. Studies have identified social and technical barriers to
entry and discovered some potential solutions, but these solutions have
typically focused on a single highly successful community, the English
Wikipedia, been tested in isolation, and rarely evaluated through
controlled experiments. In this talk, we show how the Wikimedia
Foundation’s Growth team collaborates with Wikipedia communities to develop
and experiment with new features to improve the newcomer experience in
Wikipedia. We report findings from a large-scale controlled experiment
using the Newcomer Homepage, a central place where newcomers can learn how
peer production works and find opportunities to contribute, and show how
the effectiveness depends on the newcomer’s context. Lastly, we show how
the Growth team has continued developing features that further improve the
newcomer experience while adapting to community needs.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>