Hi all,
The next Research Showcase will be live-streamed next Wednesday, July 24,
at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1721838600>. The theme for this showcase is
*Machine Translation on Wikipedia*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/live/O7AqvHgqUVk. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide
By *Kai Zhu, Bocconi University*
Machine translation technologies have
the potential to bridge knowledge gaps across languages, promoting more
inclusive access to information regardless of native languages. This study
examines the impact of integrating Google Translate into Wikipedia's
Content Translation system in January 2019. Employing a natural experiment
design and difference-in-differences strategy, we analyze how this
translation technology shock influenced the dynamics of content production
and accessibility on Wikipedia across over a hundred languages. We find
that this technology integration leads to a 149% increase in content
production through translation, driven by existing editors becoming more
productive as well as an expansion of the editor base. Moreover, we observe
that machine translation enhances the propagation of biographical and
geographical information, helping to close these knowledge gaps in the
multilingual context. However, our findings also underscore the need for
continued efforts to mitigate the preexisting systemic barriers. Our study
contributes to our knowledge on the evolving role of artificial
intelligence in shaping knowledge dissemination through enhanced language
translation capabilities.
Implications of Using Inorganic Content in Arabic Wikipedia Editions
By *Saied Alshahrani and Jeanna Matthews, Clarkson University*
Wikipedia articles (content pages) are among the most widely
utilized training corpora for NLP tasks and systems, yet these articles are
not always created, generated, or even edited organically by native
speakers; some are automatically created, generated, or translated using
Wikipedia bots or off-the-shelf translation tools like Google Translate
without human revision or supervision. We first analyzed the three Arabic
Wikipedia editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic
(ARY), and found that these Arabic Wikipedia editions suffer from a few
serious issues, like large-scale automatic creations and translations from
English to Arabic, all without human involvement, generating articles that
lack not only linguistic richness and diversity but also cultural richness
and meaningful representation of the Arabic language and its native
speakers. We then studied the performance
implications of using such inorganic, unrepresentative articles to train
NLP tasks or systems, where we intrinsically evaluated the performance of
two main NLP upstream tasks, namely word representation and language
modeling, using word analogy and fill-mask evaluations. We found that most
of the models trained on organic, representative content outperformed, or
at worst performed on par with, the models trained on inorganic content
generated by bots or translated using templates, demonstrating that
training on unrepresentative content not only harms the representation of
native speakers but also degrades the performance of NLP tasks and
systems. We recommend avoiding the
automatically created, generated, or translated articles on Wikipedia when
the task is a representation-based task, like measuring opinions,
sentiments, or perspectives of native speakers, and also suggest that when
registered users employ automated creation or translation, their
contributions should be marked differently than “registered user” for
better transparency; perhaps “registered user (automation-assisted)”.
Best,
Kinneret
Hi everyone,
A new documentation website for the Wikimedia Analytics API (AQS) is now
available at [0].
The Wikimedia Analytics API (or AQS - Analytics Query Service) provides
analytics data, such as page views, unique devices, and editor counts,
for Wikipedia and other Wikimedia free-knowledge projects. For
example, you can use the API to:
- Get the number of devices that visited Wikipedia in a given month
- List the most-viewed or most-edited pages on Wikipedia
- Compare the number of editors that edited Wikipedia by country
The new website [0] replaces documentation previously available on
Wikitech (example: [1]) and in the REST API reference [2]. It
includes:
- explanations of core concepts used in Wikimedia analytics
- extensive examples
- tutorials with code written in Python
- a full API reference with sandboxes, allowing you to test API
requests directly in your browser
- links to other resources
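As a rough illustration of how such a request looks, here is a minimal
Python sketch for the most-viewed-pages endpoint of the pageviews API. The
helper function name is my own; see the API reference [0] for the
authoritative endpoint documentation.

```python
# Sketch: build a URL for the Wikimedia Analytics (AQS) pageviews API.
# The endpoint layout follows the public REST API at
# wikimedia.org/api/rest_v1; the helper name is illustrative only.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def top_pages_url(project: str, year: int, month: int, day: int,
                  access: str = "all-access") -> str:
    """Build the URL listing the most-viewed pages for one day."""
    return (f"{BASE}/top/{quote(project)}/{access}/"
            f"{year:04d}/{month:02d}/{day:02d}")

url = top_pages_url("en.wikipedia", 2024, 5, 1)
print(url)
# https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2024/05/01

# To fetch the JSON response, e.g.:
#   import urllib.request, json
#   data = json.load(urllib.request.urlopen(url))
```

The sandboxes on the new site let you try the same request directly in
your browser.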
To share feedback or report an issue with the new site, create a task
in Phabricator [3].
To request changes directly in the code repository [4], create a merge
request in GitLab.
On behalf of the Technical Documentation team and the Data Products team,
Kamil Bach
[0]: https://doc.wikimedia.org/analytics-api
[1]: https://wikitech.wikimedia.org/wiki/Data_Platform/AQS/Pageviews
[2]: https://wikimedia.org/api/rest_v1/#/
[3]: https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=data…
[4]: https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/analytics-api
--
Kamil Bach
Technical Writer
Wikimedia Foundation
Hi there,
I have a question about the right way to *extract the accurate time (or
revision ID) of when an article becomes a featured article (FA) or a good
article (GA)*.
The first and most straightforward method I tried was to extract the first
time that a *{{featured article}}* tag (or a {{good article}} one) is found
in the article revision text.
However, in some cases, this yields weird results, such as an article being
an FA on the same day it was created.
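Concretely, that first method boils down to something like this sketch
(the regex is a simplification I use for illustration, and the function
name is my own):

```python
import re
from typing import Iterable, Optional, Tuple

# Matches {{featured article}} / {{good article}} tags, tolerating case
# and extra whitespace; real template usage has more variants.
FA_GA = re.compile(r"\{\{\s*(featured|good)\s+article\s*\}\}", re.IGNORECASE)

def first_tagged_revision(
    revisions: Iterable[Tuple[str, str]]
) -> Optional[str]:
    """Return the timestamp of the first revision whose wikitext
    contains an FA/GA template, or None if none does.
    `revisions` is (timestamp, wikitext) pairs in chronological order."""
    for timestamp, text in revisions:
        if FA_GA.search(text):
            return timestamp
    return None

history = [
    ("2004-01-01T00:00:00Z", "Early stub text."),
    ("2006-07-15T12:00:00Z", "Expanded article. {{Featured article}}"),
]
print(first_tagged_revision(history))  # 2006-07-15T12:00:00Z
```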
Another method I tried (without much success) was to extract this
information from the article talk page under the "Article History" section.
However, not all pages have this section, and I'm not sure how reliable
this information is (do all editors add the promotion date, or perhaps the
nomination date?).
So, I wonder if anyone can advise on the best way to extract this
information in the most accurate way. Isn't there any database table that
holds such information?
Thank you all in advance!
Abraham
--
Best,
Abraham
---------
Abraham I.
Postdoc Researcher
University of Michigan | School of Information
pronouns: he/him
abraham.com <https://www.avrahami-israeli.com/>
Good morning,
If you do not use our Archiva
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva>
artifact repository service, you may ignore this message.
Apologies for the short notice. This is just to let you know that I will
be performing some maintenance work on our Archiva server today, which
will result in some brief periods of downtime for the service. One
part of this work is a disk storage change operation and the other is
an O/S upgrade. I will try to keep the downtime of the service to a minimum.
Apologies if this instability causes you any inconvenience. Please do
feel free to let me know if this impacts your work and I will try to
help you find a workaround.
Kind regards,
Ben
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] surfaced whereby the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Due to the bug, a nontrivial number of <text> nodes representing article
text appear empty, like so:
<text bytes="123456789" />
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated as it is a compute intensive operation.
More progress will be reported at T365155 and new dumps will eventually
show up on dumps.wikimedia.org.
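For anyone who wants to check whether an already-downloaded dump is
affected, a minimal Python sketch along these lines may help (namespace
handling is simplified, and the function name is illustrative, not part
of any official tooling):

```python
# Sketch: scan a dump stream for <text> nodes that claim a byte count
# but carry no content - the signature of the bug described above.
import xml.etree.ElementTree as ET
from io import BytesIO
from typing import IO, List

def empty_text_nodes(stream: IO[bytes]) -> List[str]:
    """Return the claimed byte counts of <text> nodes that have a
    bytes attribute but no actual text content."""
    affected = []
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Dump files use a mediawiki export namespace; match on the
        # local tag name so the sketch works with or without one.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            if elem.get("bytes") and not (elem.text or "").strip():
                affected.append(elem.get("bytes"))
        elem.clear()
    return affected

sample = b"""<pages>
  <text bytes="123456789" />
  <text bytes="42">Actual revision text.</text>
</pages>"""
print(empty_text_nodes(BytesIO(sample)))  # ['123456789']
```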
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
Thank you for your patience while this issue is resolved.
-Adam
[1]
https://phabricator.wikimedia.org/T365501
[2]
https://www.mediawiki.org/xml/export-0.10.xsd
and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.
[3]
https://phabricator.wikimedia.org/T365155#9851025
and
https://phabricator.wikimedia.org/T365155#9851160
[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions
Hello,
I would like to upgrade stat1008 from buster to bullseye this Thursday
at approximately 09:15 UTC.
The upgrade is expected to take up to an hour, during which time
stat1008 will be unavailable for use. Work in your home directories will
be left untouched, so the impact should be low, especially if you are
using conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>.
If this maintenance window is likely to cause an issue for you, please
do let me know and I can look to reschedule the work. We will also be
available after the upgrade, in case you experience difficulties with
the upgraded operating system.
After the upgrade, stat1008 will have new SSH host fingerprints, so I
will update the SSH_Fingerprints/stat1008.eqiad.wmnet page
<https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wm…>
and provide some more guidance to help you reconnect.
Kind regards,
Ben
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase will be live-streamed tomorrow, Wednesday, May
15, at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1715790600>. The theme for this showcase is
*Reader to Editor Pipeline*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=G-8CbpcwGV8. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
Journey Transitions
By *Mike Raish and Daisy Chen*
What kinds of events do
readers and editors identify as separating the stages of their relationship
with Wikipedia, and which of these kinds of events might the Wikimedia
Foundation possibly support through design interventions? In the Journey
Transitions qualitative research project, the WMF Design Research team
interviewed readers and editors in Arabic, Spanish, and English in order to
answer these questions and provide guidance to WMF Product teams making
strategic decisions. A series of semi-structured interviews revealed that
readers and editors describe their relationships with Wikipedia in
different ways, with readers describing a static and transactional
relationship, and that even many experienced editors express confusion
about core functions of the Wikimedia ecosystem, such as the role of Talk
pages. This presentation will describe the Journey Transitions research, as
well as present its implications for the sponsoring Product teams in order
to shed light on the way that qualitative research is used to inform
strategic decisions in the Wikimedia Foundation.
Increasing participation in peer production communities with the Growth
features
By *Morten Warncke-Wang and Kirsten Stoller*
For peer production
communities to be sustainable, they must attract and retain new
contributors. Studies have identified social and technical barriers to
entry and discovered some potential solutions, but these solutions have
typically focused on a single highly successful community, the English
Wikipedia, been tested in isolation, and rarely evaluated through
controlled experiments. In this talk, we show how the Wikimedia
Foundation’s Growth team collaborates with Wikipedia communities to develop
and experiment with new features to improve the newcomer experience in
Wikipedia. We report findings from a large-scale controlled experiment
using the Newcomer Homepage, a central place where newcomers can learn how
peer production works and find opportunities to contribute, and show how
the effectiveness depends on the newcomer’s context. Lastly, we show how
the Growth team has continued developing features that further improve the
newcomer experience while adapting to community needs.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>