I was asked recently what I knew about the types of tools that use data from the https://dumps.wikimedia.org/ project. I had to admit that I really didn't know of many tools off the top of my head that relied on dumps. Most of the use cases I have heard about are for research topics like looking at word frequencies and sentence complexity, or machine learning things that consume some or all of the wiki corpus.
Do you run a tool that needs data from Dumps to do its job? I would love to hear some stories about how this data helps folks advance the work of the movement.
Bryan
From time to time I use dumps to parse data that I cannot get via SQL or the API. For example, this summer I fetched the Wikimedia Commons page history to get the list of old categories of images, so that my bot would not re-insert categories that had been removed from a photo at least once.
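Roughly, the idea looks like this (a minimal sketch, not the actual bot code; the dump file name, export namespace version, and the category regex are all illustrative):

    # Find categories that were present in an old revision of a Commons page
    # but are missing from its newest revision, using a pages-meta-history dump.
    import bz2
    import re
    from xml.etree import ElementTree as ET

    DUMP = "commonswiki-latest-pages-meta-history.xml.bz2"  # hypothetical local file
    NS = "{http://www.mediawiki.org/xml/export-0.11/}"      # namespace varies by dump version
    CAT_RE = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

    def removed_categories(dump_path):
        """Yield (page title, removed category) pairs."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f):
                if elem.tag != NS + "page":
                    continue
                title = elem.findtext(NS + "title", default="")
                seen, current = set(), set()
                for rev in elem.findall(NS + "revision"):   # oldest revision first
                    text = rev.findtext(NS + "text", default="") or ""
                    current = {m.strip() for m in CAT_RE.findall(text)}
                    seen |= current
                for cat in seen - current:   # ever present, gone from the latest revision
                    yield title, cat
                elem.clear()                 # keep memory use bounded

    if __name__ == "__main__":
        for title, cat in removed_categories(DUMP):
            print(f"{title}\t{cat}")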
Br, -- Kimmo Virtanen, Zache
Oh I don't know where to even start:
AI/ML done by me:
- https://www.mediawiki.org/wiki/User:Ladsgroup/masz uses dumps to help checkusers of 11 wikis do their work more efficiently.
- A tool that automatically finds the "bad words" of each wiki using the edit-history dumps, which later got used in ORES and abuse filters (a rough sketch of the general idea follows below): https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47
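The bad-word mining boils down to something like this (just an illustration, not the real code; the input format here is hypothetical, the real tool works from the edit-history dumps in the linked gist):

    # Words that are added far more often in edits that were later reverted than
    # in edits that were kept are candidate "bad words".
    from collections import Counter

    def bad_word_candidates(edits, min_count=5, min_ratio=5.0):
        """edits: iterable of (added_words, was_reverted) pairs."""
        reverted, kept = Counter(), Counter()
        for added_words, was_reverted in edits:
            (reverted if was_reverted else kept).update(w.lower() for w in added_words)
        candidates = []
        for word, bad in reverted.items():
            good = kept.get(word, 0)
            ratio = bad / (good + 1)   # +1 smoothing to avoid division by zero
            if bad >= min_count and ratio >= min_ratio:
                candidates.append((word, bad, good, ratio))
        return sorted(candidates, key=lambda c: c[3], reverse=True)

    # Tiny usage example with made-up data:
    sample = [(["idiot", "the"], True), (["idiot"], True), (["citation", "the"], False)]
    print(bad_word_candidates(sample, min_count=2, min_ratio=2.0))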
General research done by others:
- There are around 5,000 papers indexed in Google Scholar that explicitly mention "dumps.wikimedia.org": https://scholar.google.com/scholar?hl=de&as_sdt=0%2C5&q=%22dumps.wik...
Other projects I remember off the top of my head:
- https://www.wikidata.org/wiki/Wikidata:Automated_finding_references_input
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia: cleaning up articles' syntax and styling issues
- I have also done all sorts of small clean-ups using dumps:
  - guessing the name of a person in another language based on how the first name and last name are used in Wikidata
  - the same for translating the names of species into Persian using their scientific names (e.g. Acacia sibirica -> آکاسیای سیبری, which is Siberian Acacia)
  - finding duplicate items to merge
  - many more I can't remember right now.
Hope that helps
Hi Bryan,
Socksfinder¹ uses stub-meta-history to build an index of edits, which is used to find how often multiple accounts edit the same articles and to help identify sockpuppets.
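In case it helps to picture it, the core of the co-editing overlap is something like this (just a sketch of the idea with made-up data structures; the real index format is in the socksfinder repository):

    # Given page -> editors (available from stub-meta-history without any page
    # text), count how many distinct pages each pair of accounts edited in common.
    from collections import Counter
    from itertools import combinations

    def coedit_counts(page_editors):
        """page_editors: dict mapping page title -> set of usernames."""
        pair_counts = Counter()
        for editors in page_editors.values():
            for a, b in combinations(sorted(editors), 2):
                pair_counts[(a, b)] += 1
        return pair_counts

    pages = {
        "Article A": {"Alice", "Bob", "Mallory2"},
        "Article B": {"Alice", "Mallory2"},
        "Article C": {"Bob"},
    }
    for (a, b), n in coedit_counts(pages).most_common():
        print(f"{a} and {b} edited {n} page(s) in common")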
Arkbot² uses pages-articles to build various lists of articles that share certain properties, to help maintenance projects on the French Wikipedia (list of pages not linked to something…). Some of these lists used to be provided as special pages by MediaWiki and were disabled because of performance concerns; others are too specific to be part of MediaWiki.
In both cases, I'm not 100 % certain dumps are the best approach (I've been thinking about using SQL queries on replicas and some of the available APIs), but it works well enough™ and no other approach was so obviously better (if at all) for me to feel an urgent need to rewrite my tools.
Also in the past I've used Wikidata dumps to explore the limits of some RDF tools, found the limits faster than I thought, and moved on to other hobbies :-)
Best regards,
¹ https://socksfinder.toolforge.org/ (https://github.com/Arkanosis/socksfinder) ² https://fr.wikipedia.org/wiki/Utilisateur:Arkbot (https://github.com/Arkanosis/arkbot-rs)
YiFeiBot uses dumps to find a list of pages with interlanguage links, for the interlanguage link removal task. It does this by processing each page's wikitext through a regex.
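For a rough idea of what that check can look like (the exact pattern YiFeiBot uses may differ; this one only approximates language codes and can also match some other interwiki prefixes):

    import re

    # Matches old-style interlanguage links like [[de:Titel]] or [[zh-min-nan:...]].
    INTERLANG_RE = re.compile(r"\[\[\s*([a-z]{2,3}(?:-[a-z]+)*)\s*:[^\]]+\]\]")

    def has_interlanguage_link(wikitext):
        return INTERLANG_RE.search(wikitext) is not None

    print(has_interlanguage_link("Some text.\n[[de:Beispiel]]"))   # True
    print(has_interlanguage_link("A [[normal link|link]] only."))  # False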
QRank https://qrank.wmcloud.org/ uses dumps (plus access logs) to compute a ranking signal for Wikidata items.
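For readers unfamiliar with QRank, a very simplified illustration of turning page-level view counts into an item-level signal looks like this (the real pipeline is considerably more involved; the inputs below are made-up stand-ins):

    # Map each viewed page to its Wikidata item via sitelinks taken from a dump,
    # then sum views per item.
    from collections import Counter

    def rank_items(sitelinks, pageviews):
        """sitelinks: dict (site, page title) -> Wikidata item id.
        pageviews: iterable of ((site, page title), view count) pairs."""
        scores = Counter()
        for key, views in pageviews:
            item = sitelinks.get(key)
            if item:
                scores[item] += views
        return scores.most_common()

    sitelinks = {("enwiki", "Douglas Adams"): "Q42", ("dewiki", "Douglas Adams"): "Q42"}
    views = [(("enwiki", "Douglas Adams"), 1200), (("dewiki", "Douglas Adams"), 300)]
    print(rank_items(sitelinks, views))   # [('Q42', 1500)]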
— Sascha
Without wanting to take away from this thread--folks who respond here, could you please also have a look at T365693: MediaWiki Dumps XML - Provide attribute to indicate that user is temporary account in exported content https://phabricator.wikimedia.org/T365693 and add your feedback? (Or just reply to this list and I'll make sure it's captured on the task.)
Thanks, Kosta
Nice thread. Wikimedia Switzerland is another happy user of Wikimedia dumps, on a project called Cassandra, designed by Synapta s.r.l. in 2017. The Cassandra web tool shows things like user contributions for the Swiss national gallery, historical page views, and other "uh! :O" things about Swiss GLAMs in general. It relies on the daily mediacounts dumps, plus it features tiny and happy esoteric SSH reverse tunnels holding hands to reach the Wiki Replicas from the outside (since T318191).
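For the curious, the mediacounts part boils down to something like this (a minimal sketch, not Cassandra's actual code; the column layout is an assumption, so check the README next to the mediacounts files):

    # Sum daily transfer counts for files of interest (e.g. files uploaded by a
    # partner institution) from one day's mediacounts TSV.
    import bz2
    import csv

    def daily_counts(mediacounts_path, wanted_files):
        """wanted_files: set of base names like '/wikipedia/commons/a/ab/Foo.jpg'."""
        totals = {}
        with bz2.open(mediacounts_path, "rt", encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row and row[0] in wanted_files:
                    totals[row[0]] = int(row[2])   # assumed: 3rd column = total transfers
        return totals

    # Hypothetical usage:
    # counts = daily_counts("mediacounts.2024-10-01.v00.tsv.bz2",
    #                       {"/wikipedia/commons/a/ab/Foo.jpg"})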
Some relevant "uh" links:
https://stats.wikimedia.swiss/SNL/page-views
https://stats.wikimedia.swiss/SNL/category-network
https://stats.wikimedia.swiss/SNL/usage
https://stats.wikimedia.swiss/SNL/dashboard
https://gitlab.wikimedia.org/repos/wikimedia-ch/cassandra-GLAM-tools
Wikimedia Switzerland is doing the tool handover + recovery right now.
https://phabricator.wikimedia.org/T374209
Feel free to ask questions, but I don't know the answers :) See you.
-[[User:ValerioBoz-WMCH]]
Hi all,
I maintain a tool, running on Toolforge, that helps users easily fix misspellings and style issues on the Spanish and Galician Wikipedias:
https://github.com/benjavalero/replacer
Although pages are also parsed on the fly, there is a background parse of all pages thanks to the "pages-articles" dump.
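As a tiny illustration of that background pass (not the tool's real code; the two example misspellings are illustrative only, the real rules live in Replacer's own lists):

    # Scan one page's wikitext against a list of known misspellings and record
    # potential replacements for later review.
    import re

    MISSPELLINGS = {
        "aprobechar": "aprovechar",
        "inagurar": "inaugurar",
    }
    PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, MISSPELLINGS)) + r")\b", re.IGNORECASE)

    def find_replacements(wikitext):
        """Return (offset, wrong form, suggestion) tuples for one page's wikitext."""
        return [(m.start(), m.group(1), MISSPELLINGS[m.group(1).lower()])
                for m in PATTERN.finditer(wikitext)]

    print(find_replacements("Se va a inagurar el museo."))
    # [(8, 'inagurar', 'inaugurar')]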
Regards,
Monthly dumps are used for the following:
1) Checkwiki - helps clean up syntax and other errors in the source code for several languages and wiki types - pages-articles - https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia
2) TemplateParameters - displays template parameter usage, uses TemplateData to validate parameter name usage, for several languages and wiki types - pages-articles - https://bambots.brucemyers.com/TemplateParam.php
3) Wikidata Class Browser - class tree with statistics and common class properties - pages-articles - https://bambots.brucemyers.com/WikidataClasses.php
4) Wikidata NavelGazer - user editing statistics - stub-meta-history, change_tag.sql - https://bambots.brucemyers.com/NavelGazer.php (a minimal sketch of this kind of processing follows the list)
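As a minimal sketch of the kind of per-user statistics that stub-meta-history supports without any page text (the file name and XML namespace version below are assumptions; they vary between wikis and dump runs):

    # Count edits per username from a stub-meta-history dump.
    import gzip
    from collections import Counter
    from xml.etree import ElementTree as ET

    DUMP = "xxwiki-latest-stub-meta-history.xml.gz"     # hypothetical file name
    NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # namespace varies by dump version

    def edit_counts(dump_path):
        counts = Counter()
        with gzip.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f):
                if elem.tag == NS + "contributor":
                    name = elem.findtext(NS + "username") or elem.findtext(NS + "ip") or "?"
                    counts[name] += 1
                    elem.clear()
                elif elem.tag == NS + "page":
                    elem.clear()
        return counts

    if __name__ == "__main__":
        for user, n in edit_counts(DUMP).most_common(20):
            print(f"{n}\t{user}")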