Oh I don't know where to even start:
AI/ML done by me:
- https://www.mediawiki.org/wiki/User:Ladsgroup/masz : this uses dumps to help checkusers on eleven wikis do their work more efficiently.
- A tool that automatically finds the "bad words" of each wiki from the edit-history dumps; the results were later used in ORES and abuse filters: https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47 (a rough sketch of the general idea follows below).
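Not the actual code from that gist, just a minimal sketch of one way the idea could work, under my own assumptions: walk a pages-meta-history dump, treat an edit as reverted if a later revision restores an earlier revision's exact text, and score words by how often they were introduced by reverted edits. The dump path and XML namespace are placeholders.

import bz2
import hashlib
import re
from collections import Counter
from xml.etree import ElementTree as ET

DUMP_PATH = "wiki-latest-pages-meta-history.xml.bz2"  # hypothetical path
NS = "{http://www.mediawiki.org/xml/export-0.11/}"    # namespace varies by dump version
WORD_RE = re.compile(r"\w+", re.UNICODE)

def words(text):
    return set(WORD_RE.findall(text.lower()))

reverted_counts = Counter()  # words added by edits that were later reverted
all_counts = Counter()       # words added by any edit

with bz2.open(DUMP_PATH, "rb") as stream:
    for _, elem in ET.iterparse(stream):
        if elem.tag != NS + "page":
            continue
        texts = [rev.findtext(NS + "text") or "" for rev in elem.iter(NS + "revision")]
        hashes = [hashlib.sha1(t.encode("utf-8")).hexdigest() for t in texts]
        for i in range(1, len(texts)):
            added = words(texts[i]) - words(texts[i - 1])
            all_counts.update(added)
            # identity revert: a later revision restores the pre-edit text
            if hashes[i] != hashes[i - 1] and hashes[i - 1] in hashes[i + 1:]:
                reverted_counts.update(added)
        elem.clear()  # keep memory bounded

# crude score: how strongly a word is associated with reverted edits
scores = {
    w: reverted_counts[w] / all_counts[w]
    for w in reverted_counts
    if all_counts[w] >= 20  # ignore rare words
}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:50]:
    print(f"{score:.2f}\t{word}")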
General research done by others:
- There are around 5,000 papers indexed in Google Scholar that explicitly mention "dumps.wikimedia.org": https://scholar.google.com/scholar?hl=de&as_sdt=0%2C5&q=%22dumps.wik...
Other projects I remember off the top of my head:
- https://www.wikidata.org/wiki/Wikidata:Automated_finding_references_input
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia, which cleans up syntax and styling issues in articles
- All sorts of small clean-ups I have done using dumps
- Guessing the name of a person in another language based on how their first name and last name are used elsewhere on Wikidata (a rough sketch of this one follows the list)
- The same for translating species names into Persian using their scientific names (e.g. Acacia sibirica -> آکاسیای سیبری, which is Siberian Acacia)
- Finding duplicate items to merge
- Many more I can't remember right now.
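Here is a minimal sketch of how the name-guessing idea could be done with the Wikidata JSON dump; it is my own approximation, not the original script. It combines the Persian labels of a person's given-name (P735) and family-name (P734) items. The dump path is a placeholder, and the class list for name items is deliberately incomplete.

import gzip
import json

DUMP_PATH = "latest-all.json.gz"  # hypothetical path to the Wikidata JSON dump
LANG = "fa"
NAME_CLASSES = {"Q202444", "Q12308941", "Q11879590"}  # given name, male/female given name; family names use Q101352

def entities(path):
    # The dump is one big JSON array, but each entity sits on its own line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line.startswith("{"):
                yield json.loads(line)

def claim_targets(entity, prop):
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            yield snak["datavalue"]["value"]["id"]

# Pass 1: collect Persian labels of name items (given and family names).
name_labels = {}
for e in entities(DUMP_PATH):
    if any(t in NAME_CLASSES or t == "Q101352" for t in claim_targets(e, "P31")):
        label = e.get("labels", {}).get(LANG, {}).get("value")
        if label:
            name_labels[e["id"]] = label

# Pass 2: for humans without a Persian label, compose one from their name items.
for e in entities(DUMP_PATH):
    if LANG in e.get("labels", {}):
        continue
    if "Q5" not in claim_targets(e, "P31"):  # Q5 = human
        continue
    given = [name_labels.get(q) for q in claim_targets(e, "P735")]
    family = [name_labels.get(q) for q in claim_targets(e, "P734")]
    parts = [p for p in given + family if p]
    if parts:
        print(e["id"], " ".join(parts))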
Hope that helps
On Tue, Oct 8, 2024 at 18:06, Kimmo Virtanen <kimmo.virtanen@wikimedia.fi> wrote:
From time to time I use dumps to parse data that I cannot get via SQL or the API. For example, this summer I fetched the Wikimedia Commons page history to get the list of old categories of images, so that my bot would not re-insert categories that had been removed from a photo at least once. (A rough sketch of that kind of check is below.)
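This is not my actual code, just a minimal sketch of the idea under my own assumptions: walk a Commons pages-meta-history dump and, for each file page, list categories that appeared in some old revision but are missing from the latest one. The dump path and XML namespace are placeholders.

import bz2
import re
from xml.etree import ElementTree as ET

DUMP_PATH = "commonswiki-latest-pages-meta-history.xml.bz2"  # hypothetical path
NS = "{http://www.mediawiki.org/xml/export-0.11/}"           # varies by dump version
CAT_RE = re.compile(r"\[\[Category:([^\]\|]+)", re.IGNORECASE)

def categories(wikitext):
    return {m.strip() for m in CAT_RE.findall(wikitext or "")}

with bz2.open(DUMP_PATH, "rb") as stream:
    for _, elem in ET.iterparse(stream):
        if elem.tag != NS + "page":
            continue
        title = elem.findtext(NS + "title") or ""
        if title.startswith("File:"):
            # revisions are ordered oldest to newest within a page
            revisions = [categories(rev.findtext(NS + "text"))
                         for rev in elem.iter(NS + "revision")]
            if revisions:
                ever = set().union(*revisions)  # categories seen in any revision
                removed = ever - revisions[-1]  # ...but missing from the latest one
                if removed:
                    print(title, sorted(removed))
        elem.clear()  # keep memory bounded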
Br,
--
Kimmo Virtanen, Zache
On Tue, Oct 8, 2024 at 6:59 PM Bryan Davis bd808@wikimedia.org wrote:
I was asked recently what I knew about the types of tools that use data from the https://dumps.wikimedia.org/ project. I had to admit that I really didn't know of many tools off the top of my head that relied on dumps. Most of the use cases I have heard about are for research topics like looking at word frequencies and sentence complexity, or machine learning things that consume some or all of the wiki corpus.
Do you run a tool that needs data from Dumps to do its job? I would love to hear some stories about how this data helps folks advance the work of the movement.
Bryan
Bryan Davis
Wikimedia Foundation Principal Software Engineer
Boise, ID USA
[[m:User:BDavis_(WMF)]]
irc: bd808
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/