Hello data dump enthusiasts,
I've been working on a Python API for downloading Wikimedia data dump files, and I've just released its earliest version here: https://github.com/jon-edward/wiki_dump.
I'd love to know what you all think (how it can be improved, how you are interested in using it, positive/negative remarks) if you can spare the time.
All the best, jon-edward
On Fri, 1 Apr 2022, at 02:26, arithmatlic@gmail.com wrote:
Great, I was looking for something like this. I had some trouble getting the example to run, so I opened an issue on GitHub. I'm interested in support for the new HTML dumps, and in logic to automatically read files from local storage when the code runs on Toolforge.
– Jan
Closed the issue with a fix. The data dumps I was testing against had SHA-1 sums for every file, but I learned from the newest dump that a sum isn't guaranteed for every file. Thank you for bringing that to my attention!
- Jon
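[Editor's note: the fix Jon describes, treating a file's SHA-1 sum as optional rather than required, could be sketched roughly as follows. The function name and interface here are hypothetical, not taken from the wiki_dump library itself.]

```python
import hashlib
from typing import Optional


def verify_sha1(path: str, expected: Optional[str]) -> bool:
    """Check a downloaded dump file against its published SHA-1 sum.

    `expected` may be None, since not every file in a dump has a
    published SHA-1 sum; in that case the file is accepted without
    verification instead of raising an error.
    """
    if expected is None:
        return True  # no checksum published for this file; skip verification
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in 64 KiB chunks to avoid loading large dump files into memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```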
xmldatadumps-l@lists.wikimedia.org