There's an ongoing discussion in ops about improving the dump process; see:
https://phabricator.wikimedia.org/T88728
https://phabricator.wikimedia.org/T93396
https://phabricator.wikimedia.org/T17017
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improv…
I would like to join in and add our requirements and thoughts to the list, and
would like some input on that. So far I have:
Make it easier to register a new type of dump via a config change.
A dump should define:
* the script(s) to run
* output file(s)
* the dump schedule
* a short name
* a brief description (wikitext or HTML? translatable?)
* required input files (maybe)
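To make the list above concrete, here is a rough sketch of what a config-driven dump definition could look like. Everything in it is an assumption for illustration only: the field names, the `example-json` dump, the script and file names, and the "weekly" schedule value are all made up, not taken from the existing dump infrastructure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DumpDefinition:
    """One registered dump type (hypothetical field names)."""
    name: str          # short name
    description: str   # brief description; markup format still undecided
    scripts: tuple     # script(s) to run
    outputs: tuple     # output file(s)
    schedule: str      # dump schedule, e.g. "weekly" (format is an assumption)
    inputs: tuple = () # required input files, if we decide to track them


def load_definitions(config):
    """Build a name -> DumpDefinition registry from a plain config mapping."""
    registry = {}
    for name, entry in config.items():
        registry[name] = DumpDefinition(
            name=name,
            description=entry["description"],
            scripts=tuple(entry["scripts"]),
            outputs=tuple(entry["outputs"]),
            schedule=entry["schedule"],
            inputs=tuple(entry.get("inputs", ())),
        )
    return registry


# Hypothetical config entry; registering a new dump type would mean adding
# one more key here, with no code change.
EXAMPLE_CONFIG = {
    "example-json": {
        "description": "Example JSON dump of all entities",
        "scripts": ["make_example_dump.sh"],
        "outputs": ["example.json.gz"],
        "schedule": "weekly",
    },
}
```

The point of the sketch is only that the whole definition fits in one declarative entry; the actual config format is open.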
Make clear timelines of consistent dumps.
* drop the misleading "one dir with one timestamp for all dumps" approach
* have one timeline per dump instead
* for dumps that are guaranteed to be consistent (one generated from the other),
generate a timeline of directories with symlinks to the actual files.
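The symlink idea could be sketched roughly as follows. This assumes each dump already writes its files into its own per-dump timeline, and that some other step knows which files form a consistent set; the function name and directory layout are hypothetical.

```python
import os


def link_consistent_set(timeline_root, timestamp, files):
    """Create timeline_root/timestamp/ holding one symlink per actual file.

    `files` are paths to dump files that are known to be consistent with
    each other; deciding that is outside the scope of this sketch.
    """
    target_dir = os.path.join(timeline_root, timestamp)
    os.makedirs(target_dir, exist_ok=True)
    links = []
    for path in files:
        link = os.path.join(target_dir, os.path.basename(path))
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link from an earlier run
        # Relative targets keep the tree valid if it is moved or mirrored.
        os.symlink(os.path.relpath(path, target_dir), link)
        links.append(link)
    return links
```

The actual files stay in their per-dump timelines; the consistent timeline contains nothing but links.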
Make dumps discoverable:
* There should be a machine-readable overview of which dumps exist in which
versions for each project.
* This overview should be a JSON document (it may even be a static file)
* Perhaps we also want a DCAT-AP description of our dumps
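As a strawman for the JSON overview, something like the following would do. The document structure, key names, and sample values are assumptions, not a proposal for the final schema; timestamps are assumed to be YYYYMMDD strings so that they sort correctly as text.

```python
import json


def build_overview(project, runs):
    """Build the overview for one project.

    `runs` is an iterable of (dump_name, timestamp, [file names]) tuples.
    Versions are listed newest first for each dump.
    """
    dumps = {}
    for name, timestamp, files in runs:
        dumps.setdefault(name, []).append(
            {"timestamp": timestamp, "files": sorted(files)}
        )
    for versions in dumps.values():
        versions.sort(key=lambda v: v["timestamp"], reverse=True)
    return {"project": project, "dumps": dumps}


# Hypothetical sample data.
overview = build_overview(
    "examplewiki",
    [
        ("example-json", "20150401", ["example.json.gz"]),
        ("example-json", "20150408", ["example.json.gz"]),
    ],
)
overview_json = json.dumps(overview, indent=2, sort_keys=True)
```

Since the result is plain data, it could be regenerated after each dump run and served as a static file.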
Promote stable URLs:
* The latest dump of any type should be available under a stable, predictable URL.
* TBD: the "latest" URL could point to a symlink, be rewritten to the actual file,
or trigger an HTTP redirect.
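For the HTTP redirect variant, the lookup could be as simple as the sketch below. The URL layout `/<project>/<dump>/latest/<file>` is purely an assumption, as is the idea that file names do not embed the timestamp; the function only computes a status code and a target path and is not tied to any web server.

```python
def resolve_latest(path, available):
    """Map /<project>/<dump>/latest/<file> to the newest dated path.

    `available` maps (project, dump) to a list of YYYYMMDD timestamps.
    Returns (status, location); location is None when nothing matches.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[2] != "latest":
        return 404, None
    project, dump, _, filename = parts
    timestamps = available.get((project, dump))
    if not timestamps:
        return 404, None
    newest = max(timestamps)  # YYYYMMDD strings compare correctly as text
    return 302, "/{}/{}/{}/{}".format(project, dump, newest, filename)
```

The symlink and rewrite variants would need the same "which version is newest" lookup; only the way the result is delivered differs.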
Thoughts? Comments? Additions?
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.