There's an ongoing discussion in ops about improving the dump process, see
https://phabricator.wikimedia.org/T88728
https://phabricator.wikimedia.org/T93396
https://phabricator.wikimedia.org/T17017
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve...
I would like to join in and add our requirements and thoughts to the list, and would like some input on that. So far I have:
Make it easier to register a new type of dump via a config change. A dump should define:
* the script(s) to run
* the output file(s)
* the dump schedule
* a short name
* a brief description (wikitext or HTML? translatable?)
* required input files (maybe)
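To make the list above concrete, here is a minimal sketch of what one registered dump could look like once loaded from config. All field names, the example script path, and the cron-style schedule string are my own assumptions for illustration, not an existing format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Keys every dump definition must provide (hypothetical names).
REQUIRED_KEYS = ("short_name", "description", "scripts", "output_files", "schedule")


@dataclass(frozen=True)
class DumpDefinition:
    short_name: str
    description: str          # wikitext or HTML is still an open question
    scripts: List[str]        # script(s) to run, in order
    output_files: List[str]   # file(s) the scripts are expected to produce
    schedule: str             # assumed here to be a cron-style string
    required_inputs: List[str] = field(default_factory=list)  # optional


def parse_dump_definition(raw: Dict[str, Any]) -> DumpDefinition:
    """Validate one config entry and turn it into a DumpDefinition."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError("dump definition is missing: " + ", ".join(missing))
    return DumpDefinition(
        short_name=raw["short_name"],
        description=raw["description"],
        scripts=list(raw["scripts"]),
        output_files=list(raw["output_files"]),
        schedule=raw["schedule"],
        required_inputs=list(raw.get("required_inputs", [])),
    )


# Hypothetical config entry for illustration only.
example = parse_dump_definition({
    "short_name": "json",
    "description": "Full JSON dump of all entities",
    "scripts": ["dumpJson.php"],
    "output_files": ["all.json.gz"],
    "schedule": "0 3 * * 1",
})
```

With something like this, registering a new dump would mean adding one more entry to a config file rather than touching the dump infrastructure itself.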
Establish clear timelines of consistent dumps:
* drop the misleading "one dir with one timestamp for all dumps" approach
* have one timeline per dump instead
* for dumps that are guaranteed to be consistent (one generated from the other), generate a timeline of directories with symlinks to the actual files.
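The symlink idea could look roughly like the sketch below: each dump keeps its own timeline, and a separate "consistent set" timeline only holds symlinks to the actual files. The function name and directory layout are assumptions of mine, and this assumes a POSIX file system where symlinks are available.

```python
import os
from typing import Dict


def publish_consistent_set(timeline_root: str, timestamp: str,
                           files: Dict[str, str]) -> str:
    """Create <timeline_root>/<timestamp>/ holding symlinks to actual files.

    `files` maps the link name to the path of the real dump file,
    which stays in the timeline of the dump that produced it.
    """
    target_dir = os.path.join(timeline_root, timestamp)
    os.makedirs(target_dir, exist_ok=True)
    for link_name, actual_path in files.items():
        link_path = os.path.join(target_dir, link_name)
        # Replace a stale link so the function can be re-run safely.
        if os.path.lexists(link_path):
            os.remove(link_path)
        os.symlink(os.path.abspath(actual_path), link_path)
    return target_dir
```

Only sets that are known to be consistent would get a directory here, so the presence of a timestamped directory would itself be the consistency guarantee.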
Make dumps discoverable:
* There should be a machine readable overview of which dumps exist in which versions for each project.
* This overview should be a JSON document (may even be static)
* Perhaps we also want a DCAT-AP description of our dumps
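As a rough idea of what the JSON overview might contain, here is a small sketch that builds it from a mapping of project to dump type to available versions. The document structure, the key names, and the assumption that versions are YYYYMMDD strings (so they sort lexicographically) are all mine.

```python
import json
from typing import Dict, List


def build_overview(available: Dict[str, Dict[str, List[str]]]) -> str:
    """Render a static JSON overview of dumps per project.

    `available` maps project -> dump short name -> list of version strings.
    Dump types without any version are left out.
    """
    overview = {
        "projects": {
            project: {
                dump: {"versions": sorted(versions), "latest": max(versions)}
                for dump, versions in dumps.items()
                if versions
            }
            for project, dumps in available.items()
        }
    }
    return json.dumps(overview, indent=2, sort_keys=True)


# Hypothetical input for illustration only.
document = build_overview({
    "wikidatawiki": {"json": ["20150315", "20150301"], "rdf": []},
})
```

Since the output only changes when a dump finishes, it could be regenerated at that point and served as a static file.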
Promote stable URLs:
* The latest dump of any type should be available under a stable, predictable URL.
* TBD: "latest" URL could point to a symlink, get rewritten to the actual file, or trigger an HTTP redirect.
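Whichever of the three mechanisms we pick, the mapping from the "latest" URL to the actual file is the same. A minimal sketch of that mapping follows, assuming a hypothetical path layout of /<project>/<dump>/latest/<file> and YYYYMMDD version strings; the result could be used as a rewrite target or as the Location of a redirect.

```python
from typing import Dict, List, Optional, Tuple


def resolve_latest(path: str,
                   versions: Dict[Tuple[str, str], List[str]]) -> Optional[str]:
    """Map /<project>/<dump>/latest/<file> to the newest versioned path.

    Returns None if the path does not follow the assumed layout or no
    version of that dump is known.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[2] != "latest":
        return None
    project, dump, _, filename = parts
    known = versions.get((project, dump))
    if not known:
        return None
    return "/{}/{}/{}/{}".format(project, dump, max(known), filename)
```

For the symlink option, the same lookup would run once when a dump finishes, instead of on every request.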
Thoughts? Comments? Additions?
wikidata-tech@lists.wikimedia.org