Hi, Samuel,

Sorry for not communicating earlier; all my work happened in the open [0], but I didn't want to make any public announcements until there was a 100% completed run! :-)

> I have the feeling the bulk of Commons media (~300 TB in all) is not mirrored anywhere right now
> I saw something related mentioned on phab? within the last year, but can't find it now.

So this is the current state of multimedia storage:

* There are 3 copies of each file on the live OpenStack Swift cluster in WMF's eqiad datacenter in Virginia
* There is almost-real-time replication of eqiad's multimedia cluster into the codfw datacenter in Texas, with its own 3 separate copies
* Images can be, and regularly are, served from both datacenters, which protects against local disasters such as flooding or earthquakes (see the sketch below)
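
In case it helps picture it, here is a minimal sketch (plain Python with python-swiftclient, not the actual WMF tooling) of what "the same object is reachable from either datacenter" means in practice; the endpoint URLs, credentials and container/object names below are all hypothetical:

    # Minimal sketch, not the real tooling: HEAD the same object against a
    # hypothetical Swift endpoint in each datacenter and compare metadata.
    # Endpoints, credentials and names are made up for illustration.
    from swiftclient.client import Connection

    ENDPOINTS = {
        "eqiad": "https://swift.eqiad.example.org/auth/v1.0",  # hypothetical
        "codfw": "https://swift.codfw.example.org/auth/v1.0",  # hypothetical
    }

    def check_object(container: str, obj: str) -> None:
        for dc, authurl in ENDPOINTS.items():
            conn = Connection(authurl=authurl, user="example:user", key="secret")
            headers = conn.head_object(container, obj)  # metadata only, no download
            print(dc, headers.get("etag"), headers.get("content-length"))

    check_object("commons-media", "Example.jpg")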

That has been the case for a few years already; the following is new! :-)

I (with the assistance of many other WMF engineers) started working on an offline/offsite backup solution for all multimedia files at the end of 2020, one that would protect against application bugs, operator mistakes and potential ill-intentioned unauthorized users. The system required a completely different backup workflow than that of our regular backups (wikitext or otherwise) due to the nature and size of multimedia files: a large, append-only store. We were also hit with long hardware delays for a while due to supplier shortages.

I advocated at first for solving multimedia backups and dumps at the same time, but this was not possible: because of how wiki file permissions are currently handled in the MediaWiki software, it is not just a question of "creating bundles of images". MediaWiki image storage lacks basic features such as a unique identifier for each uploaded file, and it still uses SHA-1 hashing, which is known to produce collisions. This doesn't impact full backups, which are just "copying everything" privately (although I had to reimplement some of that functionality myself), but it does make it hard to identify individual files in order to update the status of already publicly available ones.
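
To make the hashing point concrete, here is a minimal sketch (plain Python, nothing MediaWiki-specific, file name hypothetical) of the kind of stronger content digest a modern model could store alongside, or instead of, the current SHA-1:

    # Minimal sketch: stream a file once and compute both digests.
    # SHA-1 collisions have been demonstrated in practice; SHA-256 has no
    # known collisions, which makes it a safer content identifier.
    import hashlib

    def content_digests(path: str) -> dict:
        sha1 = hashlib.sha1()
        sha256 = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):  # 1 MiB chunks
                sha1.update(chunk)
                sha256.update(chunk)
        return {"sha1": sha1.hexdigest(), "sha256": sha256.hexdigest()}

    print(content_digests("Example.jpg"))  # hypothetical local file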

Because of that, we (the Data Persistence team) decided to solve the backups first; it will then be possible to use the backup metadata to generate dumps in the future, reusing much of the work already done. My team is not in charge of xmldumps, so maybe a workmate will be able to update you more accurately on the priority of that, but I really think the work I've done will speed up dump production by a lot, e.g. dumps could (maybe?) be generated more easily from the backup data.

So I can announce that the first full (non-public) offline backup of Commons in the eqiad datacenter finished in September (it took around 20 days to run), and a second offline, remote copy is happening right now in the codfw datacenter and will likely finish before the end of this year. You can see the hosts containing the backup here: [4] [5] [6] [7]. These hosts are not connected to the wikis/Internet, so if a vulnerability caused data loss on Swift, we would be able to recover from the backups.

For privacy and latency (fast recovery) reasons, those copies are hosted within WMF infrastructure (but geographically separated from each other); an extra offsite copy, not hosted on WMF hardware, is also planned for the near future. More work will be needed on fast-recovery tooling, as well as on incremental/streaming backups. More information about this will be documented on a wiki soon.

Those copies cannot be shared "as is", as they have been optimized for fast recovery to production, not for separation of public and private files (like the rest of our backups).

So, if the question is "what is the main blocker for faster image dumps?", I would say it is the lack of a modern metadata storage model for images [1]: one where each uploaded image has a unique identifier or a modern hashing method (SHA-256) is used. There are also some additional legal and technical considerations around producing regular public image datasets; those are not impossible to solve, but they do require work. I am also personally heavily delayed by the lack of a dedicated Multimedia Team (I am a system administrator/Site Reliability Engineer in charge of data recovery, not a MediaWiki developer) that could handle all the bugs [2] and corruption [3] I find along the way. It is my understanding that, at the moment, there is no MediaWiki developer in charge of maintaining the file management code.
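
For the sake of concreteness, here is a minimal sketch of the kind of per-file record such a model could expose (field names are hypothetical and this is not a schema proposal, just what I mean by "unique identifier plus modern hash"):

    # Hypothetical per-file metadata record; not the actual MediaWiki schema.
    import uuid
    from dataclasses import dataclass

    @dataclass
    class FileRecord:
        file_id: uuid.UUID  # stable unique identifier, survives renames
        title: str          # current page title, can change over time
        sha256: str         # modern content hash instead of the SHA-1 column
        size: int           # bytes, useful for planning backup/dump bundles
        is_public: bool     # needed to separate public dumps from private copies

    record = FileRecord(
        file_id=uuid.uuid4(),
        title="Example.jpg",
        # placeholder value: SHA-256 of an empty file, hence size 0
        sha256="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        size=0,
        is_public=True,
    )
    print(record)

With a stable file_id and a collision-resistant hash like that, dump tooling could (in principle) just diff lists of (file_id, sha256) pairs between runs to update already published files incrementally.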

[0] <https://phabricator.wikimedia.org/T262668>
[1] <https://phabricator.wikimedia.org/T28741>
[2] <https://phabricator.wikimedia.org/T290462#7405740>
[3] <https://phabricator.wikimedia.org/T289996>
[4] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1638230666524&to=now&var-server=backup2004&var-datasource=thanos>
[5] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1638230666524&to=now&var-server=backup2005&var-datasource=thanos>
[6] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1638230666524&to=now&var-server=backup2006&var-datasource=thanos>
[7] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1638230666524&to=now&var-server=backup2007&var-datasource=thanos>

--
Jaime Crespo
<http://wikimedia.org>