Hi, Samuel,
Sorry for not communicating earlier; all my work happened in the open [0], but
I didn't want to make any public announcements until there was a 100%
completed run! :-)
I have the feeling the bulk of Commons media (~300 TB in all) is not
mirrored anywhere right now. I saw something related mentioned on
Phabricator within the last year, but I can't find it now.
So this was/is the state of multimedia storage at the moment:
* There are 3 copies of each file on the live OpenStack Swift cluster in
WMF's eqiad datacenter in Virginia
* There is an almost-real time replication of eqiad's multimedia cluster
into the codfw datacenter in Texas, with its own 3 separate copies
* Images can be, and regularly are, served from both datacenters, protecting
against local disasters like floods or earthquakes
That has been like that for a few years already, the following is new! :-)
I (with the assistance of many other WMF engineers) started working on an
offline/offsite backup solution for all multimedia files at the end of
2020 - one that would protect against application bugs, operator mistakes, or
potential ill-intentioned unauthorized users. The system required a
completely different backup workflow than that of our regular backups
(wikitext or otherwise), due to the nature and size of multimedia files (a
large append-only store). We were also hit with long hardware delays due to
supply shortages for a while.
*I advocated at first for solving multimedia backups and dumps at the same
time, but this was not possible* - because of how wiki file permissions are
currently handled in the MediaWiki software, it is not just a question of
"creating bundles of images". MediaWiki image storage lacks basic
features like a unique identifier for each uploaded file, and still uses
SHA-1 hashing, which is known to be vulnerable to collisions. This doesn't
impact full backups, which are just "copying everything" privately (although
I had to reimplement some of that functionality myself), but it doesn't make
it easy to identify individual files, e.g. to update the status of already
publicly available files.
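As an illustration of the unique-identifier point above, here is a minimal
sketch (my own example, not WMF or MediaWiki code) of deriving a stable,
content-based identifier from a file's SHA-256 digest, which - unlike SHA-1 -
has no known practical collisions:

```python
import hashlib


def file_identifier(data: bytes) -> str:
    """Derive a stable identifier from file content.

    SHA-256 is collision-resistant in practice, unlike SHA-1, so two
    distinct files will not end up sharing the same identifier.
    """
    return hashlib.sha256(data).hexdigest()


# Two different payloads get different identifiers:
a = file_identifier(b"example image bytes")
b = file_identifier(b"different image bytes")
assert a != b
assert len(a) == 64  # 256 bits, hex-encoded
```

With an identifier like this stored per upload, individual files could be
addressed (and their public/private status updated) without relying on
titles or SHA-1 hashes.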
Because of that, we (the Data Persistence team) decided to solve the backups
first; it will then be possible to use the backup metadata to generate
dumps in the future (reusing much of the work already done). My team is not
in charge of xmldumps, so maybe a workmate will be able to update you more
accurately on the priority of that - but I really think the work I've done
will speed up dump production by a lot; e.g. dumps could (maybe?) be
generated more easily from the backup data.
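To sketch what "generating dumps from the backup metadata" could look like
(the record fields here are hypothetical simplifications of my own, not the
actual backup schema), a dump job would simply filter the metadata for
public files and emit a manifest:

```python
from dataclasses import dataclass


@dataclass
class BackupRecord:
    # Hypothetical, simplified metadata record; field names are
    # illustrative, not the real backup schema.
    title: str
    sha256: str
    is_public: bool


def dump_manifest(records):
    """Select only public files from backup metadata for a dump bundle."""
    return [(r.title, r.sha256) for r in records if r.is_public]


records = [
    BackupRecord("A.jpg", "aa" * 32, True),
    BackupRecord("B.jpg", "bb" * 32, False),  # private: excluded from dumps
]
assert dump_manifest(records) == [("A.jpg", "aa" * 32)]
```

The hard part, as noted above, is that the current MediaWiki storage model
does not reliably provide the per-file identity and public/private
separation this filtering step assumes.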
So I can announce that *the first full (non-public) offline backup of
Commons in the eqiad datacenter finished in September* (it took around 20
days to run), and *a second offline and remote copy is happening right now
in the codfw datacenter*, which will likely finish before the end of this
year. You can see the hosts containing the backup here: [4] [5] [6] [7].
These hosts are not connected to the wikis/Internet, so if a vulnerability
caused data loss on Swift, we would still be able to recover from the backups.
For privacy and latency (fast recovery) reasons, those copies are hosted
within WMF infrastructure (but geographically separated from each other),
but *an extra offsite copy, not hosted on WMF hardware, is also planned for
the near future*. More work will be needed on fast-recovery tooling, as
well as on incremental/streaming backups. More information about this will
be documented on a wiki soon.
Those copies cannot be shared as-is, as they have been optimized for fast
recovery to production, not for separation of public and private files
(like the rest of our backups).
So if the question is, "what is the main blocker for faster image dumps?",
I would say it is the lack of a modern metadata storage model for images
[1]: one where there is a unique identifier for each uploaded image, or
where a modern hashing method (SHA-256) is used. There are also some
additional legal and technical considerations to making regular public
image datasets; those are not impossible to solve, but they require some
work. I am also personally heavily delayed by the lack of a dedicated
Multimedia team (I am a system administrator/Site Reliability Engineer in
charge of data recovery, not a MediaWiki developer) that could support all
the bugs [2] and corruption [3] I find along the way. It is my
understanding that, at the moment, no MediaWiki developer is in charge of
file management code maintenance.
[0] <https://phabricator.wikimedia.org/T262668>
[1] <https://phabricator.wikimedia.org/T28741>
[2] <https://phabricator.wikimedia.org/T290462#7405740>
[3] <https://phabricator.wikimedia.org/T289996>
[4] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&or…>
[5] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&or…>
[6] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&or…>
[7] <https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&or…>
--
Jaime Crespo
<http://wikimedia.org>