Dear Wikimedia cloud support,
What storage options does the Wikimedia cloud have? Can external developers
(i.e. people not employed by the Wikimedia Foundation) write to Cinder
and/or Swift? Either from Toolforge or from Cloud VPS?
See below for context. (Actually, is this the right list, or should I ask
elsewhere?)
For Wikidata QRank [https://qrank.toolforge.org/], I run a cronjob on the
Toolforge Kubernetes cluster. The cronjob mainly works on Wikidata dumps
and anonymized Wikimedia access logs, which it reads from the NFS-mounted
/public/dumps/public directory. Currently, the job produces 40 internal
files with a total size of 21G; these files need to be preserved between
individual cronjob runs. (In a forthcoming version of the cronjob, this
will grow to ~200 files with a total size of ~40G.) For storing these
intermediate files, Cinder might be a good solution. However, afaik Cinder
isn’t available on Toolforge. Therefore, I’m currently storing the
intermediate files in the account’s home directory on NFS. I suspect
(without having measured it, but having seen NFS crumble elsewhere) that
Wikimedia’s NFS server is easily overloaded; in any case, the
server seems to protect itself by throttling access. Because of the
throttling, the cronjob is slow when working with its intermediate files.
* Will Cinder be made available to Toolforge users? When?
* Or should I move from Toolforge to Cloud VPS, so I can store my
intermediate files on Cinder?
* Or should I store my intermediate files in some object storage? Swift?
Ceph? Something else?
* Is access to Cinder and Swift subject to the same throttling as NFS? Or
will moving away from NFS increase the available I/O throughput?
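To compare the throttled NFS home directory against any alternative, a rough sequential-throughput measurement could look like the sketch below. It is only an illustration: the function name and the default sizes are made up for this example, and the read figure may be optimistic because the operating system can serve the read from its page cache.

```python
import os
import tempfile
import time


def sequential_throughput(directory, size_mb=64, block_kb=1024):
    """Write, then read back, a scratch file in `directory`.

    Returns (write_mb_per_s, read_mb_per_s). Hypothetical helper for a
    rough comparison between storage backends, not a real benchmark.
    """
    block_size = block_kb * 1024
    block = os.urandom(block_size)
    blocks = max(1, (size_mb * 1024) // block_kb)
    total_mb = blocks * block_size / (1024 * 1024)

    fd, path = tempfile.mkstemp(dir=directory, prefix="throughput-")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the local cache
        write_secs = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_size):
                pass  # sequential read; may be served from the page cache
        read_secs = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # never leave the scratch file behind

    return total_mb / write_secs, total_mb / read_secs


# Example (paths are assumptions):
#   sequential_throughput(os.path.expanduser("~"))   # NFS home directory
#   sequential_throughput("/tmp")                    # whatever backs /tmp
```

Running it once against the home directory and once against a candidate replacement would at least show whether the slowdown comes from the storage backend.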
The final output of the QRank system is a single file, currently ~100M in
size but eventually growing to ~1G. When the cronjob has computed a fresh
version of its output, it deletes any old outputs from previous runs (with
the exception of the two most recent ones, which are kept around
internally for debugging). Typical users are other bots or external
pipelines that need a signal for prioritizing Wikidata entities, not end
users on the web. Users typically check for updates with HTTP HEAD, or with
conditional HTTP GET requests (using the standard If-Modified-Since and
If-None-Match headers). Currently, I’m serving the output file with a
custom-written HTTP server that runs as a web service on
Toolforge behind Toolforge’s nginx instance. My server reads its content
from the NFS-mounted home directory that’s getting populated by the
cronjob. Now, it’s not exactly a great idea to serve large data files from
NFS, but afaik it’s the only option available in the Wikimedia cloud, at
least for Toolforge users. Of course I might be wrong.
* Should I move from Toolforge to Cloud VPS, so I can serve my final output
files from Cinder instead of NFS?
* Or should I rather store my final output files in some object storage?
Swift? Ceph? Something else?
* Or is NFS just fine, even if the size of my data grows from 100M to 1G+?
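For context on what any replacement for the file server would have to support, here is a minimal sketch of the conditional-request check behind If-None-Match and If-Modified-Since. It is not the actual QRank server code; the function name, the dict of lower-cased header names, and the simple comma split of the ETag list are assumptions made for illustration.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop a leading W/ so that ETags are compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True if the server may answer 304 Not Modified.

    request_headers: dict with lower-cased header names (assumption).
    etag: current entity tag of the file, quotes included, e.g. '"v2"'.
    last_modified: timezone-aware datetime of the file's last change.
    """
    if_none_match = request_headers.get("if-none-match")
    if if_none_match is not None:
        # When If-None-Match is present, If-Modified-Since is ignored.
        candidates = [t.strip() for t in if_none_match.split(",")]
        if "*" in candidates:
            return True
        return _strip_weak(etag) in {_strip_weak(t) for t in candidates}

    if_modified_since = request_headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: treat the header as absent
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return last_modified.replace(microsecond=0) <= since

    return False
```

An object store or a plain static file server in front of the output would need to answer these headers the same way, since that is how the bots and pipelines poll for updates.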
The cronjob also uses ~5G of temporary files in /tmp, which it deletes
towards the end of each run. The temp files are used for external sorting,
so all access is sequential. I’m not sure where these temporary files
currently sit when running on Toolforge Kubernetes. Given their volume, I
presume that the tmpfs of the Kubernetes nodes will eventually run out of
memory and then fall back to disk, but I wouldn’t know how to find this
out. _If_ the backing store disk for tmpfs eventually ends up being mounted
on NFS, it sounds wasteful for the poor NFS server, especially since the
files get deleted at job completion. In that case, I’d love to save common
resources by using a local disk. (It doesn’t have to be an SSD; a spinning
hard drive would be fine, given the sequential access pattern). But I’m not
sure how to set this up on Toolforge Kubernetes, and I couldn’t find docs
on wikitech. Actually, this might be a micro-optimization, so perhaps not
worth the trouble. But then, I’d like to be nice with the precious shared
resources in the Wikimedia cloud.
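One way to find out where /tmp actually lives might be to look it up in /proc/mounts from inside the job's container. The sketch below assumes a Linux container in which /proc/mounts is readable; the function name is hypothetical. A filesystem type of tmpfs would mean the files are memory-backed, overlay or ext4 would point to the node's local disk, and nfs or nfs4 would mean they do end up on the NFS server.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    mounts_text is the content of /proc/mounts. The longest matching
    mount point wins; on a tie, the later entry wins because it was
    mounted on top. Returns None if nothing matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        # /proc/mounts escapes spaces in paths as \040
        mount_point = fields[1].replace("\\040", " ")
        fs_type = fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Example, to be run inside the cronjob's container:
#   with open("/proc/mounts") as f:
#       print(find_mount(os.path.realpath("/tmp"), f.read()))
```

Logging the result once at the start of a run would settle the question without needing access to the Kubernetes node itself.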
Sorry that I couldn’t find the answers online. While searching, I came
across the following pointers:
– https://wikitech.wikimedia.org/wiki/Ceph: This page has a warning that
it’s probably “no longer true”. If the warning is correct, perhaps the page
could be deleted entirely? Or maybe it could link to the current docs?
– https://wikitech.wikimedia.org/wiki/Swift: This sounds perfect, but the
page doesn’t mention how the files are getting populated, how the ACLs are
managed, and if Wikimedia’s Swift cluster is even accessible to external
developers.
– https://wikitech.wikimedia.org/wiki/Media_storage: This seems current (I
guess?), but the page doesn’t mention if/how external Toolforge/Cloud VPS
users may upload objects, or if this is just for the current users.
Thanks for your help, and happy holidays,
— Sascha, sascha(a)brawer.ch