Hi!
On 12/22/21 18:29, Sascha Brawer wrote:
What storage options does the Wikimedia cloud have? Can external developers (i.e. people not employed by the Wikimedia foundation) write to Cinder and/or Swift? Either from Toolforge or from Cloud VPS?
I've left more detailed replies inline. tl;dr: Currently Toolforge doesn't really have any other options than NFS. Cloud VPS additionally gives you the option to use Cinder (extra volumes you can attach to a VM and move from a VM to another).
See below for context. (Actually, is this the right list, or should I ask elsewhere?)
For Wikidata QRank (https://qrank.toolforge.org/), I run a cronjob on the Toolforge Kubernetes cluster. The cronjob mainly works on Wikidata dumps and anonymized Wikimedia access logs, which it reads from the NFS-mounted /public/dumps/public directory. Currently, the job produces 40 internal files with a total size of 21G; these files need to be preserved between individual cronjob runs. (In a forthcoming version of the cronjob, this will grow to ~200 files with a total size of ~40G.) For storing these intermediate files, Cinder might be a good solution. However, afaik Cinder isn’t available on Toolforge. Therefore, I’m currently storing the intermediate files in the account’s home directory on NFS. Presumably (I’m not sure, just speculating because I’ve seen NFS crumbling elsewhere) Wikimedia’s NFS server is easily overloaded; in any case, it seems to protect itself by throttling access. Because of the throttling, the cronjob is slow when working with its intermediate files.
- Will Cinder be made available to Toolforge users? When?
We're interested in it, but no-one has time or interest to work on making it a reality yet. This is tracked on Phabricator: https://phabricator.wikimedia.org/T275555.
As a reminder: if anyone is interested in working on this or other parts of the WMCS infrastructure, please talk to us!
- Or should I move from Toolforge to Cloud-VPS, so I can store my
intermediate files on Cinder?
~40G is in the range where Cinder/Cloud VPS might indeed be a better solution than NFS. While we don't currently have any official numbers on what is acceptable on NFS and what's not, for context the Toolforge project NFS cluster currently has about 8T of storage for about 3,000 tools.
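To put those numbers in proportion, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes "8T" means 8 TiB and "G" means GiB, and it ignores that real usage is far from evenly spread across tools.

```python
# Figures quoted in the reply above; the unit interpretation is an assumption.
TOTAL_NFS_GIB = 8 * 1024     # "about 8T", read here as 8 TiB
TOOL_COUNT = 3000            # "about 3,000 tools"
QRANK_PLANNED_GIB = 40       # "~40G" of intermediate files in the next version

# Average space per tool if storage were split evenly (it is not).
average_per_tool_gib = TOTAL_NFS_GIB / TOOL_COUNT

# How the planned QRank footprint compares with that average and with the whole cluster.
multiple_of_average = QRANK_PLANNED_GIB / average_per_tool_gib
share_of_cluster = QRANK_PLANNED_GIB / TOTAL_NFS_GIB

print(f"average per tool:    {average_per_tool_gib:.2f} GiB")
print(f"QRank vs. average:   {multiple_of_average:.1f}x")
print(f"QRank share of NFS:  {share_of_cluster:.2%}")
```

Under these assumptions the even split comes to a bit under 3G per tool, so ~40G is well over ten times an average tool's share.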
- Or should I store my intermediate files in some object storage? Swift?
Ceph? Something else?
WMCS currently doesn't offer direct access to any object storage service. This is something we're likely to work on in the mid-term (next 6-12 months is the last estimate I've heard). This project is currently stalled on some network design work: https://phabricator.wikimedia.org/T289882.
- Is access to Cinder and Swift subject to the same throttling as
NFS? Or will moving away from NFS increase the available I/O throughput?
No. NFS is subject to completely separate throttling, and Ceph-backed storage methods (local VM disks and Cinder volumes) have a much higher amount of bandwidth available.
The final output of the QRank system is a single file, currently ~100M in size but eventually growing to ~1G. When the cronjob has computed a fresh version of its output, it deletes any old outputs from previous runs (with the exception of the two most recent previous versions, which are kept around internally for debugging). Typical users are other bots or external pipelines that need a signal for prioritizing Wikidata entities, not end users on the web. Users typically check for updates with HTTP HEAD, or with conditional HTTP GET requests (using the standard If-Modified-Since and If-None-Match headers). Currently, I’m serving the output file with a custom-written HTTP server that runs as a web service on Toolforge behind Toolforge’s nginx instance. My server reads its content from the NFS-mounted home directory that’s getting populated by the cronjob. Now, it’s not exactly a great idea to serve large data files from NFS, but afaik it’s the only option available in the Wikimedia cloud, at least for Toolforge users. Of course I might be wrong.
- Should I move from Toolforge to Cloud-VPS, so I can serve my final
output files from Cinder instead of NFS?
- Or should I rather store my final output files in some object storage?
Swift? Ceph? Something else?
- Or is NFS just fine, even if the size of my data grows from 100M to 1G+?
When we offer object storage, yes, storing your files in it is a good idea. I think you should be fine with NFS for now (please don't quote me on that). Cloud VPS is an option too if you prefer it.
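Whichever storage ends up backing the output file, the conditional requests mentioned above (If-None-Match and If-Modified-Since) mean most update checks never need to read the large file at all. Below is a minimal sketch of the server-side decision, following the precedence rules of RFC 7232. It is not QRank's actual server code: the function name `is_not_modified`, the plain-dict header representation, and the simplified comma splitting of ETag lists are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so that W/"x" compares equal to "x"."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True if a GET/HEAD request may be answered with 304 Not Modified.

    request_headers: dict with canonical header names (hypothetical representation).
    etag:            current entity tag of the file, including the double quotes.
    last_modified:   timezone-aware datetime of the file's last modification.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present, If-Modified-Since is ignored.
        # Simplification: assumes entity tags contain no commas.
        candidates = [t.strip() for t in if_none_match.split(",")]
        if "*" in candidates:
            return True
        return _strip_weak(etag) in {_strip_weak(t) for t in candidates}

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # An unparseable date is ignored: send the full response.
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds first.
        return last_modified.replace(microsecond=0) <= since

    return False
```

A server would call this with the ETag and modification time it has cached for the current output file, and only open the file itself when the function returns False.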
The cronjob also uses ~5G of temporary files in /tmp, which it deletes towards the end of each run. The temp files are used for external sorting, so all access is sequential. I’m not sure where these temporary files currently sit when running on Toolforge Kubernetes. Given their volume, I presume that the tmpfs of the Kubernetes nodes will eventually run out of memory and then fall back to disk, but I wouldn’t know how to find this out. _If_ the backing store disk for tmpfs eventually ends up being mounted on NFS, it sounds wasteful for the poor NFS server, especially since the files get deleted at job completion. In that case, I’d love to save common resources by using a local disk. (It doesn’t have to be an SSD; a spinning hard drive would be fine, given the sequential access pattern.) But I’m not sure how to set this up on Toolforge Kubernetes, and I couldn’t find docs on wikitech. Actually, this might be a micro-optimization, so perhaps not worth the trouble. But then, I’d like to be nice with the precious shared resources in the Wikimedia cloud.
Good question. I'm not sure either whether tmpfs for Kubernetes containers is on Ceph (SSDs) or in RAM. At least it's not on NFS.
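One way to find out from inside a running job is to look up which mount covers /tmp in /proc/mounts. The following is a rough sketch, assuming a Linux container where /proc/mounts is readable; the helper name `find_mount` is made up for this example, and I haven't verified what it reports on Toolforge Kubernetes. A filesystem type of tmpfs would mean RAM-backed, nfs or nfs4 would mean NFS, and something like overlay or ext4 would point to the node's disk.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching mount
    point wins; on ties the later line wins, since later mounts shadow
    earlier ones. Mount points with spaces (escaped as \\040) are not decoded.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # Skip blank or malformed lines.
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Match on whole path components so that /tmpfoo is not under /tmp.
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        print(find_mount("/tmp", f.read()))
```

Running this once as part of the cronjob and checking the job's log output should be enough to settle where the temporary sort files actually live.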
Sorry that I couldn’t find the answers online. While searching, I came across the following pointers:
- https://wikitech.wikimedia.org/wiki/Ceph: This page has a warning that it’s probably “no longer true”. If the warning is correct, perhaps the page could be deleted entirely? Or maybe it could link to the current docs?
- https://wikitech.wikimedia.org/wiki/Swift: This sounds perfect, but the page doesn’t mention how the files are getting populated, how the ACLs are managed, and if Wikimedia’s Swift cluster is even accessible to external developers.
- https://wikitech.wikimedia.org/wiki/Media_storage: This seems current (I guess?), but the page doesn’t mention if/how external Toolforge/Cloud-VPS users may upload objects, or if this is just for the current users.
Those pages document the media storage systems used to store uploads for the production MediaWiki projects (Wikipedia and friends). Those are not accessible from WMCS and should be treated as completely separate systems, and any future WMCS (object) storage services will not use them.
Documentation about the Ceph cluster powering Cloud VPS is on a separate Wikitech page: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph.