I'm exploring various ways of working with the XML data dumps on /public/dumps/public/enwiki. I've got a process which runs through all of the enwiki-20210301-pages-articles*.xml* files in about 6 hours. If I've done the math right, that's just about 18 GB of data, or 3 GB/h, or roughly 0.85 MB/s that I'm slurping off NFS.
If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I could process about 7 MB/s (roughly 55 Mb/s) in aggregate. Is that realistic? Or am I just going to beat the hell out of the poor NFS server, or saturate some backbone network link, or hit some other rate-limiting bottleneck long before I run out of CPU? Hitting a bottleneck doesn't bother me so much; what I don't want is to trash a shared resource by doing something stupid to it.
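For anyone who wants to double-check the back-of-envelope numbers, here's the arithmetic in executable form (the 18 GB and 6 h figures are from my run; the node count is just the proposed fleet size):

```python
# Back-of-envelope throughput check for the dump-processing job.
# TOTAL_GB and HOURS come from one observed pass over the dump parts;
# NODES is the hypothetical number of parallel VPS workers.

TOTAL_GB = 18          # compressed dump parts read off NFS
HOURS = 6              # wall-clock time for one full pass
NODES = 8              # proposed parallel jobs, one per node

per_node_mb_s = TOTAL_GB * 1024 / (HOURS * 3600)   # MB/s for a single job
aggregate_mb_s = per_node_mb_s * NODES             # MB/s for the whole fleet
aggregate_mbit_s = aggregate_mb_s * 8              # Mb/s on the wire

print(f"per node:  {per_node_mb_s:.2f} MB/s")
print(f"aggregate: {aggregate_mb_s:.1f} MB/s ({aggregate_mbit_s:.0f} Mb/s)")
```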
Putting it another way, would trying this be a bad idea?
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org
Engineering Manager, Cloud Services