I'm exploring various ways of working with the XML data dumps on
/public/dumps/public/enwiki. I've got a process that runs through all of the
enwiki-20210301-pages-articles*.xml* files in about 6 hours. If I've done
the math right, that's just about 18 GB of data, or 3 GB/h, or roughly 0.8 MB/s
that I'm slurping off NFS.
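For the record, here's the arithmetic as a quick sanity check (decimal units; the 18 GB figure is just my estimate of the total compressed dump size):

```python
# Sanity-check the single-job throughput: ~18 GB read in a 6-hour run.
total_bytes = 18e9        # ~18 GB of compressed dump files (my estimate)
elapsed_s = 6 * 3600      # 6-hour wall-clock run

gb_per_hour = total_bytes / 1e9 / 6
mb_per_s = total_bytes / 1e6 / elapsed_s

print(f"{gb_per_hour:.1f} GB/h, {mb_per_s:.2f} MB/s")  # 3.0 GB/h, 0.83 MB/s
```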
If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I could pull
about 6.7 MB/s (roughly 53 Mb/s) in aggregate. Is that realistic? Or am I just going to beat the hell out of the poor
NFS server, or peg some backbone network link, or hit some other rate limiting bottleneck
long before I run out of CPU? Hitting a bottleneck doesn't bother me so much as not
wanting to trash a shared resource by doing something stupid to it.
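Assuming per-job throughput holds at the ~0.83 MB/s I'm seeing now, and that things scale linearly (a big assumption, which is really what I'm asking about), the aggregate sketches out as:

```python
# Naive linear-scaling estimate for 8 parallel jobs, each reading
# at the single-job rate of ~18 GB per 6 hours.
single_job_mb_s = 18e9 / 1e6 / (6 * 3600)  # ~0.83 MB/s per job
jobs = 8

aggregate_mb_s = single_job_mb_s * jobs    # aggregate megabytes/s off NFS
aggregate_mbit_s = aggregate_mb_s * 8      # same figure in megabits/s

print(f"{aggregate_mb_s:.1f} MB/s = {aggregate_mbit_s:.0f} Mb/s")  # 6.7 MB/s = 53 Mb/s
```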
Putting it another way, would trying this be a bad idea?