I'm exploring various ways of working with the XML data dumps on /public/dumps/public/enwiki. I've got a process which runs through all of the enwiki-20210301-pages-articles[123456789]*.xml* files in about 6 hours. If I've done the math right, that's just about 18 GB of data, or 3 GB/h, or a bit under 1 MB/s that I'm slurping off NFS.
If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I could process about 6.7 MB/s (roughly 53 Mb/s). Is that realistic? Or am I just going to beat the hell out of the poor NFS server, or peg some backbone network link, or hit some other rate-limiting bottleneck long before I run out of CPU? Hitting a bottleneck doesn't bother me so much; what I want to avoid is trashing a shared resource by doing something stupid to it.
Putting it another way, would trying this be a bad idea?
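As a sanity check on the arithmetic, here is a minimal sketch. It assumes decimal units (1 GB = 1000 MB) and perfectly linear scaling across jobs, which is the optimistic case. On those assumptions, 18 GB over 6 hours works out to a bit under 1 MB/s per job, so 8 parallel jobs would read on the order of 7 MB/s (about 53 Mb/s) in aggregate.

```python
def throughput_mb_per_s(total_gb: float, hours: float) -> float:
    """Average read rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return total_gb * 1000 / (hours * 3600)


# Figures from the message: about 18 GB read in about 6 hours by one job.
per_job_mb = throughput_mb_per_s(18, 6)

# Hypothetical scale-out: 8 jobs, assuming throughput scales linearly.
jobs = 8
aggregate_mb = per_job_mb * jobs
aggregate_mbit = aggregate_mb * 8  # 8 bits per byte

print(f"per job:   {per_job_mb:.2f} MB/s")
print(f"aggregate: {aggregate_mb:.2f} MB/s ({aggregate_mbit:.1f} Mb/s)")
```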
Parallelizing the processing isn't a bad idea, though in this case I think I concur with your thoughts about hammering the NFS server. Are you making a copy of the data to process? If your processing threads utilize a "local" copy of the data, you needn't worry as much about the network. If you just wanted more processing threads, I presume you don't need to spin up more instances to parallelize things. One big instance should be easier than multiple instances, and if needed your project could get access to higher CPU and memory limits.
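The "local copy, one big instance" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the helper names (`copy_to_local`, `count_pages`, `process_all`), the worker count, and the placeholder work of counting `<page>` lines are all hypothetical stand-ins for whatever the real job does. Each file is read from the shared mount exactly once, and all the parallel reads then hit local disk.

```python
import bz2
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_to_local(src_dir: Path, local_dir: Path, pattern: str) -> list[Path]:
    """Copy matching dump files from the shared mount to local disk, one at a time."""
    local_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        dst = local_dir / src.name
        if not dst.exists():  # skip files already copied on an earlier run
            shutil.copy(src, dst)
        copied.append(dst)
    return copied


def count_pages(path: Path) -> int:
    """Placeholder for the real processing: count lines containing <page>."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if "<page>" in line)


def process_all(files: list[Path], workers: int = 4) -> dict[str, int]:
    """Process the local copies concurrently and return a per-file result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(count_pages, files)
        return {p.name: n for p, n in zip(files, counts)}
```

To use it, call `copy_to_local` with the dumps directory, a local scratch directory of your choosing, and the same `enwiki-20210301-pages-articles[123456789]*.xml*` pattern, then pass the returned list to `process_all`. Threads keep the sketch simple. If the real per-file work is CPU-heavy Python, a process pool would probably be the better fit. Note also that the existence check does not detect a partially copied file from an interrupted run.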
On Sun, Apr 18, 2021 at 9:07 PM Roy Smith roy@panix.com wrote:
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud