Parallelizing the processing isn't a bad idea, though in this case I think
I concur with your thoughts about hammering the NFS server. Are you making
a copy of the data to process? If your processing threads use a "local"
copy of the data, you needn't worry as much about the network. If you
just want more processing threads, you don't necessarily need to spin up
more instances to parallelize things. One big instance should be easier
to manage than multiple instances, and if needed your project could
request higher CPU and memory quotas.
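If it helps, here's a rough sketch of what I mean by working from a local
copy on one big instance. The paths, the file glob, the worker count, and
the process() body are all placeholders rather than anything specific to
your pipeline -- copy each file off NFS once, then let a process pool chew
on the local copies:

import bz2
import shutil
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

NFS_DIR = Path("/public/dumps/public/enwiki/20210301")      # placeholder
LOCAL_DIR = Path("/srv/dumps")  # instance-local scratch; needs ~20 GB free
PATTERN = "enwiki-20210301-pages-articles*.xml*.bz2"
WORKERS = 8                     # roughly match the instance's CPU count

def fetch(nfs_path):
    """Copy one dump file from NFS to local disk; skip if already copied."""
    local_path = LOCAL_DIR / nfs_path.name
    if not local_path.exists():
        shutil.copy(nfs_path, local_path)
    return local_path

def process(local_path):
    """Stand-in for the real per-file work: count lines in the bz2 stream."""
    with bz2.open(local_path, "rt", encoding="utf-8") as stream:
        return sum(1 for _ in stream)

def main():
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    # Copy sequentially, so NFS only ever sees one reader from this instance...
    local_files = [fetch(p) for p in sorted(NFS_DIR.glob(PATTERN))]
    # ...then do the CPU-heavy part in parallel against the local copies.
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        for path, lines in zip(local_files, pool.map(process, local_files)):
            print(path.name, lines)

if __name__ == "__main__":
    main()

The copy step is still a sequential read off NFS, but it happens exactly
once per file, and the CPU-heavy part never touches the network after that.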
On Sun, Apr 18, 2021 at 9:07 PM Roy Smith <roy(a)panix.com> wrote:
I'm exploring various ways of working with the XML data dumps on
/public/dumps/public/enwiki. I've got a process which runs through all of
the enwiki-20210301-pages-articles*.xml* files in about 6 hours. If I've
done the math right, that's just about 18 GB of data, or 3 GB/h, or
roughly 0.8 MB/s that I'm slurping off NFS.
If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I'd
be reading 6 or 7 MB/s off NFS (call it 50-55 Mb/s). Is that realistic? Or am I just going
to beat the hell out of the poor NFS server, or peg some backbone network
link, or hit some other rate limiting bottleneck long before I run out of
CPU? Hitting a bottleneck doesn't bother me so much as not wanting to
trash a shared resource by doing something stupid to it.
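Spelling out my back-of-the-envelope arithmetic (decimal units, and the
8-job figure is purely hypothetical):

total_gb = 18        # data read off NFS in one full pass
hours = 6            # wall-clock time for that pass
jobs = 8             # one hypothetical job per VPS node

per_job_mb_s = total_gb * 1000 / (hours * 3600)   # ~0.83 MB/s
aggregate_mb_s = per_job_mb_s * jobs              # ~6.7 MB/s
aggregate_mbit_s = aggregate_mb_s * 8             # ~53 Mb/s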
Putting it another way, would trying this be a bad idea?
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
Engineering Manager, Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>