Dear devs,
I would like to initiate a discussion about how to reduce the time required to generate dump files. A while ago Emmanuel Engelhart opened a bug report suggesting that this feature be parallelized, and I would like to go through the available options and hopefully determine a course of action.
The current process is straightforward and sequential (as far as I know): it reads table by table and row by row and stores the output. This has two drawbacks: generating a dump takes more and more time as the different projects continue to grow, and when the process halts or is interrupted, it has to start all over again.
I believe that there are two approaches to parallelizing the export dump:
1) Launch multiple PHP processes that each take care of a particular range of ids. This might not be true parallelization, but it achieves the same goal. The reason for this approach is that PHP has very limited (maybe no) support for parallelization / multiprocessing. As far as I know, the only thing PHP can do is fork a process (I might be incorrect about this).
2) Use a different language with built-in support for multiprocessing, like Java or Python. I do not intend to start a heated debate, but I think this option should at least be on the table and be discussed. Obviously, an important reason not to do it is that it's a different language. I am not sure how integral the export functionality is to MediaWiki; if it is integral, then this is a dead end.
However, if the export functionality is primarily used by Wikimedia and nobody else, then we might consider a different language. Or we make a standalone app that is not part of MediaWiki and is used only internally by Wikimedia.
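To make the "range of ids" idea from approach 1 a bit more concrete, here is a minimal sketch. It is written in Python only for brevity; the same structure would apply to separately launched PHP processes. All names (split_id_range, dump_range, run_dump), the chunk size, and the worker body are hypothetical and not taken from the existing dump code. A real worker would query the database for its ids and write its own part file; this one only counts ids so the sketch stays runnable.

```python
from multiprocessing import Pool


def split_id_range(first_id, last_id, chunk_size):
    """Split [first_id, last_id] into consecutive inclusive (start, end) ranges."""
    ranges = []
    start = first_id
    while start <= last_id:
        end = min(start + chunk_size - 1, last_id)
        ranges.append((start, end))
        start = end + 1
    return ranges


def dump_range(id_range):
    """Hypothetical worker: handle one id range and return a summary.

    Placeholder for the real export step (read the rows for these ids,
    write them to a part file). Here it only reports how many ids it covers.
    """
    start, end = id_range
    return (start, end, end - start + 1)


def run_dump(first_id, last_id, chunk_size, workers):
    """Process all ranges with a pool of worker processes."""
    ranges = split_id_range(first_id, last_id, chunk_size)
    with Pool(processes=workers) as pool:
        # pool.map returns results in input order, so the resulting
        # parts could be concatenated in id order afterwards.
        return pool.map(dump_range, ranges)
```

Calling run_dump(1, 1000, 250, workers=4) from under an `if __name__ == "__main__":` guard would process four ranges of 250 ids each. Because every range is independent, an interrupted or failed range could in principle be rerun on its own instead of restarting the whole dump.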
If I am missing other approaches or solutions, then please chime in.
Best regards,
Diederik