2010/12/15 Diederik van Liere dvanliere@gmail.com:
However, if the export functionality is primarily used by Wikimedia and nobody else, then we might consider a different language. Or we make a standalone app that is not part of MediaWiki and is only used internally by Wikimedia.
If I am missing other approaches or solutions then please chime in.
I have an idea for a Wikimedia-specific performance hack: use lower-level ways of accessing ES so that each blob is only fetched and decompressed once. This should reduce ES hits by a factor of ~50 AFAIK.
<background for people less familiar with the ES setup>
WMF stores revision text in a system called external storage (ES), which is basically a MySQL database that stores blobs of compressed data. Revisions are not stored in order but are grouped per-page, such that each blob only contains revisions belonging to a specific page, in chronological order (up to N revs per blob or M bytes, whichever comes first; if memory serves, N=50 and M=10MB). Because the contents of consecutive versions of the same page are highly similar, compression performs very well in this case, and this type of storage is very space-efficient.
It's not particularly fast for random access, though, because to retrieve a revision you have to look up its text table entry (in the 'normal' wiki database), which will tell you which blob it's in and what its index in the blob is, then you have to fetch the entire blob, decompress it and find your revision. MediaWiki stores the result of each revision text fetch in memcached, probably for this reason.
Instead, if we just fetched and decompressed the entire blob /and used all of it/, we'd have the text of a number of consecutive revisions of the same page basically for the price of one fetch. This seems ideally suited to the dumps, because they output consecutive revisions of the same page.
</background>
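To make the fetch-count argument concrete, here is a rough Python sketch. It is not MediaWiki code: FakeBlobStore, the text_table dict and both dump functions are made-up stand-ins, and the blob format (a zlib-compressed JSON list) is an assumption chosen only to keep the example self-contained. It compares one fetch per revision with one fetch per blob, for two blobs of 50 revisions each.

```python
import json
import zlib
from itertools import groupby


class FakeBlobStore:
    """Hypothetical stand-in for ES: blob_id -> compressed list of revision texts."""

    def __init__(self):
        self._blobs = {}
        self.fetches = 0  # counts how often a whole blob is fetched and decompressed

    def put(self, blob_id, texts):
        self._blobs[blob_id] = zlib.compress(json.dumps(texts).encode("utf-8"))

    def fetch(self, blob_id):
        self.fetches += 1
        return json.loads(zlib.decompress(self._blobs[blob_id]).decode("utf-8"))


def dump_naive(store, text_table, rev_ids):
    """One blob fetch per revision, like plain random access."""
    out = []
    for rev_id in rev_ids:
        blob_id, index = text_table[rev_id]
        out.append((rev_id, store.fetch(blob_id)[index]))
    return out


def dump_batched(store, text_table, rev_ids):
    """One blob fetch per run of consecutive revisions sharing a blob."""
    out = []
    for blob_id, group in groupby(rev_ids, key=lambda r: text_table[r][0]):
        texts = store.fetch(blob_id)  # fetched and decompressed once
        for rev_id in group:
            out.append((rev_id, texts[text_table[rev_id][1]]))
    return out


def demo():
    store = FakeBlobStore()
    text_table = {}  # rev_id -> (blob_id, index within blob)
    rev_id = 1
    for blob_id in range(2):
        texts = []
        for index in range(50):
            texts.append(f"page text, revision {rev_id}")
            text_table[rev_id] = (blob_id, index)
            rev_id += 1
        store.put(blob_id, texts)

    rev_ids = sorted(text_table)
    naive = dump_naive(store, text_table, rev_ids)
    naive_fetches = store.fetches
    store.fetches = 0
    batched = dump_batched(store, text_table, rev_ids)
    return naive == batched, naive_fetches, store.fetches
```

With these toy numbers, demo() reports the same output from both approaches, with 100 blob fetches for the naive loop and 2 for the batched one. The grouping only pays off when the revision list is ordered so that revisions in the same blob are adjacent, which is the case the dumps are in.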
I'm not sure how hard this would be to achieve: you'd have to correlate blob parts with revisions manually using the text table, and there might be gaps for deleted revs because ES is append-only. I'm also not sure how much it would help: my impression is that ES is one of the slower parts of our system and that reducing the number of ES hits by a factor of 50 should help, but I may be wrong. Maybe someone with more relevant knowledge and experience can comment on that (Tim?).
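For the correlation step, a minimal sketch of what I have in mind, again in Python and again with made-up names: correlate() takes the decompressed parts of one blob plus the (rev_id, index) pairs that the text table gives for that blob, and reports which blob slots are not referenced by any live revision. How the real text table encodes the blob and index is not shown here, so treat the input shape as an assumption.

```python
def correlate(blob_texts, rows):
    """Match the parts of one decompressed blob to revision IDs.

    blob_texts: list of revision texts in blob order (assumed shape).
    rows: iterable of (rev_id, index) pairs taken from the text table
          for this blob (hypothetical input format).
    Returns (texts keyed by rev_id, sorted list of unreferenced indices).
    """
    by_rev = {}
    used = set()
    for rev_id, index in rows:
        if not 0 <= index < len(blob_texts):
            raise IndexError(f"rev {rev_id} points at missing blob part {index}")
        by_rev[rev_id] = blob_texts[index]
        used.add(index)
    # Slots nobody references, e.g. left behind by deleted revisions.
    gaps = [i for i in range(len(blob_texts)) if i not in used]
    return by_rev, gaps
```

For example, with blob parts ["a", "b", "c", "d"] and rows [(10, 0), (12, 2), (13, 3)], this gives texts for revisions 10, 12 and 13 and flags index 1 as a gap, which the dump code would simply skip.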
Roan Kattouw (Catrope)