2010/12/15 Diederik van Liere <dvanliere(a)gmail.com>:
However, if the export functionality is primarily used
by Wikimedia and nobody else then we might consider a different language. Or, we make a
standalone app that is not part of MediaWiki and its use is only internal to
Wikimedia.
If I am missing other approaches or solutions then please chime in.
I have an idea for a Wikimedia-specific performance hack: use
lower-level ways of accessing ES so each blob is only fetched and
decompressed once. This should reduce ES hits by a factor of ~50 AFAIK.
<background for people less familiar with the ES setup>
WMF stores revision text in a system called external storage (ES),
which is basically a MySQL database that stores blobs of compressed
data. Revisions are not stored in order but are grouped per-page, such
that each blob only contains revisions belonging to a specific page,
in chronological order (up to N revs per blob or M bytes, whichever
comes first; if memory serves, N=50 and M=10MB). Because the contents
of consecutive versions of the same page are highly similar,
compression performs very well in this case, and this type of storage
is very space-efficient.
It's not particularly fast for random access, though, because to
retrieve a revision you have to look up its text table entry (in the
'normal' wiki database), which will tell you which blob it's in and
what its index in the blob is, then you have to fetch the entire blob,
decompress it and find your revision. MediaWiki stores the result of
each revision text fetch in memcached, probably for this reason.
Instead, if we just fetched and decompressed the entire blob /and
use all of it/, we'd have the text of a number of consecutive
revisions to the same page basically for the price of one fetch. This
seems to be ideally suited to the dumps, because they output
consecutive revisions to the same page.
</background>
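
To make the difference concrete, here is a toy sketch in Python (not
MediaWiki code). Everything in it is made up for illustration: FakeES,
make_blob, the JSON-in-zlib blob format and the text_table dict only
stand in for the real ES and text table, whose formats differ. It just
counts how many fetch+decompress operations each approach needs.

```python
import json
import zlib
from itertools import groupby


def make_blob(texts):
    # Hypothetical blob format: a zlib-compressed JSON list of revision texts.
    return zlib.compress(json.dumps(texts).encode("utf-8"))


class FakeES:
    """Stand-in for external storage: blob_id -> compressed blob."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.fetches = 0  # number of fetch+decompress operations

    def fetch_blob(self, blob_id):
        self.fetches += 1
        raw = zlib.decompress(self.blobs[blob_id])
        return json.loads(raw.decode("utf-8"))


def dump_naive(es, text_table, rev_ids):
    # Random-access style: fetch the whole blob again for every revision.
    return [es.fetch_blob(text_table[r][0])[text_table[r][1]] for r in rev_ids]


def dump_batched(es, text_table, rev_ids):
    # Fetch each blob once and take every revision we need from it.
    # groupby only merges *adjacent* revisions in the same blob, which
    # matches the dumps emitting consecutive revisions of one page.
    out = []
    for blob_id, group in groupby(rev_ids, key=lambda r: text_table[r][0]):
        texts = es.fetch_blob(blob_id)
        out.extend(texts[text_table[r][1]] for r in group)
    return out


# 100 revisions of one page, 50 per blob (the N=50 guess from above).
texts = [f"revision {i} of the page" for i in range(100)]
blobs = {b: make_blob(texts[b * 50:(b + 1) * 50]) for b in range(2)}
text_table = {i: (i // 50, i % 50) for i in range(100)}  # rev -> (blob, index)
rev_ids = list(range(100))

naive_es = FakeES(blobs)
batched_es = FakeES(blobs)
assert dump_naive(naive_es, text_table, rev_ids) == dump_batched(
    batched_es, text_table, rev_ids
)
print(naive_es.fetches, batched_es.fetches)  # 100 2
```

With full blobs of 50 revisions the ratio is exactly the factor of 50
mentioned above; partially filled blobs would give less.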
I'm not sure how hard this would be to achieve (you'd have to
correlate blob parts with revisions manually using the text table;
there might be gaps for deleted revs because ES is append-only) or how
much it would help (my impression is ES is one of the slower parts of
our system and reducing the number of ES hits by a factor of 50 should
help, but I may be wrong). Maybe someone with more relevant knowledge
and experience can comment on that (Tim?).
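
For the correlation step, something along these lines might work. This
is again a hypothetical Python sketch: correlate() and the
(rev_id, index) row shape are assumptions for illustration, not the
actual text table schema. Blob parts that no live revision points to
are reported as gaps rather than treated as errors.

```python
def correlate(blob_items, text_rows):
    """Match the parts of one decompressed blob to revision IDs.

    blob_items: list of revision texts in blob order.
    text_rows:  iterable of (rev_id, index_in_blob) pairs for this blob,
                as they might be derived from the text table (assumed shape).
    Returns (matched, gaps): a dict rev_id -> text, and the list of blob
    indexes that no row refers to (e.g. parts left by deleted revisions).
    """
    by_index = {idx: rev_id for rev_id, idx in text_rows}
    matched = {
        by_index[i]: item for i, item in enumerate(blob_items) if i in by_index
    }
    gaps = [i for i in range(len(blob_items)) if i not in by_index]
    return matched, gaps


# Four parts in the blob, but only three revisions still reference it.
matched, gaps = correlate(
    ["text a", "text b", "text c", "text d"],
    [(101, 0), (102, 1), (104, 3)],
)
print(matched)  # {101: 'text a', 102: 'text b', 104: 'text d'}
print(gaps)     # [2]
```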
Roan Kattouw (Catrope)