On Dec 19, 2004, at 3:06 PM, Erik Zachte wrote:
> The new setup seems to imply I need to build huge tables which exceed
> physical memory, hence sharp performance penalties (the job already
> runs +/- 24 hrs), or I need to sort and merge these huge files
> several times before real work starts.
I strongly recommend against pulling data directly from the SQL dumps.
As you see already, as we continue to refactor the schema to make it
possible for the wiki to handle our workload efficiently it's going to
get less and less convenient to pull data from a raw dump in a linear
fashion.
Note that we're going to be moving to a new compression format as well,
which will merge multiple old revisions of pages into a single
compressed field in another table to dramatically reduce space
requirements.
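The space win from batching comes from compressing adjacent revisions together, since successive revisions of a page are nearly identical. A minimal sketch of the idea (illustrative only, not the actual MediaWiki storage code; the separator and layout are made up):

```python
import zlib

# Three revisions of a page that differ only slightly.
revisions = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog. It barked.",
    "The quick brown fox leaps over the lazy dog. It barked.",
]

# Old scheme: compress each revision separately.
separate = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)

# Batched scheme: concatenate and compress as one blob. A real
# implementation would also record offsets to slice revisions back out.
SEP = "\x00"
blob = zlib.compress(SEP.join(revisions).encode("utf-8"))
batched = len(blob)

print(separate, batched)  # batched is much smaller

# The trade-off: recovering any one revision means decompressing
# the whole blob.
texts = zlib.decompress(blob).decode("utf-8").split(SEP)
```

This is also why linear scans over a raw dump get harder: a single revision's text is no longer a self-contained field.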
> All of this would not be necessary if a few small fields were
> replicated across tables. Impact on db size would be trivial, on
> page save time zero.
Impact on db size would be trivial, on page save time zero.
Actually, it's a very significant burden -- that's why we're removing
the duplication.
As one example, duplicating namespace and title on every revision means
we have to update *every revision* (possibly many thousands) when pages
are renamed. This can take a significant amount of time, locks up
database resources, and can produce weird conflicts. Every
few weeks somebody renames a heavily-edited page and there's a mad
scramble to clean up after it. We can't sustain the wiki through
another couple years with that kind of problem wide open.
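The rename cost can be shown with a toy schema (table and column names here are made up for illustration, not MediaWiki's actual schema):

```python
import sqlite3

# Denormalized: every revision row duplicates the title, so a rename
# must rewrite every revision. Normalized: revisions reference a page
# row by id, so a rename is a single-row update.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE old_rev (rev_id INTEGER PRIMARY KEY, title TEXT, text TEXT);
    CREATE TABLE new_rev (rev_id INTEGER PRIMARY KEY, page_id INTEGER, text TEXT);
""")
db.execute("INSERT INTO page VALUES (1, 'Old title')")
for i in range(5000):  # a heavily-edited page
    db.execute("INSERT INTO old_rev VALUES (?, 'Old title', 'text')", (i,))
    db.execute("INSERT INTO new_rev VALUES (?, 1, 'text')", (i,))

# Denormalized rename touches every revision row.
rows_denorm = db.execute("UPDATE old_rev SET title = 'New title'").rowcount

# Normalized rename touches exactly one row.
rows_norm = db.execute(
    "UPDATE page SET title = 'New title' WHERE page_id = 1").rowcount

print(rows_denorm, rows_norm)
```

With thousands of revisions held under a lock for the duration of the update, it's easy to see where the conflicts come from.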
One thing I think I will add is a text byte size field on the revision
table; with batched revision compression we can no longer easily get
the size without decompressing the text to see what it looks like.
Generally this size will not change, either, since a given revision's
source text is immutable.
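The idea is to record the uncompressed byte length once, at save time, so size queries never have to touch the compressed blob. A sketch under that assumption (field and function names here are illustrative, not the actual schema):

```python
import zlib

def make_revision_row(source_text: str) -> dict:
    # Measure the raw bytes once, before compression; since revision
    # text is immutable, this value never needs recomputing.
    raw = source_text.encode("utf-8")
    return {
        "rev_len": len(raw),          # hypothetical size column
        "text": zlib.compress(raw),   # what actually gets stored
    }

row = make_revision_row("Some wiki page text.")
print(row["rev_len"])  # size available without decompressing
```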
> Unrelated: will there be a periodic (costly) query to produce
> something similar to the cur dump, which is used by quite a few
> scripts? Downloading all complete db's is not workable.
We ought to have a more usable public dump format than the raw SQL
backups. Something based on Special:Export might be good for the kind
of processing you're doing, for instance. Consider also rewriting the
stats to work from the database.
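An export-style XML dump is also much friendlier to script against than raw SQL. A minimal sketch of consuming one (the XML structure below is a simplified stand-in, not the exact Special:Export schema):

```python
import xml.etree.ElementTree as ET

# Simplified export-like document for illustration.
dump = """
<mediawiki>
  <page>
    <title>Sandbox</title>
    <revision><text>Hello world</text></revision>
  </page>
  <page>
    <title>Main Page</title>
    <revision><text>Welcome</text></revision>
  </page>
</mediawiki>
"""

# Walk page elements and pull out title/text pairs; a stats script
# could count links, sizes, etc. here instead.
pages = {}
for page in ET.fromstring(dump).iter("page"):
    pages[page.findtext("title")] = page.findtext("./revision/text")

print(len(pages))  # 2
```

The point is that the consumer sees a stable, self-describing format instead of whatever the live schema happens to be this month.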
-- brion vibber (brion @ pobox.com)