On Dec 19, 2004, at 3:06 PM, Erik Zachte wrote:
> The new setup seems to imply I need to build huge tables which exceed
> physical memory, hence sharp performance penalties (the job already
> runs +/- 24 hrs), or I need to sort and merge these huge files
> several times before real work starts.
I strongly recommend against pulling data directly from the SQL dumps.
As you see already, as we continue to refactor the schema to make it
possible for the wiki to handle our workload efficiently it's going to
get less and less convenient to pull data from a raw dump in a linear
fashion.
Note that we're going to be moving to a new compression format as well,
which will merge multiple old revisions of pages into a single
compressed field in another table to dramatically reduce space
requirements.
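The space win from batching comes from compressing adjacent revisions together, since successive revisions of a page are nearly identical. A minimal sketch of the idea (illustrative only, not the actual MediaWiki storage code; the separator and layout are made up):

```python
import zlib

# Three revisions of a page that differ only slightly.
revisions = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog. It barked.",
    "The quick brown fox leaps over the lazy dog. It barked.",
]

# Old scheme: compress each revision separately.
separate = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)

# Batched scheme: concatenate and compress as one blob. A real
# implementation would also record offsets to slice revisions back out.
SEP = "\x00"
blob = zlib.compress(SEP.join(revisions).encode("utf-8"))
batched = len(blob)

print(separate, batched)  # batched is much smaller

# The trade-off: recovering any one revision means decompressing
# the whole blob.
texts = zlib.decompress(blob).decode("utf-8").split(SEP)
```

This is also why linear scans over a raw dump get harder: a single revision's text is no longer a self-contained field.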
> All of this would not be necessary if a few small fields were
> replicated across tables. Impact on db size would be trivial, on
> page save time zero.
Impact on db size would be trivial, on page save time zero.
Actually, it's a very significant burden -- that's why we're removing
the duplication.
As one example, duplicating namespace and title on every revision means
we have to update *every revision* (possibly many thousands) when pages
are renamed. This can take a significant amount of time, locks up
database resources, and can produce weird conflicts. Every
few weeks somebody renames a heavily-edited page and there's a mad
scramble to clean up after it. We can't sustain the wiki through
another couple years with that kind of problem wide open.
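The rename cost can be shown with a toy schema (table and column names here are made up for illustration, not MediaWiki's actual schema):

```python
import sqlite3

# Denormalized: every revision row duplicates the title, so a rename
# must rewrite every revision. Normalized: revisions reference a page
# row by id, so a rename is a single-row update.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE old_rev (rev_id INTEGER PRIMARY KEY, title TEXT, text TEXT);
    CREATE TABLE new_rev (rev_id INTEGER PRIMARY KEY, page_id INTEGER, text TEXT);
""")
db.execute("INSERT INTO page VALUES (1, 'Old title')")
for i in range(5000):  # a heavily-edited page
    db.execute("INSERT INTO old_rev VALUES (?, 'Old title', 'text')", (i,))
    db.execute("INSERT INTO new_rev VALUES (?, 1, 'text')", (i,))

# Denormalized rename touches every revision row.
rows_denorm = db.execute("UPDATE old_rev SET title = 'New title'").rowcount

# Normalized rename touches exactly one row.
rows_norm = db.execute(
    "UPDATE page SET title = 'New title' WHERE page_id = 1").rowcount

print(rows_denorm, rows_norm)
```

With thousands of revisions held under a lock for the duration of the update, it's easy to see where the conflicts come from.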
One thing I think I will add is a text byte size field on the revision
table; with batched revision compression we can no longer easily get
the size without decompressing the text to see what it looks like.
Generally this size will not change, either, since a given revision's
source text is immutable.
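The idea is to record the uncompressed byte length once, at save time, so size queries never have to touch the compressed blob. A sketch under that assumption (field and function names here are illustrative, not the actual schema):

```python
import zlib

def make_revision_row(source_text: str) -> dict:
    # Measure the raw bytes once, before compression; since revision
    # text is immutable, this value never needs recomputing.
    raw = source_text.encode("utf-8")
    return {
        "rev_len": len(raw),          # hypothetical size column
        "text": zlib.compress(raw),   # what actually gets stored
    }

row = make_revision_row("Some wiki page text.")
print(row["rev_len"])  # size available without decompressing
```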
> Unrelated: will there be a periodic (costly) query to produce
> something similar to the cur dump, which is used by quite a few
> scripts? Downloading all complete db's is not workable.
We ought to have a more usable public dump format than the raw SQL
backups. Something based on Special:Export might be good for the kind
of processing you're doing, for instance. Consider also rewriting the
stats to work from the database.
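An export-style XML dump is also much friendlier to script against than raw SQL. A minimal sketch of consuming one (the XML structure below is a simplified stand-in, not the exact Special:Export schema):

```python
import xml.etree.ElementTree as ET

# Simplified export-like document for illustration.
dump = """
<mediawiki>
  <page>
    <title>Sandbox</title>
    <revision><text>Hello world</text></revision>
  </page>
  <page>
    <title>Main Page</title>
    <revision><text>Welcome</text></revision>
  </page>
</mediawiki>
"""

# Walk page elements and pull out title/text pairs; a stats script
# could count links, sizes, etc. here instead.
pages = {}
for page in ET.fromstring(dump).iter("page"):
    pages[page.findtext("title")] = page.findtext("./revision/text")

print(len(pages))  # 2
```

The point is that the consumer sees a stable, self-describing format instead of whatever the live schema happens to be this month.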
-- brion vibber (brion @ pobox.com)