Lee Daniel Crocker wrote:
> On Mon, 2005-05-09 at 12:43 -0700, Brion Vibber wrote:
>> A possible middle road is to rewrite the core wiki engine to a separate
>> daemon, and adapt the existing PHP user interface to call into it for
>> much of the backend work that actually touches data.
> That's the approach I'd favor. The existing PHP code represents a lot
> of good user-interface work for which PHP is perfectly suited. The
> underlying stuff could easily be split up into multiple daemons (say,
> one for wikitext, one for images, one for equations, ...) that could
> feed the PHP front-end.
>
> This would also allow incremental development, since each of the daemons
> could be written and attached individually without disturbing the rest
> of the codebase.
<nod>
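To make the shape of that split concrete, here's a minimal sketch of one
such daemon, in Python purely for illustration: the port, framing, and
render() stand-in are all invented, but the point is that each backend
concern becomes a small service the PHP front-end talks to over a local
socket.

    # Hypothetical wikitext-rendering daemon: reads length-prefixed
    # wikitext from a local TCP socket and writes back rendered HTML.
    # The port, framing, and render() below are invented for the sketch.
    import socketserver
    import struct

    def render(wikitext: str) -> str:
        # Stand-in for whatever engine replaces the PHP parser core.
        return "<p>" + wikitext.replace("\n\n", "</p><p>") + "</p>"

    class RenderHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # 4-byte big-endian length prefix, then UTF-8 wikitext.
            raw_len = self.rfile.read(4)
            if len(raw_len) < 4:
                return
            (length,) = struct.unpack(">I", raw_len)
            wikitext = self.rfile.read(length).decode("utf-8")
            html = render(wikitext).encode("utf-8")
            self.wfile.write(struct.pack(">I", len(html)) + html)

    if __name__ == "__main__":
        # One daemon per concern (wikitext, images, equations) could
        # listen on its own port; the PHP front-end is just a client.
        with socketserver.ThreadingTCPServer(("127.0.0.1", 8130),
                                             RenderHandler) as server:
            server.serve_forever()

The front-end could then swap daemons in one at a time, which is exactly
what makes the incremental path attractive.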
>> Our biggest, nastiest burden is with internal communications: the
>> database changes too much. We have to wait on things getting sent,
>> received, applied, and copied around, and lagging databases send
>> everything into the toilet fast.
> Yep. That's why I think the bulk of the text should be stored in a
> plain filesystem, where those problems are already well-known and
> solved, for the most part. That would reduce the database to just
> metadata, which would be much smaller and more efficient.
Replication's still an issue with filesystems; it seems like every
network filesystem a) sucks and b) is a SPOF, and every distributed
filesystem a) sucks and b) sucks.
This is one reason we've been talking about using an external object
store rather than a filesystem. But a filesystem might work too; anyway
that's a separate issue.
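Either way, the interface the metadata tables would need is tiny. Here's
a sketch (Python; paths and names invented) of a filesystem-backed text
store; an object store could implement the same two calls:

    # Sketch of the minimal store interface under discussion: bulk
    # text lives outside the database, keyed by an id kept in the
    # metadata tables. Paths and names are invented for illustration.
    import os

    class FilesystemTextStore:
        def __init__(self, root: str):
            self.root = root

        def _path(self, text_id: int) -> str:
            # Fan out into subdirectories so no single directory
            # accumulates millions of entries.
            h = "%08x" % text_id
            return os.path.join(self.root, h[:2], h[2:4], h)

        def save(self, text_id: int, text: str) -> None:
            path = self._path(text_id)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)

        def load(self, text_id: int) -> str:
            with open(self._path(text_id), encoding="utf-8") as f:
                return f.read()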
The 1.5 schema separates the text storage backend from the revision
metadata for just this purpose; the metadata tables are smaller, and we
can more easily transition to putting bulk text elsewhere entirely. In
fact we can do a gradual transition by storing serialized reference
objects in the text table.
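Roughly, that transition step could look like this sketch (Python; the
flag value and pointer format are stand-ins, not the actual 1.5
serialization):

    # Sketch of the gradual transition: a text-table row holds either
    # the text itself or a serialized reference to wherever the bulk
    # text has moved. The flag name and pointer format are stand-ins.
    import json

    def fetch_text(row, external_store):
        # row is (old_text, old_flags) as pulled from the text table.
        flags = row[1].split(",")
        if "pointer" in flags:
            # The row holds a reference object rather than the text,
            # e.g. {"store": "cluster1", "id": 12345}.
            ref = json.loads(row[0])
            return external_store.load(ref["store"], ref["id"])
        # Plain row: the text is still inline in the database.
        return row[0]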
> Yes! There's only one tricky part for which we may have to consider
> creative implementations: I tried as much as possible to take style
> markup (especially skin-specific) out of the rendered wikitext to
> allow it to be cached, but there's one case that's still a problem:
> red links (i.e., links to non-existent pages). Users shouted at me
> that this was a sine-qua-non feature, and so I had to leave it in.
> But it makes caching rendered wikitext hard, and slows down rendering.
That doesn't really make output caching harder at all; you just follow
the link tables back a step and purge the affected pages.
Currently this is done 'inline' during save; we could return from
certain saves a little faster if we can shunt the purging to a
background daemon, and it could handle the occasional mass-change case
more gracefully than a giant 30,000-row update.
(There are two steps here: updating the page_touched timestamp on page
records, and sending PURGE requests to the squid proxies. The updated
touched time invalidates the parser cache and client-side cached output
on the next visit, while sending explicit squid purges lets the squids
serve the cached pages they do have to anonymous visitors without having
to check with the master servers on every hit.)
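Strung together (Python sketch; the SQL and connection handling are
approximate, and url_for() is a hypothetical title-to-URL helper), the
whole per-title sequence a background worker would run is short:

    # Sketch of invalidating pages that link to a changed title:
    # bump page_touched, then PURGE the squids. Table and column
    # names follow the 1.5-style schema but are approximate.
    import http.client

    def invalidate_backlinks(db, squids, namespace, title, url_for):
        cur = db.cursor()
        # Follow the link table back to the affected pages.
        cur.execute(
            "SELECT pl_from FROM pagelinks"
            " WHERE pl_namespace = %s AND pl_title = %s",
            (namespace, title))
        page_ids = [r[0] for r in cur.fetchall()]
        if not page_ids:
            return
        # Step 1: bump page_touched so the parser cache and
        # client-side caches miss on the next request.
        placeholders = ",".join(["%s"] * len(page_ids))
        cur.execute(
            "UPDATE page SET page_touched = UTC_TIMESTAMP()"
            " WHERE page_id IN (" + placeholders + ")",
            page_ids)
        db.commit()
        # Step 2: explicit PURGEs let the squids keep serving their
        # other cached pages without revalidating on every hit.
        for host, port in squids:
            conn = http.client.HTTPConnection(host, port)
            for page_id in page_ids:
                conn.request("PURGE", url_for(page_id))
                conn.getresponse().read()
            conn.close()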
At one point I had started to write a purge daemon to handle the squid
purge requests, but put it off after various improvements to how they're
handled sped things up quite a bit. (Currently we're using some kind of
multicast thing, IIRC.)
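For the curious, the multicast approach boils down to one small HTCP CLR
datagram per URL, sent to a group address the squids subscribe to. A
rough Python rendering of what such a purger might send (HTCP CLR per
RFC 2756; treat the exact packet layout, group address, and TTL here as
an approximation):

    # Rough sketch of a multicast HTCP CLR purge. 4827 is the
    # standard HTCP port; the group address and TTL are placeholders.
    import random
    import socket
    import struct

    OP_CLR = 4  # HTCP opcode for "clear this URL"

    def htcp_clr_packet(url: bytes) -> bytes:
        # Specifier: method, URI, version, and empty request headers,
        # each as a length-prefixed COUNTSTR.
        specifier = (struct.pack(">H", 4) + b"HEAD" +
                     struct.pack(">H", len(url)) + url +
                     struct.pack(">H", 8) + b"HTTP/1.0" +
                     struct.pack(">H", 0))
        data_len = 10 + len(specifier)
        total_len = 4 + data_len + 2
        header = struct.pack(">HxxHBxIxx", total_len, data_len,
                             OP_CLR, random.getrandbits(32))
        return header + specifier + struct.pack(">H", 2)

    def multicast_purge(urls, group="239.128.0.112", port=4827):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Keep the datagrams on the local network segments.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
        for url in urls:
            sock.sendto(htcp_clr_packet(url.encode("utf-8")),
                        (group, port))
        sock.close()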
Note that updates to template pages share roughly the same issue here,
exacerbated by the possibility that a changed template will contain
different links. Currently we make no attempt to update the link
references from pages containing a template when the template changes,
as I recall.
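Handling it properly would look something like this sketch (Python; all
three callables are hypothetical hooks, since the inclusion lookup is
exactly what we don't have wired up today):

    # Sketch of template-change handling: every page that includes
    # the template gets reparsed so its stored link rows (and hence
    # its red links) are rebuilt, then invalidated as usual. All
    # three callables are hypothetical hooks.

    def on_template_change(db, template_title, pages_including,
                           reparse_and_update_links, touch_and_purge):
        for page_id in pages_including(db, template_title):
            # A changed template can change which links a page
            # contains, so its link rows must be rebuilt, not
            # just purged.
            reparse_and_update_links(db, page_id)
            touch_and_purge(db, page_id)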
Somewhat different but worth mentioning is the updating of link tables
on save: we've tended to flip-flop between strategies for deleting and
rewriting all the rows, and have ended up locking things during the
save, which turned ugly. We should make sure those are maintained
cleanly, swiftly, and without too much pain.
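A gentler strategy would be to diff the old and new link sets and touch
only the rows that actually changed; a sketch (Python, schema names
approximate):

    # Sketch of incremental link-table maintenance: compute set
    # differences instead of deleting and rewriting every row while
    # the save holds locks. Schema names are approximate.

    def update_pagelinks(db, page_id, new_links):
        cur = db.cursor()
        cur.execute(
            "SELECT pl_namespace, pl_title FROM pagelinks"
            " WHERE pl_from = %s", (page_id,))
        old = set(cur.fetchall())
        new = set(new_links)  # iterable of (namespace, title) pairs
        for ns, title in old - new:
            cur.execute(
                "DELETE FROM pagelinks WHERE pl_from = %s"
                " AND pl_namespace = %s AND pl_title = %s",
                (page_id, ns, title))
        for ns, title in new - old:
            cur.execute(
                "INSERT INTO pagelinks"
                " (pl_from, pl_namespace, pl_title)"
                " VALUES (%s, %s, %s)",
                (page_id, ns, title))
        db.commit()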
[snip various other stuff mostly addressed by Tim & co]
-- brion vibber (brion @ pobox.com)