"Brion Vibber" <brion(a)pobox.com> schrieb:
> On Apr 2, 2004, at 00:53, Jimmy Wales wrote:
> > It shouldn't run more than once per day at first. I'm not sure what
> > their goals are with respect to how often they would *like* to receive
> > it, but daily is a fine start.
>
> It would take hours just to run a complete dump, which would be the
> equivalent of a sizeable fraction of our total daily page views. (Best
> case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
>
> If we're going to run something like this daily, some sort of
> incremental updates are a must, though we can probably get away with
> stuffing the saved data per page in a database or such and slurping it
> back out fairly quickly.
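Caching the per-page output in a database along those lines could look roughly
like the sketch below. It is illustration only (in Python, not the actual
MediaWiki PHP), with a made-up dump_cache table and a render_page_xml() helper
standing in for whatever the real export code does:

    import sqlite3

    def get_page_xml(db, page_id, page_touched, render_page_xml):
        # Re-render a page's XML fragment only if it changed since the
        # cached copy was written; otherwise slurp it back out of the cache.
        row = db.execute(
            "SELECT touched, xml FROM dump_cache WHERE page_id = ?",
            (page_id,)).fetchone()
        if row is not None and row[0] == page_touched:
            return row[1]                    # cache hit
        xml = render_page_xml(page_id)       # cache miss: regenerate
        db.execute("INSERT OR REPLACE INTO dump_cache (page_id, touched, xml)"
                   " VALUES (?, ?, ?)", (page_id, page_touched, xml))
        db.commit()
        return xml

    db = sqlite3.connect("dump_cache.db")
    db.execute("CREATE TABLE IF NOT EXISTS dump_cache"
               " (page_id INTEGER PRIMARY KEY, touched TEXT, xml TEXT)")
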
What about keeping a daily table of all pages that were changed, removed
or added? When the new dump is made, we could read that table, and only
those pages would need to go into the XML dump (for slower search engines
we should keep the daily dumps around for a while, so that a crawler that
doesn't download every day can still pick up several days at once). A search
engine would then have to do one complete spidering (either by itself or
through the XML feed), but after that the XML feed can be much smaller.
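
A minimal sketch of what such a daily feed could look like, again Python
only for illustration, assuming a hypothetical daily_changes table with
(day, page_title, change_type) rows and a get_page_text() helper:

    import sqlite3
    from xml.sax.saxutils import escape

    def write_incremental_feed(db, day, get_page_text, out_path):
        # Emit only the pages touched on `day`; removed pages carry no text,
        # just a marker the search engine can act on.
        rows = db.execute(
            "SELECT page_title, change_type FROM daily_changes WHERE day = ?",
            (day,)).fetchall()
        with open(out_path, "w", encoding="utf-8") as out:
            out.write('<feed day="%s">\n' % day)
            for title, change in rows:
                attr_title = escape(title, {'"': '&quot;'})
                if change == "removed":
                    out.write('  <page title="%s" status="removed"/>\n' % attr_title)
                else:
                    out.write('  <page title="%s" status="%s">%s</page>\n'
                              % (attr_title, change, escape(get_page_text(title))))
            out.write('</feed>\n')

A crawler that only fetches every few days would then simply download the
feed files for the days it missed, since we keep them around for a while.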
A separate issue: what do we do about the international aspect? My proposal
would be to have separate XML feeds for the larger Wikipedias, and a single
combined one for all of the smaller ones, with the cut-off determined by the
size of the files in the feed.
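
To make the cut-off concrete, the split could be as simple as the following
sketch; the threshold and the sizes in the example are invented:

    def partition_feeds(dump_sizes, cutoff_bytes=50 * 1024 * 1024):
        # dump_sizes maps a language code to its estimated feed size in bytes.
        # Wikis above the cut-off get their own feed; the rest share one.
        own_feed = {lang for lang, size in dump_sizes.items() if size >= cutoff_bytes}
        combined_feed = set(dump_sizes) - own_feed
        return own_feed, combined_feed

    # With these made-up numbers, en and de each get a feed of their own,
    # while the smaller wikis end up bundled together.
    own, combined = partition_feeds({"en": 900e6, "de": 120e6, "nds": 2e6, "ia": 1e6})
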
Andre Engels