"Brion Vibber" brion@pobox.com schrieb:
On Apr 2, 2004, at 00:53, Jimmy Wales wrote:
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
It would take hours just to run a complete dump, which would be the equivalent of a sizeable fraction of our total daily page views. (Best case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
If we're going to run something like this daily, some sort of incremental updates are a must, though we can probably get away with stuffing the saved data per page in a database or such and slurping it back out fairly quickly.
What about having a daily table of all pages that are changed, removed or new? In that case, we could read that table when the new version is made, and only those pages would need to be in the XML dump (for slower search engines we should keep the dumps around for a while, so that engines which don't download every day can still pick up several days at once). A search engine would then have to do a complete spidering once (either by itself or through the XML feed), but after that the XML feed can be much smaller.
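To make the idea concrete, here is a rough sketch of how such an incremental dump could be produced (the daily_changes table, its columns and the dump format are made up for illustration; the page text is pulled from the cur table, column names from memory):

# Sketch only: write an incremental XML dump from a hypothetical
# daily_changes(day, title, change_type) table, where change_type is
# one of 'new', 'changed', 'removed' and the table is filled once a day.
import MySQLdb
from xml.sax.saxutils import escape, quoteattr

def incremental_dump(day, out):
    db = MySQLdb.connect(db='wikidb')
    c = db.cursor()
    c.execute("SELECT title, change_type FROM daily_changes WHERE day = %s",
              (day,))
    changes = c.fetchall()
    out.write('<pages date=%s>\n' % quoteattr(day))
    for title, change_type in changes:
        if change_type == 'removed':
            # removed pages only need a stub entry
            out.write('  <page title=%s status="removed"/>\n' % quoteattr(title))
            continue
        c.execute("SELECT cur_text FROM cur"
                  " WHERE cur_namespace = 0 AND cur_title = %s", (title,))
        row = c.fetchone()
        if row:
            out.write('  <page title=%s status=%s>%s</page>\n'
                      % (quoteattr(title), quoteattr(change_type), escape(row[0])))
    out.write('</pages>\n')
    db.close()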
As another issue, what do we do with the international aspect? My proposal would be to have separate XML feeds for the larger Wikipedias, and a single combined one for all the smaller ones; the cut-off would be determined by the size of the files in the feed.
Andre Engels
On Fri, 02 Apr 2004 15:18:00 +0000, Andre Engels wrote:
What about having a daily table of all pages that are changed, removed or new?
A list of URLs that have changed is easy to generate and could even be distributed in real time by using the purge messages we send out anyway. All it needs is another IP added to the squid array.
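For illustration, a minimal listener that could sit on that extra IP and log the purged URLs as they arrive (a sketch only; port and log path are arbitrary):

# Sketch: accept the HTTP PURGE requests MediaWiki already sends to its
# squids and append the purged URLs to a log file.
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class PurgeLogger(BaseHTTPRequestHandler):
    def do_PURGE(self):
        host = self.headers.getheader('Host', '')
        log = open('/var/log/purged-urls.log', 'a')
        log.write('http://%s%s\n' % (host, self.path))
        log.close()
        self.send_response(200)
        self.end_headers()

# port 80 assumed so the purge calls look like those sent to the real squids
HTTPServer(('', 80), PurgeLogger).serve_forever()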
As another issue, what do we do with the international aspect? My proposal would be to have separate XML feeds for the larger Wikipedias, and a single combined one for all the smaller ones; the cut-off would be determined by the size of the files in the feed.
The purges are for all languages and can be filtered by language (I already wrote a small Python script that does this daily for the stats).
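Roughly like this (a sketch, not the actual stats script; it simply goes by the <lang>.wikipedia.org hostname, and the file names are made up):

# Sketch: split a list of purged URLs by language based on the hostname.
from urlparse import urlparse

def language_of(url):
    host = urlparse(url)[1]        # e.g. 'de.wikipedia.org'
    return host.split('.')[0]      # -> 'de'

def split_by_language(infile):
    outfiles = {}
    for line in open(infile):
        url = line.strip()
        if not url:
            continue
        lang = language_of(url)
        if lang not in outfiles:
            outfiles[lang] = open('purged-%s.log' % lang, 'a')
        outfiles[lang].write(url + '\n')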
The main tasks I see for an XML feed are
* improving the parser to produce validating XHTML, and
* either writing a small wrapper that includes the same rendered content area as used by regular page views (-> ESI fragment in squid3, on the todo list; relies on the parser being fixed),
* or one that fetches the (often cached) (X)HTML from the squid, runs it through tidy --asxml, wraps it in a small XML file and returns the result. If this script were accessed through squid as well, subsequent requests from other search engines would use the cached version until the page changes again. An additional URL would need to be added to the purge call in Article.php to purge the feed version.
This would ensure that there are no additional DB requests and no additional content rendering involved, and if the feed were mainly fetched at night it would also pre-fill the squids with up-to-date anon pages.
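A rough sketch of that second variant (the hostname, the exact tidy flags beyond the --asxml mentioned above and the wrapper element are illustrative only):

# Sketch: fetch the (often cached) HTML for a page from the squid, run it
# through tidy to get well-formed XHTML, and wrap it in a small XML envelope.
import urllib
from subprocess import Popen, PIPE
from xml.sax.saxutils import quoteattr

def feed_entry(title, lang='en'):
    # fetch the rendered page through the squid layer
    url = 'http://%s.wikipedia.org/wiki/%s' % (lang, urllib.quote(title))
    html = urllib.urlopen(url).read()
    # the --asxml run mentioned above; -quiet keeps tidy's chatter out
    tidy = Popen(['tidy', '-asxml', '-quiet'], stdin=PIPE, stdout=PIPE)
    xhtml, _ = tidy.communicate(html)
    # wrap it in a minimal envelope; element name is made up
    return '<feedpage title=%s>\n%s</feedpage>\n' % (quoteattr(title), xhtml)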