On Fri, 02 Apr 2004 15:18:00 +0000, Andre Engels wrote:
> What about having a table daily of all pages that are changed, removed or new?
A list of URLs that have changed is easy to generate and could even be distributed in real time by using the purge messages we send out anyway. All it needs is another IP added to the squid array.
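For illustration, a minimal sketch of such a listener in Python (this assumes the purges arrive as plain HTTP PURGE requests and that the extra IP in the squid array just runs a small HTTP server; port and log path are made up):

#!/usr/bin/env python3
# Sketch: a listener registered as an extra "squid" IP that records every
# PURGE request MediaWiki sends, giving a real-time list of changed URLs.
# The port and log file name are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
from datetime import date

LOGFILE_TEMPLATE = "changed-urls-%s.log"   # hypothetical output path

class PurgeLogger(BaseHTTPRequestHandler):
    def do_PURGE(self):
        # Reconstruct the purged URL from Host header and request path.
        url = "http://%s%s" % (self.headers.get("Host", ""), self.path)
        with open(LOGFILE_TEMPLATE % date.today().isoformat(), "a") as log:
            log.write(url + "\n")
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass   # keep stderr quiet

if __name__ == "__main__":
    # Listen on the address that gets appended to the squid array.
    HTTPServer(("0.0.0.0", 80), PurgeLogger).serve_forever()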
> As another issue, what do we do with the international aspect? My proposal would be to have XML feeds for the larger Wikipedias, and a single one for the whole of the smaller ones; the cut-off being determined by the size of the files in the feed.
The purges are for all languages and can be filtered by language (I wrote a small Python script that already does this daily for the stats).
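Roughly, the per-language filtering could look like this (a sketch, not the actual script; it assumes URLs of the form http://<lang>.wikipedia.org/wiki/<Title> and made-up file names):

#!/usr/bin/env python3
# Sketch: split a day's purge log into one list of changed URLs per language,
# taking the language code from the hostname (e.g. "en" from en.wikipedia.org).
from urllib.parse import urlsplit
import sys

def split_by_language(logfile):
    per_language = {}
    with open(logfile) as log:
        for line in log:
            url = line.strip()
            if not url:
                continue
            host = urlsplit(url).hostname or ""
            lang = host.split(".")[0]
            per_language.setdefault(lang, set()).add(url)
    for lang, urls in per_language.items():
        with open("changed-urls-%s.txt" % lang, "w") as out:
            out.write("\n".join(sorted(urls)) + "\n")

if __name__ == "__main__":
    split_by_language(sys.argv[1])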
The main tasks I see for an XML feed are:

* improving the parser to produce validating XHTML, and
* either writing a small wrapper that includes the same rendered content area as used by regular page views (-> ESI fragment in squid3, on the todo list, relies on the parser being fixed),
* or one that fetches the (often cached) (X)HTML from the squid, runs it through tidy --asxml, wraps it in a small XML file and returns the result (see the sketch after this list).

If this script were accessed through squid as well, subsequent requests from other search engines would use the cached version until the page changes again. An additional URL would need to be added to the purge call in Article.php to purge the feed version.
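A rough sketch of that second variant in Python (fetch through the squid, clean up with tidy, wrap in a small XML envelope); the squid address, the tidy flags beyond -asxml, and the <entry> element are assumptions, not the real setup:

#!/usr/bin/env python3
# Sketch: build one feed entry by fetching a page through the squid (so a
# cached copy is reused when available), running it through tidy to get
# well-formed XML, and wrapping the result in a minimal envelope.
import subprocess
import urllib.request
from xml.sax.saxutils import quoteattr

SQUID = "127.0.0.1:3128"   # hypothetical local squid address

def feed_entry(page_url):
    # Fetch via the squid like an ordinary anon page view.
    req = urllib.request.Request(page_url)
    req.set_proxy(SQUID, "http")
    html = urllib.request.urlopen(req).read()

    # tidy turns the (often slightly invalid) HTML into well-formed XML;
    # --show-body-only keeps just the content so it can be embedded.
    tidy = subprocess.run(
        ["tidy", "-asxml", "-quiet", "--show-warnings", "no",
         "--show-body-only", "yes"],
        input=html, capture_output=True)
    xhtml = tidy.stdout.decode("utf-8", "replace")

    # Wrap the cleaned content in a small XML envelope for the feed.
    return "<entry url=%s>\n%s\n</entry>\n" % (quoteattr(page_url), xhtml)

if __name__ == "__main__":
    print(feed_entry("http://en.wikipedia.org/wiki/Main_Page"))

If this wrapper itself sat behind the squid, its output would be cached and purged just like the regular page views.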
This would ensure that no additional DB requests and no additional content rendering are involved, and if the feed were fetched mainly at night it would also pre-fill the squids with up-to-date anon pages.