Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia. Do you know how many there are? ...
Right. The purpose here is to make a friendly giant XML file so that Yahoo (and presumably other like-minded whoevers) can grab a single giant document to study, rather than having to crawl the whole site. The purpose (from our point of view) is to keep search engines more up to date, since they can download this file once per day or hour instead of running a crawl that takes weeks.
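(Purely as a sketch of the usual fix for an unbounded query like that: page through the table in batches keyed on the primary key. This assumes cur_id is cur's indexed primary key and uses a MySQL user variable to carry the largest id seen so far; neither detail comes from the actual script.)

    SET @last_seen_id = 0;   -- largest cur_id returned by the previous batch
    SELECT cur_id, cur_title
      FROM cur
     WHERE cur_namespace = 0
       AND cur_id > @last_seen_id
     ORDER BY cur_id
     LIMIT 1000;             -- repeat until fewer than 1000 rows come back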
> I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
*nod* It's so they can do the same thing they would do with a crawl of the site, but virtually instantaneously.
> I stress I don't really understand the purpose of the script, nor do I know exactly what Yahoo!'s (or anyone else's) requirements are, but it would seem way more sensible to me to have several smaller files, each containing at most 100 articles or perhaps at most 1 MB of data. Each file would then contain a list of cur_ids, so you could easily check, for each file, whether any of the articles therein have changed since the last update.
It does seem that rather than feeding them One Big File, we could feed them files of diffs or whatever. But that'd be more complex and require greater co-ordination. This at least has the virtue of simplicity.
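(A sketch of the per-file staleness check Timwi describes, assuming the cur table carries a last-edit time in a cur_timestamp column, which I haven't verified, and that the export recorded which cur_ids went into each file; the ids and timestamp shown are made up:)

    SELECT COUNT(*)
      FROM cur
     WHERE cur_id IN (4, 108, 297)            -- the ids stored for this file
       AND cur_timestamp > '20030615000000';  -- time of the last export run

A nonzero count would mean that file is stale and should be regenerated.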
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
--Jimbo