Jason Richey wrote:
> Someone would look at it (I attached it) and say "this sucks because..."
OK, since you asked for it... :)
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia. Do you know how many there are? ...
The Main Page states 239,180, and that's counting only the articles that meet certain criteria...
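Incidentally, you can ask the database directly how many rows that is. A minimal sketch, assuming the script's existing mysql connection is still open:

$res = mysql_query("SELECT COUNT(*) AS n FROM cur WHERE cur_namespace=0");
$row = mysql_fetch_object($res);
echo "Articles in the main namespace: {$row->n}\n";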
$data = getPageData($s->title);
It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
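If you really do need the text of every article, then at the very least fetch it in the same query instead of issuing one extra query per row. A rough sketch of what I mean, assuming the standard cur schema (cur_text holds the article text) and the script's existing connection; getPageData() disappears entirely, and writeArticle() is a made-up placeholder for whatever output the script produces:

$sql = "SELECT cur_title, cur_text FROM cur WHERE cur_namespace=0";
// Unbuffered, so PHP doesn't try to hold a quarter of a million rows in memory
$res = mysql_unbuffered_query($sql);
while ($s = mysql_fetch_object($res)) {
    writeArticle($s->cur_title, $s->cur_text); // hypothetical output helper
}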
I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
I stress that I don't really understand the purpose of the script, nor do I know exactly what Yahoo!'s (or anyone else's) requirements are, but it would seem far more sensible to me to generate several smaller files, each containing maybe at most 100 articles or at most 1 MB of data or so. Each file would also record the cur_ids of the articles in it, so that on each update you can easily check, per file, whether any of its articles have changed since the last run and regenerate only those files.
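For concreteness, a rough sketch of that approach; the chunk size, the file naming and the formatArticle() helper are all made up, and cur_timestamp (the time of the last edit) is what a subsequent run would compare against:

$chunkSize = 100; // made-up limit; could equally be a byte limit per file
$res = mysql_unbuffered_query(
    "SELECT cur_id, cur_title, cur_text, cur_timestamp
     FROM cur WHERE cur_namespace=0 ORDER BY cur_id");
$n = 0; $chunk = 0; $fp = null;
while ($s = mysql_fetch_object($res)) {
    if ($n % $chunkSize == 0) {  // start a new file every $chunkSize articles
        if ($fp) fclose($fp);
        $fp = fopen(sprintf("wiki-%05d.txt", ++$chunk), "w"); // hypothetical naming
    }
    // record cur_id and cur_timestamp in the file so the next run can
    // tell whether anything in this chunk has changed since last time
    fwrite($fp, formatArticle($s)); // formatArticle() is whatever format Yahoo! wants
    $n++;
}
if ($fp) fclose($fp);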
Of course, that's just a suggestion.
Greetings, Timwi