On Apr 2, 2004, at 00:35, Timwi wrote:
> $sql = "SELECT cur_title as title from cur where cur_namespace=0";
>
> This query sucks big time.
>
> Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia.
That's kinda the point, yeah. It might be better to skip redirects, though; otherwise they should be handled in some distinct way.
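Skipping them would just mean filtering on the redirect flag in the same query, something like (a rough sketch, assuming cur_is_redirect is what we want here):

  # same select as above, minus redirects
  $sql = "SELECT cur_title as title from cur
          where cur_namespace=0 and cur_is_redirect=0";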
> It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
That's obviously a bit inefficient, but yes. Incremental updates of only changed pages could hypothetically lead to faster output generation after the first run, though this would require some intermediate storage (since we don't yet have a running parser cache).
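An incremental pass might look something like this (just a sketch; $lastRun stands in for a timestamp recorded from the previous run, and pulling cur_text in the same select would also avoid the per-page getPageData() round trip):

  # only pages edited since the last run, text included
  $sql = "SELECT cur_title as title, cur_text as text from cur
          where cur_namespace=0 and cur_is_redirect=0
          and cur_timestamp > '$lastRun'";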
> I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
(It would produce a series of files up to about 12.5 megabytes in length, not one big file.)
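(Roughly, the splitting just caps each output file by size and rolls over to the next one; this is only an illustrative sketch, not the actual code, and the cap and file names are made up:

  $cap = 12.5 * 1024 * 1024;   // rough per-file size limit
  $n = 0;
  $out = fopen( sprintf( "feed-%03d.idif", $n ), "w" );
  foreach ( $articles as $record ) {
      // start a new file once the current one would exceed the cap
      if ( ftell( $out ) + strlen( $record ) > $cap ) {
          fclose( $out );
          $out = fopen( sprintf( "feed-%03d.idif", ++$n ), "w" );
      }
      fwrite( $out, $record );
  }
  fclose( $out );
)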
A text base without the unnecessary UI elements could improve search results, and I suppose it can be kept more complete more easily than by constantly spidering a 200k+ page site. *shrug* If that's the data format they want, hey fine, though having to download the entire set of a couple hundred megabytes for every update doesn't sound ideal.
Jason, would each output need to be self-contained, or can they accept incremental updates in IDIF? How often would they pull updates?
-- brion vibber (brion @ pobox.com)