On Apr 2, 2004, at 00:35, Timwi wrote:
> $sql = "SELECT cur_title as title from cur where cur_namespace=0";
>
> This query sucks big time.
>
> Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia.
That's kinda the point, yeah. It might be better to skip redirects, though; otherwise they should be handled in some distinct way.
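Skipping them would just mean filtering on the redirect flag in the same query, something like (a rough sketch, assuming cur_is_redirect is what we want here):

  # same select as above, minus redirects
  $sql = "SELECT cur_title as title from cur
          where cur_namespace=0 and cur_is_redirect=0";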
> It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
That's obviously a bit inefficient, but yes. Incremental updates of only changed pages could hypothetically lead to faster output generation after the first run, though this would require some intermediate storage (since we don't yet have a running parser cache).
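An incremental pass might look something like this (just a sketch; $lastRun stands in for a timestamp recorded from the previous run, and pulling cur_text in the same select would also avoid the per-page getPageData() round trip):

  # only pages edited since the last run, text included
  $sql = "SELECT cur_title as title, cur_text as text from cur
          where cur_namespace=0 and cur_is_redirect=0
          and cur_timestamp > '$lastRun'";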
> I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
(It would produce a series of files up to about 12.5 megabytes in length, not one big file.)
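(Roughly, the splitting just caps each output file by size and rolls over to the next one; this is only an illustrative sketch, not the actual code, and the cap and file names are made up:

  $cap = 12.5 * 1024 * 1024;   // rough per-file size limit
  $n = 0;
  $out = fopen( sprintf( "feed-%03d.idif", $n ), "w" );
  foreach ( $articles as $record ) {
      // start a new file once the current one would exceed the cap
      if ( ftell( $out ) + strlen( $record ) > $cap ) {
          fclose( $out );
          $out = fopen( sprintf( "feed-%03d.idif", ++$n ), "w" );
      }
      fwrite( $out, $record );
  }
  fclose( $out );
)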
A text base without the unnecessary UI elements could improve search results, and I suppose it can be kept more complete more easily than by constantly spidering a 200k+ page site. *shrug* If that's the data format they want, hey fine, though having to download the entire set of a couple hundred megabytes for every update doesn't sound ideal.
Jason, would each output need to be self-contained, or can they accept incremental updates in IDIF? How often would they pull updates?
-- brion vibber (brion @ pobox.com)