Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia. Do you know how many there are? ...
Right. The purpose here is to make a friendly giant XML file so that Yahoo (and presumably other like-minded whoevers) can grab a single giant document to study, rather than having to crawl the whole site. The purpose (from our point of view) is to keep search engines more up to date, since they can download this file once per day or hour instead of running a crawl that takes weeks.
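(Purely as a sketch of the usual fix for an unbounded query like that: page through the table in batches keyed on the primary key. This assumes cur_id is cur's indexed primary key and uses a MySQL user variable to carry the largest id seen so far; neither detail comes from the actual script.)

    SET @last_seen_id = 0;   -- largest cur_id returned by the previous batch
    SELECT cur_id, cur_title
      FROM cur
     WHERE cur_namespace = 0
       AND cur_id > @last_seen_id
     ORDER BY cur_id
     LIMIT 1000;             -- repeat until fewer than 1000 rows come back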
> I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
*nod* It's so they can do the same thing they would do with a crawl of the site, but virtually instantaneously.
> I stress I don't really understand the purpose of the script, nor do I know exactly what Yahoo!'s (or anyone else's) requirements are, but it would seem way more sensible to me to have several smaller files, each containing at most 100 articles or perhaps at most 1 MB of data. Each file would then contain a list of cur_ids, so you could easily check, for each file, whether any of the articles therein have changed since the last update.
It does seem that rather than feeding them One Big File, we could feed them files of diffs or whatever. But that'd be more complex and require greater co-ordination. This at least has the virtue of simplicity.
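(A sketch of the per-file staleness check Timwi describes, assuming the cur table carries a last-edit time in a cur_timestamp column, which I haven't verified, and that the export recorded which cur_ids went into each file; the ids and timestamp shown are made up:)

    SELECT COUNT(*)
      FROM cur
     WHERE cur_id IN (4, 108, 297)            -- the ids stored for this file
       AND cur_timestamp > '20030615000000';  -- time of the last export run

A nonzero count would mean that file is stale and should be regenerated.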
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
--Jimbo