On Apr 1, 2004, at 18:20, Jason Richey wrote:
> I'm putting the finishing touches on a script that exports Wikipedia in a format that can be directly imported into Yahoo!'s (and others') search engine. It's nothing pretty (in fact, it's my first PHP), but I'd be grateful if 2 things would happen:
> - Someone would look at it (I attached it) and say "this sucks because..."
A couple notes...
Unless a charset is explicitly set, XML is assumed to be Unicode (auto-detected between UTF-8, UTF-16 big-endian, and UTF-16 little-endian). The output should either be converted to UTF-8 or marked as ISO-8859-1 in the XML declaration, like this: <?xml version="1.0" encoding="ISO-8859-1"?>
Actually it might be better to mark it as Windows-1252 rather than ISO-8859-1, since sometimes bad Windows-specific characters sneak in and could make the feed invalid if included literally.
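For example, an untested sketch of both options in PHP ($text is just a stand-in for whatever string holds the page text; iconv() needs PHP's iconv extension):

  <?php
  // Option 1: declare the real encoding in the XML prolog.
  echo '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n";

  // Option 2: convert everything to UTF-8 up front and keep the
  // default prolog. Use 'WINDOWS-1252' as the source encoding if
  // stray Windows characters are a concern.
  $text = iconv('ISO-8859-1', 'UTF-8', $text);
  ?>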
Page titles may contain ampersands and some other funky chars, and need to be escaped in a URL; you can use urlencode() to URL-encode them and htmlspecialchars() to make them safe for literal inclusion in XML. $title_text should also be run through htmlspecialchars().
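Something along these lines (untested; $title is a stand-in for the raw title, and the URL prefix is hard-coded just for illustration):

  <?php
  // Escape the raw title for use in a URL, then escape the full URL
  // for literal inclusion in the XML.
  $url = 'http://en.wikipedia.org/wiki/' . urlencode($title);
  echo '<url>' . htmlspecialchars($url) . '</url>' . "\n";

  // The display title needs the same XML escaping.
  echo '<title>' . htmlspecialchars($title_text) . '</title>' . "\n";
  ?>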
Otherwise it looks like it ought to be more or less functional, if a bit hard-coded in some places. (I haven't tested it yet.)
> I didn't come up with any good way of getting keywords for a given page. Using the linked page titles was a suggestion. Other ideas?
I believe the keyword support Magnus recently added to the CVS code grabs additional keywords from the set of links in the page (which you could pull out of a join on links and cur). That's one possibility...
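Roughly like this (untested sketch; the l_from/l_to column names are from memory and the links schema has changed between versions, so double-check, and $cur_id is a stand-in for the page's id):

  <?php
  // Pull the titles of pages linked from page $cur_id as keyword
  // candidates. Assumes an open MySQL connection and that l_from/l_to
  // hold cur_id values.
  $id  = intval($cur_id);
  $res = mysql_query(
      "SELECT cur_title FROM links, cur WHERE l_from=$id AND l_to=cur_id");
  $keywords = array();
  while ($row = mysql_fetch_row($res)) {
      $keywords[] = str_replace('_', ' ', $row[0]); // titles use underscores
  }
  ?>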
-- brion vibber (brion @ pobox.com)