On Apr 1, 2004, at 18:20, Jason Richey wrote:
> I'm putting the finishing touches on a script that exports Wikipedia in a format that can be directly imported into Yahoo!'s (and others') search engine. It's nothing pretty (in fact, it's my first PHP), but I'd be grateful if 2 things would happen:
> - Someone would look at it (I attached it) and say "this sucks because..."
A couple notes...
Unless a charset is explicitly set, XML is assumed to be Unicode (auto-detected between UTF-8, UTF-16 big-endian, and UTF-16 little-endian). The output should either be converted to UTF-8 or marked as ISO-8859-1 in the XML declaration, like this: <?xml version="1.0" encoding="ISO-8859-1"?>
Actually it might be better to mark it as Windows-1252 rather than ISO-8859-1, since sometimes bad Windows-specific characters sneak in and could make the feed invalid if included literally.
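For example, an untested sketch of both options in PHP ($text is just a stand-in for whatever string holds the page text; iconv() needs PHP's iconv extension):

  <?php
  // Option 1: declare the real encoding in the XML prolog.
  echo '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n";

  // Option 2: convert everything to UTF-8 up front and keep the
  // default prolog. Use 'WINDOWS-1252' as the source encoding if
  // stray Windows characters are a concern.
  $text = iconv('ISO-8859-1', 'UTF-8', $text);
  ?>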
Page titles may contain ampersands and some other funky chars, and need to be escaped in a URL; you can use urlencode() to URL-encode them and htmlspecialchars() to make them safe for literal inclusion in XML. $title_text should also be run through htmlspecialchars().
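Something along these lines (untested; $title is a stand-in for the raw title, and the URL prefix is hard-coded just for illustration):

  <?php
  // Escape the raw title for use in a URL, then escape the full URL
  // for literal inclusion in the XML.
  $url = 'http://en.wikipedia.org/wiki/' . urlencode($title);
  echo '<url>' . htmlspecialchars($url) . '</url>' . "\n";

  // The display title needs the same XML escaping.
  echo '<title>' . htmlspecialchars($title_text) . '</title>' . "\n";
  ?>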
Otherwise it looks like it ought to be more or less functional, if a bit hard-coded in some places. (I haven't tested it yet.)
> I didn't come up with any good way of getting keywords for a given page. Using the linked page titles was a suggestion. Other ideas?
I believe the keyword support Magnus recently added to the CVS code grabs additional keywords from the set of links in the page (which you could pull out of a join on links and cur). That's one possibility...
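Roughly like this (untested sketch; the l_from/l_to column names are from memory and the links schema has changed between versions, so double-check, and $cur_id is a stand-in for the page's id):

  <?php
  // Pull the titles of pages linked from page $cur_id as keyword
  // candidates. Assumes an open MySQL connection and that l_from/l_to
  // hold cur_id values.
  $id  = intval($cur_id);
  $res = mysql_query(
      "SELECT cur_title FROM links, cur WHERE l_from=$id AND l_to=cur_id");
  $keywords = array();
  while ($row = mysql_fetch_row($res)) {
      $keywords[] = str_replace('_', ' ', $row[0]); // titles use underscores
  }
  ?>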
-- brion vibber (brion @ pobox.com)