I have been developing a small search engine for the last 12 months, using Wikipedia as sample data, and have made it available at futef.com. Recently I did further parsing and merging of the Wikipedia data, and I now merge together all the titles, e.g. "Jimmy Wales" and "Jimbo Wales", that are attached to the same article rather than connected just via #REDIRECT.
In doing this, along with some relevancy tuning, I uncovered some interesting things about the dataset. A search for "jimbo wales" returns "exploding whale" as a top article, because somebody has added a redirect for "exploding jimbo wales" as well as "king jimbo wales". I can fix my search: since nobody links to "exploding jimbo wales", I can assume it is a junk link and exclude it.
But I wanted to know whether this community would be interested in using the search facilities to verify and explore some of the data. If there is some interest, I would be happy to create a richer interface into the search engine that would allow more data anomalies to be exposed.
A Few More Examples:
Better than france -> Italy
Better than germany -> Italy
Cheese Eating Surrender Monkies -> France
Then there are more subtle issues like
Educational background of George W. Bush -> Yale
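
The "nobody links to it, so treat it as junk" rule mentioned earlier could be sketched roughly as follows. This is only an illustration, not FUTEF's actual code: the function name `filter_junk_redirects`, the data layout, the case-insensitive matching, and the sample link pairs are all assumptions made up for the example. Only the redirect titles come from this message.

```python
from collections import Counter


def filter_junk_redirects(redirects, links):
    """Split redirects into those with inbound links and those without.

    redirects: dict mapping a redirect title to its target article.
    links: iterable of (source_article, linked_title) pairs.
    Titles are compared case-insensitively (an assumption of this sketch).
    """
    # Count how many times each title is linked to from any article.
    inbound = Counter(title.lower() for _source, title in links)

    kept, junk = {}, {}
    for title, target in redirects.items():
        if inbound[title.lower()] > 0:
            kept[title] = target   # someone links to it: keep for search
        else:
            junk[title] = target   # zero inbound links: assume junk
    return kept, junk


# Redirect titles taken from the examples in this message.
redirects = {
    "Jimbo Wales": "Jimmy Wales",
    "Exploding jimbo wales": "Exploding whale",
}

# Hypothetical link data, invented purely for illustration.
links = [
    ("Wikipedia", "Jimbo Wales"),
    ("Wikimedia Foundation", "jimbo wales"),
]

kept, junk = filter_junk_redirects(redirects, links)
print("kept:", kept)
print("junk:", junk)
```

A zero-link cutoff is the simplest possible threshold. It would also drop legitimate but rarely used redirects, so the `junk` set is probably better treated as a review list than as something to delete outright.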
I haven't absorbed enough of the Wikipedia ethos to offer a strong position on all of the things I have found, but it would be great if people who are interested had a better set of tools to work with the data. What do you think?
BTW, I have been lurking here for some time and watched a conversation appear about the relevancy of FUTEF, and I took the criticism to heart. It was not as good as it should be. It still isn't, but I have worked to make it better. The several cases that were mentioned on the mailing list in particular have been fixed, and overall relevancy has greatly improved. If anyone has issues, please let me know; the previous comments were very helpful.
thanks, derek
-- http://futef.com derek@futef.com dgottfrid@gmail.com