Hi,
We have developed a fast similarity search algorithm (FastSS) for keyword search and have tested it on English Wikipedia articles. The similarity metric is the edit distance between words, which is language-independent. Results are ordered by occurrence and edit distance. A demo of our prototype is available at http://fastss.csg.uzh.ch/.
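For readers unfamiliar with the metric, here is a rough illustration of word-level edit distance (Levenshtein distance) and of ordering matches by distance and occurrence. It is only a sketch and not the FastSS implementation: it scans every word linearly, which is exactly what an indexed approach is meant to avoid. The function names, the `word_counts` input, the `max_distance` cutoff, and the tie-breaking rule (smaller distance first, then higher occurrence) are assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer word, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rank_matches(query, word_counts, max_distance=2):
    """Hypothetical ranking: closest words first, then most frequent.

    word_counts maps each indexed word to its occurrence count
    (an assumed input format, not taken from FastSS).
    """
    hits = []
    for word, count in word_counts.items():
        distance = edit_distance(query, word)
        if distance <= max_distance:
            hits.append((distance, -count, word))
    return [(word, d, -neg_count) for d, neg_count, word in sorted(hits)]


if __name__ == "__main__":
    counts = {"wikipedia": 10, "wikimedia": 5, "wikipedias": 7, "lucene": 3}
    for word, distance, count in rank_matches("wikipedia", counts):
        print(word, distance, count)
```

Because the distance is computed on characters alone, the same code applies to any language, which is the sense in which the metric is language-independent.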
Indexing the complete English Wikipedia currently takes about 3 days. We are still working on improving performance, resolving issues with umlauts, improving the rendering of the results page, and including article titles in the indexing phase.
Should we work towards a MediaWiki integration of FastSS? Would you be interested in including FastSS in Wikimedia? Any comments are welcome.
Regards,
Thomas Bocek
It'd be great if you could run this on different OSS packages like WordPress, MediaWiki, and Vanilla forum to provide site-wide content search, and also to pull up related posts/articles/discussions for a given topic.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 11/6/07, Thomas Bocek <bocek@ifi.uzh.ch> wrote:
Should we work towards a mediawiki integration of FastSS? Is it of interest to you to include FastSS into wikimedia? Any comments are welcome.
Three people I would suggest you consider contacting are Robert Stojnić (rainman), who maintains the Lucene search extension used by Wikipedia and has been fiddling with improvements somewhat recently; Gregory Maxwell (gmaxwell), who has the title of Chief Research Coordinator and has (IIRC) done work on similarity-searching on Wikipedia in the past; and maybe Brion Vibber (brion), the lead developer and CTO, although he tends to be really busy.