Hello,
I'm a computer science researcher at the University of Avignon, in France. I recently developed a piece of software that automatically and quickly extracts from a UTF-8 text all the (longest) terms that belong to a large set of terms. The term extractor runs as a server, and I tested it successfully with a thesaurus made of the page titles of fr.wikipedia.org, en.wikipedia.org and es.wikipedia.org, i.e. 9,387,079 distinct terms composed of 4,496,195 distinct words. You are invited to test my demonstration at: http://dev.termwatch.es/~jourlin/demo.php The source code can be found on GitHub (conditions of use, redistribution, and modification under the terms of the GNU General Public License v3): https://github.com/jourlin/FELTS
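To illustrate the task only, here is a minimal sketch of greedy longest-match extraction over a term set. This is a hypothetical toy, not FELTS's actual algorithm (which is optimised for millions of terms and runs as a server); the function name and parameters are my own.

```python
# Hypothetical sketch of longest-match term extraction;
# FELTS's real implementation may differ substantially.

def extract_longest_terms(text, terms, max_len=10):
    """Return the longest known terms found in `text`, scanning left to right."""
    words = text.lower().split()
    # Store each term as a tuple of its words for span lookup.
    term_set = {tuple(t.lower().split()) for t in terms}
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if span in term_set:
                found.append(" ".join(span))
                i += n
                break
        else:
            i += 1  # no term starts at this word
    return found

terms = ["natural language processing", "language", "term extraction"]
print(extract_longest_terms(
    "fast term extraction for natural language processing", terms))
# → ['term extraction', 'natural language processing']
```

Note that "natural language processing" is preferred over its embedded term "language", matching the longest-term behaviour described above.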
I suspect it could be of some interest for the development of MediaWiki, but I would very much appreciate any feedback before I look further into that question.
Best regards,
Pierre Jourlin.