Hello,
I'm a computer science researcher at the University of Avignon, in France. I
recently developed a piece of software that automatically and quickly extracts
from a UTF-8 text all the (longest) terms that belong to a large set of terms.
The term extractor runs as a server, and I tested it successfully with a
thesaurus made of the page titles of
fr.wikipedia.org,
en.wikipedia.org
and
es.wikipedia.org, i.e. 9,387,079 distinct terms composed of 4,496,195
distinct words.
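
To make the "longest terms" behaviour concrete, here is a minimal sketch (not
FELTS itself, and far slower than a server handling millions of terms) of
greedy longest-match extraction over a small hypothetical term set:

```python
def extract_longest_terms(text, terms):
    """Return the longest dictionary terms found in `text`, left to right."""
    words = text.split()
    # Longest term length in words, to bound the inner loop.
    max_len = max(len(t.split()) for t in terms)
    found = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest window first so longer terms win over sub-terms.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in terms:
                match = candidate
                break
        if match:
            found.append(match)
            i += len(match.split())  # skip past the matched term
        else:
            i += 1
    return found

terms = {"natural language processing", "language", "processing"}
print(extract_longest_terms("we study natural language processing daily", terms))
# → ['natural language processing'], not the shorter sub-terms
```

FELTS presumably uses a much more efficient data structure than this naive
window scan; the sketch only illustrates the matching semantics.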
You are invited to try the demonstration at:
http://dev.termwatch.es/~jourlin/demo.php
The source code can be found on GitHub (use, redistribution and
modification are permitted under the terms of the GNU General Public License v3):
https://github.com/jourlin/FELTS
I suspect it could be of some interest for the development of
MediaWiki, but I would very much appreciate any feedback before I look
further into that question.
Best regards,
Pierre Jourlin.