Hello,
I'm a computer science researcher at the University of Avignon, in France. I recently developed a piece of software that automatically and quickly extracts from a UTF-8 text all the (longest) terms that belong to a large set of terms. The term extractor runs as a server, and I tested it successfully with a thesaurus made of the page titles of fr.wikipedia.org, en.wikipedia.org and es.wikipedia.org, i.e. 9,387,079 distinct terms composed of 4,496,195 distinct words. You are invited to test my demonstration at: http://dev.termwatch.es/~jourlin/demo.php The source code can be found on GitHub (conditions of use, redistribution, and modification under the terms of the GNU General Public License v3): https://github.com/jourlin/FELTS
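To illustrate the task only, here is a minimal sketch of greedy longest-match extraction over a term set. This is a hypothetical toy, not FELTS's actual algorithm (which is optimised for millions of terms and runs as a server); the function name and parameters are my own.

```python
# Hypothetical sketch of longest-match term extraction;
# FELTS's real implementation may differ substantially.

def extract_longest_terms(text, terms, max_len=10):
    """Return the longest known terms found in `text`, scanning left to right."""
    words = text.lower().split()
    # Store each term as a tuple of its words for span lookup.
    term_set = {tuple(t.lower().split()) for t in terms}
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if span in term_set:
                found.append(" ".join(span))
                i += n
                break
        else:
            i += 1  # no term starts at this word
    return found

terms = ["natural language processing", "language", "term extraction"]
print(extract_longest_terms(
    "fast term extraction for natural language processing", terms))
# → ['term extraction', 'natural language processing']
```

Note that "natural language processing" is preferred over its embedded term "language", matching the longest-term behaviour described above.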
I suspect it could be of some interest for the development of MediaWiki, but I would very much appreciate any feedback before I look further into that question.
Best regards,
Pierre Jourlin.