Hi Good People.
I'd like to thank everyone for helping me with labs, code reviews and other
difficulties.
Recent Search-Related Activity:
1. Branched the project to
svn:https://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-3
2. Upgraded the code from Lucene 2.4.0 to 2.9.1 last December; since then I've been
reviewing and committing to svn.
3. I've migrated the project from Ant to Maven.
4. We have placed the Maven-based code under continuous integration on Jenkins, with JUnit,
PMD, and coverage reports in place.
5. With the help of some excellent volunteers, I've set up a lab to test the build
using Simple English Wikipedia.
6. One major setback is that there is currently no proper way to test or deploy updates.
For this reason I've not closed any of the bugs I've worked on. Access to the
production machines is considered too sensitive now that there are labs, but the labs
do not yet have the capacity to replicate production, and attempts to set up a lab
which replicates production have been unsuccessful. Once the scripts are sanitized
and production search is put into Puppet it may become possible; for now the labs
environment is a far cry from production in terms of both content and updating.
7. I've done some rough analysis and design for the next version of search, which will
feature computational-linguistics support for the many languages used in the Wikipedias,
search analytics (optimizing ranking), and innovative content analytics for ranking,
including objective metrics on neutral point of view (via sentiment analysis), notability
(via semantic algorithms), and checking of external links (anti-link spam).
8. We are trying to relicense the search code so that the Lucene community in the Apache
projects will become more involved. It may also be necessary to relicense MWdumper, since
the two projects are related.
In the pipeline:
1. Testing and integration into Lucene of an ANTLR grammar which parses wiki-syntax
tables. Once successful, it will also be integrated into SOLR and become the prototype
for more difficult wiki-syntax analysis tasks.
2. I've started on some of the NLP tasks:
a. A transducer of Devanagari scripts to IPA (in HFST).
b. A transducer of English to IPA (also in HFST), with the common goal of indexing
named entities based on their sound in a language-agnostic fashion.
c. Extraction of phonetics data from English Wiktionary.
d. Conversion of the CMU pronouncing dictionary to IPA.
e. Extraction of bilingual lexicons from Wiktionary and conversion to Apertium formats.
f. Unsupervised learning of morphologies using minimum description length.
g. Sentence boundary detection (SVM and MaxEnt Models).
h. Topological text alignment algorithm.
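As an illustration of task d, a minimal Python sketch of ARPAbet-to-IPA conversion. The
mapping table here is abbreviated to a handful of phonemes I chose for the example, and
lexical stress digits are simply dropped rather than turned into IPA stress marks; the
real conversion would cover the full ARPAbet inventory.

```python
# Abbreviated ARPAbet-to-IPA mapping (illustrative subset, not the full inventory).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "EH": "ɛ", "IY": "i",
    "K": "k", "T": "t", "D": "d", "N": "n", "L": "l", "R": "ɹ", "S": "s",
}

def arpabet_to_ipa(phones):
    """Convert a CMUdict phone string (e.g. 'K AE1 T') to an IPA string."""
    out = []
    for phone in phones.split():
        base = phone.rstrip("012")             # strip lexical stress digits
        out.append(ARPABET_TO_IPA.get(base, base.lower()))
    return "".join(out)

print(arpabet_to_ipa("K AE1 T"))   # CMUdict entry for "cat" → kæt
```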
3. A Maven-based POM for building and packaging SOLR plus our extension for distributed use.
4. A repository for NLP artifacts built from WikiContent.
Oren Bochman
MediaWiki Search Lead.