Hi Good People.
I'd like to thank everyone for helping me with labs, code reviews and other difficulties.
Recent Search-Related Activity:
1. Branched the project to SVN: https://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-3
2. Upgraded the code from Lucene 2.4.0 to 2.9.1 last December; since then I've been reviewing and committing to SVN. (A sketch of the token-stream API change this upgrade involves appears after this list.)
3. Migrated the project's build from Ant to Maven.
4. Placed the Maven-based code under continuous integration on Jenkins, with JUnit, PMD, and coverage reports in place.
5. With the help of some excellent volunteers, set up a lab to test the build using Simple English Wikipedia.
6. One major setback is that there is currently no proper way to test or deploy the update, so I have not closed any of the bugs I've worked on. Access to the production machines is considered too sensitive now that there are labs, and setting up a lab that replicates production has so far been unsuccessful; labs do not yet have the capacity for it. Once the scripts are sanitized and production search is put into Puppet, this may become possible, but at present the labs environment is a far cry from production in terms of both content and updating.
7. Done some rough analysis and design for the next version of search, which will feature computational-linguistics support for the many languages used in the Wikipedias; search analytics (optimizing ranking); and content analytics for ranking, including objective metrics for neutral point of view (via sentiment analysis), notability (via semantic algorithms), and checking of external links (anti-link spam).
8. We are trying to relicense the search code so that the Lucene community in the Apache projects can become more involved. It may also be necessary to relicense MWdumper, since the two projects are related.
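(Aside on item 2: the 2.4.0 to 2.9.1 upgrade touches Lucene's token-stream API, since 2.9 deprecates the old Token-based iteration in favor of the attribute-based one. Below is a minimal sketch of the 2.9-style consumption loop; the analyzer choice, field name, and sample text are illustrative, not taken from our code.)

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenStreamDemo {
        public static void main(String[] args) throws Exception {
            // As of 2.9 the analyzer takes a Version constant for back-compat behavior.
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
            TokenStream ts = analyzer.tokenStream("text",
                new StringReader("Lucene-Search 3 status update"));
            // 2.9-style attribute API: fetch a TermAttribute once,
            // then step through the stream with incrementToken().
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                System.out.println(term.term());
            }
            ts.close();
        }
    }

Code written against the old next()/Token API compiles with deprecation warnings on 2.9 but should be migrated to this style, which is most of the mechanical work in the upgrade.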
In the pipeline:
1. Testing and integrating into Lucene an ANTLR grammar that parses wiki-syntax tables. Once successful, it will also be integrated into Solr and become the prototype for more difficult wiki-syntax analysis tasks.
2. I've started on some of the NLP tasks:
   a. a transducer from Devanagari scripts to IPA (in HFST);
   b. a transducer from English to IPA (also in HFST), the common goal being to index named entities by their sound in a language-agnostic fashion;
   c. extraction of phonetic data from the English Wiktionary;
   d. conversion of the CMU Pronouncing Dictionary to IPA;
   e. extraction of bilingual lexicons from Wiktionary and conversion to Apertium formats;
   f. unsupervised learning of morphologies using minimum description length;
   g. sentence boundary detection (SVM and MaxEnt models; see the sketch after this list);
   h. a topological text-alignment algorithm.
3. A Maven-based POM for building and packaging Solr plus our extensions for distributed use.
4. A repository for NLP artifacts built from wiki content.
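(On item 2g: as an illustration of MaxEnt-based sentence boundary detection, here is a minimal sketch using OpenNLP's pre-trained detector. OpenNLP and the stock en-sent.bin model file are my choices for the example; the SVM and MaxEnt models in the pipeline are separate work.)

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceBoundaryDemo {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained MaxEnt sentence model (stock OpenNLP English model).
            InputStream modelIn = new FileInputStream("en-sent.bin");
            try {
                SentenceModel model = new SentenceModel(modelIn);
                SentenceDetectorME detector = new SentenceDetectorME(model);
                // Abbreviations like "Dr." and "Jan." are the hard cases
                // a statistical model learns to handle.
                String[] sentences = detector.sentDetect(
                    "Dr. Smith arrived on Jan. 5. He left the next day.");
                for (String s : sentences) {
                    System.out.println(s);
                }
            } finally {
                modelIn.close();
            }
        }
    }

OpenNLP also supports training such models on one's own corpus, so per-language models could in principle be built from wiki dumps.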
Oren Bochman
MediaWiki Search Lead.