[Wikitech-l] Re: better search engine for MediaWiki?

3 Sep 2004

Hi,

...
  [[de:Benutzer:Joma]] has written a fast search engine
... 
Well, I am "joma", and after Magnus gave me a hint, I finally found this 
newsgroup ;-)

I wrote a full-text search engine called "joda". Fortunately, it has been 
running stably on our Wikipedia mirror at http://lexikon.rhein-zeitung.de/ 
for some weeks. Unfortunately, I have to change the name (which I find 
quite pretty) because it is already in use at SourceForge ;-)

Whatever it ends up being called, I am willing to publish the source on 
SourceForge, and I hope it will be useful for the Wikipedia project!

I wrote joda years ago as a full-text database for the online archives 
of the local newspaper I work for. In this environment, joda stores up 
to 80 million words for one year's volume of our paper. This is 
approximately the same count I expect for the English Wikipedia and 
four times more than the German one. I therefore think it is sufficient 
for Wikipedia, for which I have made some improvements over the last few 
months. Most importantly, my program is now able to update existing 
files (which means word lists in the joda context).

joda works as an add-on to the MySQL database. All it knows are nearly 
all the words in Wikipedia (nearly means: all except [common] stop 
words), their positions in the text, and the text they belong to (that 
is, the primary key of the table 'cur', the cur_id).
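
To illustrate the data model described above, here is a minimal Python 
sketch of a positional word index. It is not joda's actual code or file 
format; the names build_index and STOP_WORDS, the stop word list and the 
sample rows are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; joda's real list is not shown in this post.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def build_index(articles):
    """Map each word to a list of (cur_id, position) pairs.

    `articles` is an iterable of (cur_id, text) rows standing in for the
    'cur' table. Words are lowercased; stop words are not stored, but
    they still advance the position counter.
    """
    index = defaultdict(list)
    for cur_id, text in articles:
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            if word not in STOP_WORDS:
                index[word].append((cur_id, position))
    return index

rows = [(1, "Albert Einstein is the father of relativity"),
        (2, "Quantum physics and Einstein")]
index = build_index(rows)
print(index["einstein"])  # [(1, 1), (2, 3)]
```

Keeping the position next to the cur_id is what makes distance queries 
possible later on; an index of cur_ids alone could only answer AND, OR 
and NOT.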

joda requests can quite easily be integrated into the SearchEngine.php 
module of the MediaWiki software. I have tested this in practice at 
http://wikipedia.rhein-zeitung.de/ (use the "Suche" button, e.g. with 
the query: (Albert and.1 Einstein) and Quant* not Physik*).

As you can see, joda can handle logical word operators such as AND, OR, 
NOT and NEAR, word distance values (e.g. and.50 for the NEAR operator) 
and parentheses for grouping the operators. The syntax parser tries to 
optimize such complex requests based on the expected number of hits for 
each branch.
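
As a rough illustration of the NEAR operator and of ordering branches by 
hit count, here is a small Python sketch. It assumes that and.N means 
"within N word positions, in either order", and it uses the actual size 
of each result set as a stand-in for the expected number of hits. The 
function names near and and_all are hypothetical, and this is not how 
joda's parser is implemented.

```python
def near(postings_a, postings_b, max_distance):
    """Return the cur_ids in which a posting from A lies within
    max_distance word positions of a posting from B.

    Both arguments are lists of (cur_id, position) pairs.
    """
    # Group the positions of B by article for quick lookup.
    positions_b = {}
    for cur_id, pos in postings_b:
        positions_b.setdefault(cur_id, []).append(pos)
    hits = set()
    for cur_id, pos in postings_a:
        if any(abs(pos - other) <= max_distance
               for other in positions_b.get(cur_id, ())):
            hits.add(cur_id)
    return hits

def and_all(hit_sets):
    """AND several result sets, smallest first, so that every later
    intersection works on as few candidates as possible."""
    if not hit_sets:
        return set()
    ordered = sorted(hit_sets, key=len)
    result = set(ordered[0])
    for hits in ordered[1:]:
        result &= hits
        if not result:
            break  # nothing can match any more
    return result

albert = [(1, 0), (3, 5)]
einstein = [(1, 1), (2, 3), (3, 40)]
print(near(albert, einstein, 1))   # {1}
print(near(albert, einstein, 50))  # {1, 3}
print(and_all([{1, 2, 3}, {1, 3}, {3}]))  # {3}
```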

There are four joda binaries: a command-line program, a TCP-based 
server, a library with a standard C interface, for which a colleague of 
mine and I wrote bindings for Perl and Python, and a CGI program for 
read-only access in a web environment.

For Wikipedia we can use the library version for full indexing. For this 
I wrote (and will publish too) a Perl script which archives the whole 
content of the cur table (in our case only namespace 0). The joda server 
version can be used in one or multiple processes (this will require some 
kind of load balancing) and in one process for updating changed articles 
in a master database.

I am not sure, but there may be one drop of bitterness: at present, joda 
is not able to work with Unicode internally. This is because the current 
Free Pascal compiler version has no good Unicode support. In practice, 
however, joda can convert UTF-8 into any of the ISO-8859 charsets while 
archiving or retrieving. So the restriction applies only to languages 
that use more than 256 characters. When Free Pascal gets full Unicode 
support (which is on their roadmap), I can extend joda for those languages.
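
To show what this restriction means in practice, here is a short Python 
sketch of converting UTF-8 into a single-byte ISO-8859 charset. It only 
demonstrates the general technique using Python's standard codecs; the 
helper name to_latin is hypothetical, and replacing unconvertible 
characters with '?' is an assumption, not necessarily what joda does.

```python
def to_latin(utf8_bytes, charset="iso-8859-1"):
    """Decode UTF-8 input and re-encode it in a single-byte charset.

    Characters that the target charset cannot represent become '?'.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(charset, errors="replace")

german = "Müller".encode("utf-8")
print(len(german), len(to_latin(german)))  # 7 6

japanese = "日本".encode("utf-8")
print(to_latin(japanese))  # b'??'
```

German text fits into ISO-8859-1 with one byte per character, while the 
Japanese characters have no single-byte equivalent and are lost.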

So much for the moment. In the next few days I will publish joda on 
SourceForge under the GPL. I'm afraid that I will have a lot to do with 
documentation...

jo

P.S.: joda does not know anything about case. It is case-insensitive at 
its core :-)
