Hi,
[[de:Benutzer:Joma]] has written a fast search engine ...
Well, I am "joma", and after Magnus gave me a hint I finally found this newsgroup ;-)
I wrote a full-text engine called "joda". Fortunately, it has been running stably on our Wikipedia mirror at http://lexikon.rhein-zeitung.de/ for some weeks. Unfortunately, I have to change the name (which I find quite pretty) because it is already in use at sourceforge ;-)
But whatever it ends up being called, I am willing to publish the source on sourceforge, and I hope it will be useful for the Wikipedia project!
I wrote joda years ago as a full-text database for the online archives of the local newspaper I work for. In this environment, joda stores up to 80 million words for one annual volume of our paper. This is approximately the same count I expect for the English Wikipedia and four times more than the German one. Therefore I think it is sufficient for Wikipedia, for which I have made some improvements in the last few months. Above all, my program is now able to update existing files (which means word lists in the joda context).
joda works as an add-on to the MySQL database. All it knows is nearly every word in Wikipedia (nearly means: except [common] stop words), the positions of those words in the text, and the text they belong to (that is, the primary key of the table 'cur', the cur_id).
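To illustrate the idea of such a word list, here is a minimal Python sketch. It is not joda's actual code or file format; the stop word list, the function name build_index and the two sample articles are made up for the example. It only shows the three things mentioned above: the word, its positions, and the cur_id of the text.

```python
from collections import defaultdict

# Hypothetical stop word list; joda's real list is not given in this post.
STOP_WORDS = {"the", "a", "an", "and", "of", "in"}

def build_index(articles):
    """Map each word to {cur_id: [word positions]}, skipping stop words."""
    index = defaultdict(dict)
    for cur_id, text in articles.items():
        # Lowercasing makes the index case-insensitive.
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(cur_id, []).append(position)
    return index

# Made-up sample data standing in for rows of the 'cur' table.
articles = {
    1: "Albert Einstein and the quantum theory",
    2: "The theory of Albert",
}
index = build_index(articles)
print(index["albert"])
```

Positions are counted before stop words are dropped, so word distances in the original text stay intact.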
joda requests can quite easily be integrated into the SearchEngine.php module of the MediaWiki software. I tested this in practice at http://wikipedia.rhein-zeitung.de/ (use the "Suche" button, e.g. for: (Albert and.1 Einstein) and Quant* not Physik*).
As you can see, joda can handle logical word operators like AND, OR, NOT and NEAR, word distance values (e.g. and.50 for the NEAR operator) and parentheses for grouping the operators. The syntax parser tries to optimize such complex requests based on the expected number of hits for each branch.
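As a rough illustration of how a NEAR operator and an ordering by expected hits could work, here is a small Python sketch. It is not joda's parser or optimizer. The function names near and and_all, the sample posting lists, and the reading of and.N as "at most N words apart" are assumptions made for the example.

```python
def near(postings_a, postings_b, max_distance):
    """Return the cur_ids in which both words occur at most max_distance words apart.

    Each postings argument maps cur_id -> list of word positions.
    """
    hits = set()
    # Walk the shorter posting list and look each article up in the longer one.
    small, large = sorted((postings_a, postings_b), key=len)
    for cur_id, positions in small.items():
        other = large.get(cur_id)
        if other is None:
            continue
        if any(abs(p - q) <= max_distance for p in positions for q in other):
            hits.add(cur_id)
    return hits

def and_all(posting_lists):
    """AND several words (non-empty list), rarest word first to keep intermediate results small."""
    ordered = sorted(posting_lists, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
        if not result:
            break  # nothing left to intersect
    return result

# Made-up posting lists: cur_id -> positions.
albert = {1: [0], 2: [3], 3: [10]}
einstein = {1: [1], 3: [40]}
quantum = {1: [4]}

print(near(albert, einstein, 1))
print(near(albert, einstein, 50))
print(and_all([albert, einstein, quantum]))
```

Here the number of articles per word stands in for the "expected number of hits"; a real optimizer would apply the same idea to whole branches of the query tree.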
There are four joda binaries: a command-line program, a TCP-based server, a C-compatible library for which a colleague of mine and I wrote import interfaces to Perl and Python, and a CGI program for read-only access in a web environment.
For Wikipedia we can use the library version for a full indexing run. For this I wrote (and will publish, too) a Perl script which archives the whole content of the cur table (in our case only namespace 0). The joda server version can run in one or multiple processes (the latter will require some kind of load balancing) and in one process for updating changed articles in a master database.
I am not sure, but there may be one drop of bitterness: at the moment, joda is not able to work with Unicode internally. This is because the current Free Pascal compiler version has no good Unicode support. In practice, though, joda can convert UTF-8 into any of the ISO-8859 charsets while archiving or retrieving. So the restriction only affects languages that use more than 256 characters. When Free Pascal gets full Unicode support (which is on their roadmap), I can extend joda for those languages.
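To show what this restriction means in practice, here is a short Python sketch of such a conversion. joda itself is written in Free Pascal and does this internally; the function name utf8_to_latin and the choice of '?' for characters that do not fit are assumptions for the example.

```python
def utf8_to_latin(raw, charset="iso-8859-1"):
    """Decode UTF-8 bytes and re-encode them in a single-byte ISO-8859 charset.

    Characters the target charset cannot represent are replaced by '?'.
    """
    return raw.decode("utf-8").encode(charset, errors="replace")

german = "Übergröße".encode("utf-8")    # fits into ISO-8859-1
japanese = "日本".encode("utf-8")        # does not fit into any ISO-8859 charset

print(utf8_to_latin(german))
print(utf8_to_latin(japanese))
```

German text survives the round trip with one byte per character, while the Japanese characters are lost, which is exactly the case joda cannot handle yet.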
So much for the moment. In the next few days I will publish joda on sourceforge under the GPL. I'm afraid that I will have a lot to do with documentation...
jo
P.S.: joda does not know anything about case. It is case-insensitive at its core :-)