Hi,
[[de:Benutzer:Joma]] has written a fast search engine
...
Well, I am "joma", and after Magnus gave me a hint, I finally found this
newsgroup ;-)
I wrote a full-text search engine called "joda". Fortunately, it has been
running stably on our Wikipedia mirror at
http://lexikon.rhein-zeitung.de/ for some
weeks. Unfortunately, I have to change the name (which I find quite
pretty) because it is already in use at SourceForge ;-)
But whatever it ends up being called, I am willing to publish the source on
SourceForge, and I hope it will be useful for the Wikipedia project!
I wrote joda years ago as a full-text database for the online archives
of the local newspaper I work for. In this environment, joda
stores up to 80 million words for one year's volume of our paper.
This is approximately the same count as I expect for the English
Wikipedia and four times more than the German one. I therefore
think that it is sufficient for Wikipedia, for which I made some
improvements in the last few months. Most importantly, my program is now
able to update existing files (which means word lists in the joda context).
joda works as an add-on to the MySQL database. All it knows are
nearly all words in Wikipedia (nearly means: except [common] stop
words), their positions in the text, and the text to which they belong
(that is, the primary key of the table 'cur', the cur_id).
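To illustrate the idea, here is a minimal Python sketch of such a word list: a positional index that maps each word to the cur_id values and word positions where it occurs. This is not joda's actual code or file format. The stop word list, the tokenizer, and the names build_index and tokenize are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; joda's real list is not given in this post.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in"}

def tokenize(text):
    """Split text into lowercase words, so lookups are case-insensitive."""
    return re.findall(r"\w+", text.lower())

def build_index(articles):
    """Map word -> {cur_id: [positions]} for every non-stop word.

    Positions count all words, including skipped stop words, so that
    word distances stay meaningful.
    """
    index = defaultdict(lambda: defaultdict(list))
    for cur_id, text in articles.items():
        for position, word in enumerate(tokenize(text)):
            if word not in STOP_WORDS:
                index[word][cur_id].append(position)
    return index

# Two made-up articles keyed by an invented cur_id.
articles = {
    1: "Albert Einstein developed the theory of relativity",
    2: "The Einstein family moved; Albert was born in Ulm",
}
index = build_index(articles)
```

Looking up index["einstein"] then yields the articles and positions for that word, which is all a search needs before it goes back to MySQL for the article text.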
joda requests can quite easily be integrated into the SearchEngine.php
module of the MediaWiki software. I tested this in practice at
http://wikipedia.rhein-zeitung.de/ - use the "Suche" button, e.g. with
the query:
(Albert and.1 Einstein) and Quant* not Physik*
As you can see, joda can handle logical operators like AND, OR, NOT
and NEAR, word distance values (e.g. and.50 for the NEAR operator), and
parentheses for grouping the operators. The syntax parser tries to
optimize such complex requests using the expected number of hits
for each branch.
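As a rough illustration of how such operators can be evaluated on a positional word list, here is a small Python sketch. It is not joda's parser or its optimizer: the tiny INDEX, the helper names docs, near and and_all, and the rule "start an AND with the rarest word" are assumptions for the example. Wildcards such as Quant* are left out, and whether joda's distance operator cares about word order is not stated here, so the sketch ignores order.

```python
# Tiny hand-made index: word -> {cur_id: [word positions]}.
INDEX = {
    "albert":   {1: [0], 2: [4], 3: [0]},
    "einstein": {1: [1], 2: [1], 3: [7]},
    "quantum":  {1: [5], 3: [3]},
    "physics":  {3: [4]},
}

def docs(word):
    """Set of cur_ids that contain the word."""
    return set(INDEX.get(word, {}))

def near(left, right, distance):
    """Articles where left and right occur at most `distance` words apart."""
    hits = set()
    for doc in docs(left) & docs(right):
        if any(abs(p - q) <= distance
               for p in INDEX[left][doc] for q in INDEX[right][doc]):
            hits.add(doc)
    return hits

def and_all(*words):
    """AND over several words, starting with the one expected to hit least."""
    if not words:
        return set()
    ordered = sorted(words, key=lambda w: len(INDEX.get(w, {})))
    result = docs(ordered[0])
    for word in ordered[1:]:
        if not result:      # nothing left to intersect, stop early
            break
        result &= docs(word)
    return result

# Roughly: (albert and.1 einstein) and quantum not physics
result = (near("albert", "einstein", 1) & docs("quantum")) - docs("physics")
```

Starting with the smallest hit list keeps the intermediate result small, which is one simple way to use expected hit counts when ordering the branches of a request.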
There are four joda binaries: a command line program, a TCP-based
server, a standard C library (for which a colleague of mine and I wrote
import interfaces to Perl and Python), and a CGI program for read-only
access in a web environment.
For Wikipedia we can use the library version for a complete indexing run.
For that I wrote (and will also publish) a Perl script which archives the
whole content of the cur table (in our case only namespace 0). The joda
server version can be used in one or multiple processes (this will
require a kind of load balancing) and in one process for updating
changed articles in a master database.
I am not sure, but there may be one drawback: at the moment, joda
cannot work internally with Unicode. This is because the current
Free Pascal compiler version has no good Unicode support. In practice,
though, joda can convert UTF-8 into any of the ISO-8859 charsets while
archiving or retrieving. So the restriction only affects languages
which use more than 256 characters. When Free Pascal gets full Unicode
support (which is on their roadmap), I can extend joda for those languages.
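To make the restriction concrete, here is a short Python sketch of converting UTF-8 input into a single-byte ISO-8859 charset. It only illustrates the principle and is not how joda (written in Free Pascal) does it. The function name to_latin is made up, and replacing unmappable characters with "?" is an assumption, since how joda treats them is not described here.

```python
def to_latin(utf8_bytes, charset="iso-8859-1"):
    """Decode UTF-8 input and re-encode it in a single-byte ISO-8859 charset.

    Characters that do not exist in the target charset become '?'
    (an assumption for this sketch).
    """
    return utf8_bytes.decode("utf-8").encode(charset, errors="replace")

german = to_latin("Zürich".encode("utf-8"))                 # fits ISO-8859-1
greek = to_latin("αβ".encode("utf-8"), "iso-8859-7")        # fits ISO-8859-7
japanese = to_latin("東京".encode("utf-8"))                  # fits no ISO-8859 charset
```

German or Greek text survives because each fits into one 256-character table, while the Japanese example is lost, which is exactly the case of a language with more than 256 characters.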
So much for the moment. In the next few days I will publish joda on
SourceForge under the GPL. I'm afraid that I will have a lot to do with
documentation...
jo
P.S.: joda does not know anything about case. It is case-insensitive at
its core :-)