[[de:Benutzer:Joma]] has written a fast search engine for a Wikipedia mirror which can handle complex search queries. He emphasises that it's meant as a supplement to an existing SQL database, not a replacement. He is willing to put the code under GPL if we are interested.
I think this might be useful for us, could one of the German speaking developers please take a look at http://de.wikipedia.org/wiki/Wikipedia_Diskussion:Projekte%2C_die_Wikipedia_... and - if we want to give it a try - get in contact with Joma?
Kurt
I just wrote to him and told him we're interested. Apparently, his software is written in Free Pascal and scans a complete SQL dump, then generates an index (for de in ~30 min). It might have to be tweaked to support up-to-the-minute changes. Ah, the good ol' Pascal days... :-)
I suggested that he put the software up at SourceForge as its own project. If it lives up to its promise (of which I have little doubt), other projects might want to use it as well. A simple, free, fast full-text search with a powerful syntax like "(Gravitation* near (Newton or Cassini)) not Einstein" seems to be hard to find so far (pardon the pun).
Magnus
Kurt Jansson wrote:
[[de:Benutzer:Joma]] has written a fast search engine ...
Kurt Jansson wrote:
[[de:Benutzer:Joma]] has written a fast search engine ...
Even better, this software is already running in a production environment:
Just take a page at random:
http://rhein-zeitung.de/a/tt/t/rzo83007.html
and double-click on a random noun (capitalized), such as "Irak",
select "Lexikon",
and voilà:
http://lexikon.rhein-zeitung.de/?Irak
I consider this combination of a newspaper and Wikipedia great...
Hi,
[[de:Benutzer:Joma]] has written a fast search engine ...
Well, I am "joma", and after Magnus gave me a hint I finally found this list ;-)
I wrote a full-text engine called "joda". Fortunately, it has been running stably for our Wikipedia mirror at http://lexikon.rhein-zeitung.de/ for some weeks. Unfortunately, I have to change the name (which I find quite pretty) because it is already in use at SourceForge ;-)
However it ends up being named, I am willing to publish the source on SourceForge, and I hope it will be useful for the Wikipedia project!
I wrote joda years ago as a full-text database for the online archives of the local newspaper I work for. In that environment, joda stores up to 80 million words for one volume (year) of our paper. This is approximately the same count as I expect for the English Wikipedia, and four times more than the German one. I therefore think it is sufficient for Wikipedia, and I have made some improvements for it over the last few months. Most importantly, my program is now able to update existing files (which means word lists in the joda context).
joda works as an enhancement to the MySQL database. All it knows is nearly all the words in Wikipedia ("nearly" means: except [common] stop words), their positions in the text, and the text they belong to (that is, the primary key of the table 'cur', the cur_id).
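To make that concrete, here is a toy sketch in Python (purely illustrative; joda itself is written in Free Pascal, and its actual on-disk format is not described here) of an index that maps each word to the cur_ids it occurs in and its positions within each text:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "and"}  # common stop words are skipped

def build_index(pages):
    """pages: dict mapping cur_id -> article text.
    Returns word -> {cur_id: [positions]}, case-insensitive."""
    index = defaultdict(lambda: defaultdict(list))
    for cur_id, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][cur_id].append(pos)
    return index

index = build_index({1: "Albert Einstein developed relativity",
                     2: "Isaac Newton studied gravitation"})
print(dict(index["einstein"]))  # {1: [1]}: word 1 of article 1
```

Storing positions (not just cur_ids) is what makes distance operators like NEAR possible at query time.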
joda requests can quite easily be integrated into the SearchEngine.php module of the MediaWiki software. I have tested this in practice (http://wikipedia.rhein-zeitung.de/ - use the "Suche" button, e.g. for "(Albert and.1 Einstein)" or "Quant* not Physik*").
As you can see, joda can handle the logical word operators AND, OR, NOT and NEAR, word distance values (e.g. and.50 for the NEAR operator), and parentheses for grouping operators. The syntax parser tries to optimize such complex requests using the expected number of hits for each branch.
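As an illustration of how a distance operator like and.N can be evaluated from the stored word positions (a toy sketch, not joda's actual implementation):

```python
def near(positions_a, positions_b, max_dist):
    """True if some occurrence of word A lies within max_dist words
    of some occurrence of word B in the same text (the and.N idea)."""
    return any(abs(a - b) <= max_dist
               for a in positions_a for b in positions_b)

# "Albert and.1 Einstein": the two words must be adjacent
print(near([0], [1], 1))    # True: positions 0 and 1 are 1 word apart
print(near([0], [10], 1))   # False: 10 words apart
```

Evaluating the cheaper branch first (the one expected to match fewer texts) is what the optimization mentioned above buys: the expensive position comparison only runs on the surviving candidates.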
There are four joda binaries: a command-line program, a TCP-based server, a C standard library (for which a colleague of mine and I wrote import interfaces to Perl and Python), and a CGI program for read-only access in a web environment.
For Wikipedia we can use the library version for the initial overall indexing. For this purpose I wrote (and will publish) a Perl script which indexes the whole content of the cur table (in our case, only namespace 0). The joda server version can then be used by one or multiple processes (the latter would require some kind of load balancing), and by one process for updating articles that change in the master database.
I am not sure, but there may be one drop of bitterness: at the moment, joda is not able to work internally with Unicode, because the current Free Pascal compiler version does not have good Unicode support. In practice, however, joda can convert UTF-8 into all ISO-8859 charsets while archiving or retrieving, so the restriction pertains only to languages that use more than 256 characters. Once Free Pascal gets full Unicode support (which is on their roadmap), I can extend joda for those languages.
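The limitation can be shown in a couple of lines of Python (illustrative only; joda's conversion code is Pascal): UTF-8 text narrows to an 8-bit ISO-8859 charset only when every character has a slot in that charset, which is never the case for scripts with more than 256 characters.

```python
# German fits into ISO-8859-1, so each character maps to a single byte:
print("Gravitation und Wärme".encode("iso-8859-1"))

# CJK text cannot be squeezed into any single 8-bit ISO-8859 charset:
try:
    "维基百科".encode("iso-8859-1")
except UnicodeEncodeError:
    print("needs real Unicode support")
```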
So much for the moment. In the next few days I will publish joda on SourceForge under the GPL. I'm afraid I will have a lot to do with the documentation...
jo
P.S.: joda does not know anything about case. It is case-insensitive in its core :-)
Why not just use Lucene (http://jakarta.apache.org/lucene/docs/index.html) or one of its many ports?
It's mature, stable, open-source, actively developed, and widely considered a very fast, high-quality full-text search and indexing engine, started by an expert in full-text searching (http://www.nutch.org/blog/cutting.html, http://lucene.sourceforge.net/publications.html).
And it does Unicode just fine.
Krzysztof Kowalczyk | http://blog.kowalczyk.info
Krzysztof Kowalczyk wrote:
Why not just use Lucene (http://jakarta.apache.org/lucene/docs/index.html) or one of its many ports?
<voice type=disgusted>Java! &ugh;</voice> ;-)
Meanwhile I have applied for a SourceForge project named "ioda". After SF replies, I will upload the source and summary documentation.
Krzysztof: Earlier versions of joda have been running since 1996 as the full-text archive for our newspaper (and for other purposes). joda is not brand new (only the version improved for the Wikipedia environment is). I know nothing about Jakarta, but (like Magnus?) my experiences with Java apps are ... hmm ... not really good.
jo
Why not just use Lucene (http://jakarta.apache.org/lucene/docs/index.html) or one of its many ports?
<voice type=disgusted>Java! &ugh;</voice> ;-)
There is a C# .NET port used, e.g., in the very successful Lookout add-on for Outlook. There are two Python ports (one slow port in native Python, and another that is part of the Chandler project, which uses Python bindings to gcj-compiled Java code). There is also a C++ port.
"I don't like java" or "I have bad experience with java projects" are non-arguments. Do some research on technology before stating your objections to it.
Krzysztof Kowalczyk | http://blog.kowalczyk.info
Meanwhile I have started the SourceForge project for joda (now renamed to ioda).
See details at http://ioda.sourceforge.net/ and https://sourceforge.net/projects/ioda/.
At the moment the interface documentation is insufficient - I know about it! I will write some more docs in the next few days...
There is one source package made especially for a test integration with MediaWiki: see "ioda_integration_in_mediawiki_example.tar.bz2" in the project files. It contains a Perl importer (cur to joda), a Joda.php file for joda requests from PHP, and a patched SearchEngine.php (from version 1.2.6).
jo
Meanwhile I have completed the work on the ioda full-text engine. It is available from http://ioda.sourceforge.net/, including tools to index the cur table from Wikipedia's MySQL database.
A complete ready-to-run example, using a fully indexed de-wikipedia from Sept 8, 2004, and including the binaries, is available for download from http://magnus.de/wikipedia/wikidemo.tar.bz2 (91 MB bz2). Unpack it and run it for a performance test.
Cheers
jo