Nowadays full-text search is too heavy for the Wikimedia servers. I think a text search limited to page titles could be very useful, especially for readers: it would be a fairly simple modification to the existing code and much lighter than a complete full-text search.
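For illustration, a minimal sketch of what such a title-only index could look like (here using the Lucene 1.4 API, which comes up below; field names and paths are made up, not the existing MediaWiki code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;

    public class TitleSearch {
        public static void main(String[] args) throws Exception {
            // Index only the page titles -- a small fraction of the data
            // a full-text index has to handle.
            IndexWriter writer = new IndexWriter("/tmp/title-index",
                    new StandardAnalyzer(), true);
            String[] titles = { "Albert Einstein", "Alfred Nobel", "Quantum mechanics" };
            for (int i = 0; i < titles.length; i++) {
                Document doc = new Document();
                doc.add(Field.Text("title", titles[i])); // tokenized and stored
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();

            // A prefix query gives cheap "title starts with..." lookups
            // (the analyzer lowercases terms, so query in lowercase).
            IndexSearcher searcher = new IndexSearcher("/tmp/title-index");
            Hits hits = searcher.search(new PrefixQuery(new Term("title", "alb")));
            for (int i = 0; i < hits.length(); i++)
                System.out.println(hits.doc(i).get("title"));
            searcher.close();
        }
    }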
en.wikipedia:User:Sbisolo it.wikipedia:Utente:Sbisolo
On Tuesday 29 March 2005 21:16, Minty wrote:
anyone playing with http://nutch.org/ ?
there is a working prototype of a search engine using lucene (the framework nutch is built on).
however, there are some political difficulties with using this - it's based on java, and java is not free.
a number of people seem to value the fact that wikipedia runs completely on open source software higher than a working fulltext search.
one _could_ try to port that code to mono and nlucene..
daniel
On Tue, 29 Mar 2005 22:16:23 +0200, Daniel Wunsch the.gray@gmx.net wrote:
however, there are some political difficulties with using this - it's based on java, and java is not free.
Fair point. What about Plucene, the perl port of Lucene?
It is not nearly as mature, stable or feature rich as Lucene, but hey.
I plan to be playing with Plucene a bit over the next couple of months: one initial avenue of interest is some rough and ready benchmarks on speed/resource requirements. I was planning to use a local copy of the wikimedia text as a corpus for this testing.
What I don't want to do is duplicate any existing work...
Minty wrote in gmane.science.linguistics.wikipedia.technical:
I plan to be playing with Plucene a bit over the next couple of months: one initial avenue of interest is some rough and ready benchmarks on speed/resource requirements. I was planning to use a local copy of the wikimedia text as a corpus for this testing.
What I don't want to do is duplicate any existing work...
look at the "lucene-search" module in CVS (http://cvs.sourceforge.net/viewcvs.py/wikipedia/lucene-search/). this is a (mostly) complete and functional Lucene (Java version) based search server for MediaWiki. i'm not sure how similar the Java version is to other versions, but you may be able to port the relevant bits without too much effort.
an experimental test of this on the live site showed that it was able to handle our search load on a single 3.0GHz P4 with very minimal CPU usage, as long as the typo suggestion feature isn't enabled (because that uses several slow searches to produce the result; it could almost certainly be reimplemented in a much more efficient manner).
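one well-known way to make the suggestions cheap is an n-gram dictionary: index the character trigrams of every word once at index time, so a suggestion becomes a single query against a small auxiliary index. a rough sketch against the Lucene 1.4 API (hypothetical field names; this is not the actual lucene-search code):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class NGramSuggest {
        // break a word into character trigrams: "einstein" -> ein, ins, nst, ...
        static String[] grams(String word) {
            if (word.length() < 3) return new String[] { word };
            String[] g = new String[word.length() - 2];
            for (int i = 0; i < g.length; i++)
                g[i] = word.substring(i, i + 3);
            return g;
        }

        public static void main(String[] args) throws Exception {
            // build the dictionary index once, from the words in the main index
            IndexWriter writer = new IndexWriter("/tmp/spell",
                    new WhitespaceAnalyzer(), true);
            String[] dictionary = { "einstein", "eingang", "russell" };
            for (int w = 0; w < dictionary.length; w++) {
                Document doc = new Document();
                doc.add(Field.Keyword("word", dictionary[w]));
                String[] g = grams(dictionary[w]);
                for (int i = 0; i < g.length; i++)
                    doc.add(Field.Keyword("gram", g[i]));
                writer.addDocument(doc);
            }
            writer.close();

            // at query time, one OR-query over the misspelling's trigrams;
            // the dictionary words sharing the most trigrams score highest
            IndexSearcher searcher = new IndexSearcher("/tmp/spell");
            BooleanQuery q = new BooleanQuery();
            String[] g = grams("einstien"); // the typo
            for (int i = 0; i < g.length; i++)
                q.add(new TermQuery(new Term("gram", g[i])), false, false);
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length() && i < 3; i++)
                System.out.println(hits.doc(i).get("word") + " " + hits.score(i));
            searcher.close();
        }
    }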
kate.
Kate Turner wrote:
Minty wrote in gmane.science.linguistics.wikipedia.technical:
[ I copied text about wikipedia/lucene-search onto http://meta.wikipedia.org/wiki/FulltextSearchEngines ]
On Tue, 29 Mar 2005 20:16:03 +0100, Minty mintywalker@gmail.com wrote:
anyone playing with http://nutch.org/ ?
Actually, I think the more generic Lucene library which Nutch is built upon will be more useful. We should be indexing the wikitext, not the HTML (which is a lower quality version ;))
Seriously, we also don't want a crawler. What is left in Nutch's favour?
However, I don't imagine either will be used by Wikimedia, as they are written in Java, which is slow and takes up too much memory compared to natively running code (i.e. C or C++). It's already bad enough that we're using PHP! In one extreme case, a diff took 45.5 seconds in PHP while the same algorithm took 0.5 seconds in C (or maybe C++); this is from a developer.
On Tuesday 29 March 2005 22:18, Tomer Chachamu wrote:
However, I don't imagine either will be used by Wikimedia, as they are written in Java, which is slow and takes up too much memory compared to natively running code (i.e. C or C++). It's already bad enough that we're using PHP! In one extreme case, a diff took 45.5 seconds in PHP while the same algorithm took 0.5 seconds in C (or maybe C++); this is from a developer.
java cannot beat raw C-code (except sometimes) but it sure is an order of magnitude faster than PHP. a java VM does need a lot of memory to run fast, but you need that anyway if you want reasonable search performance.
this is from a developer, too ;)
daniel
P.S. mono is said to be quite slow up to now :/
Actually, I think the more generic Lucene library which Nutch is built upon will be more useful. We should be indexing the wikitext, not the HTML (which is a lower quality version ;))
This is the only open issue when you plan to use Lucene: you need a good parser for the wiki syntax, and that is very difficult.
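To make the difficulty concrete, here is a deliberately crude illustration: a handful of regexes stripping the most common wiki markup before indexing. Exactly this kind of approximation falls apart on templates, tables and nested markup, which is why a good parser is hard:

    public class WikitextStrip {
        // a deliberately naive markup stripper -- NOT a real wikitext parser
        public static String strip(String s) {
            s = s.replaceAll("\\{\\{[^{}]*\\}\\}", " ");                   // {{templates}}, non-nested only!
            s = s.replaceAll("\\[\\[([^|\\]]*\\|)?([^\\]]*)\\]\\]", "$2"); // [[target|label]] -> label
            s = s.replaceAll("'{2,}", "");                                 // ''italic'' / '''bold'''
            s = s.replaceAll("(?m)^=+\\s*(.*?)\\s*=+$", "$1");             // == headings ==
            s = s.replaceAll("<[^>]+>", " ");                              // crude HTML/ref tag removal
            return s;
        }

        public static void main(String[] args) {
            System.out.println(strip(
                "== Physics ==\n'''[[Albert Einstein|Einstein]]''' developed {{em|relativity}}."));
        }
    }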
Seriously, we also don't want a crawler. What is left in Nutch's favour?
Nothing! Use Lucene - trust me. :-) It will definitely save Wikipedia a great deal of load!
On Tue, 2005-03-29 at 23:09 +0200, Stefan Groschupf wrote:
This is the only open issue when you plan to use lucene, you need a good parser for the syntax and this is very difficult.
MediaWiki already saves pre-parsed text at article save time. It would be no problem to save this into a place where something like Lucene could get to it. But even if the search engine does have to do some of its own parsing, that's not such a big deal.
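A sketch of the update path this suggests, against the Lucene 1.4 API (where deletes go through IndexReader and adds through IndexWriter); the field names are illustrative, not MediaWiki's actual schema:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class IndexUpdater {
        static final String INDEX = "/var/search/index";

        // called whenever an article is saved
        public static void onArticleSave(String title, String parsedText) throws Exception {
            // 1. delete any stale document for this page, keyed on an
            //    untokenized title field so the match is exact
            IndexReader reader = IndexReader.open(INDEX);
            reader.delete(new Term("cur_title", title));
            reader.close();

            // 2. add the freshly saved, pre-parsed text
            IndexWriter writer = new IndexWriter(INDEX, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("cur_title", title));  // stored, not tokenized
            doc.add(Field.UnStored("text", parsedText)); // tokenized, index only
            writer.addDocument(doc);
            writer.close();
        }
    }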
If you want to participate in the discussion about a new full-text search engine, please consider visiting the new page http://meta.wikipedia.org/wiki/FulltextSearchEngines.
The page and its talk page are awaiting your input.
Tom
Instead of Lucene (or clones of it), we could consider using JODA [1]. Jochen wrote JODA especially for Wikipedia and MediaWiki purposes and has announced it several times on this mailing list.
His indexer can be seen live at the Neue-Ruhr-Zeitung as an indexer for Wikipedia data; see [2].
I also propose setting up a meta-wiki page, http://meta.wikipedia.org/wiki/FulltextSearchEngines, to discuss all aspects and variants separately from this mailing list, to keep the list messages "KISS", i.e. short and simple.
Tom
[1] http://sourceforge.net/projects/ioda/
[2] http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page demonstrates only the indexer and is not intended as a mirror of Wikipedia)
Thank you, Thomas,
older versions of Joda have been working since 1996 as the newspaper archive of the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into action as the full-text index for Europe's biggest magazine. It is also in use for the public index of the state archive of Rheinland-Pfalz (Germany).
Last year I created two mirrors of Wikipedia, one using MediaWiki for demonstration purposes and another - our public one - using our own read-only web frontend. Joda is integrated into both mirrors:
http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)
At the suggestion of Magnus Manske (no relation :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table, either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both on a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).
Joda can erase or update entries on the fly and can handle queries with parentheses and word-distance operators, like http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/
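For comparison, Lucene (1.4) can express a similar "(Albert OR Alfred) near Einstein" constraint with its span queries - this is a Lucene sketch, not Joda's API:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class ProximityExample {
        public static SpanQuery albertOrAlfredNearEinstein() {
            // Albert OR Alfred
            SpanQuery firstName = new SpanOrQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "albert")),
                new SpanTermQuery(new Term("text", "alfred"))
            });
            // within 1 position of "einstein", in order (roughly AND.1)
            return new SpanNearQuery(new SpanQuery[] {
                firstName, new SpanTermQuery(new Term("text", "einstein"))
            }, 1, true);
        }
    }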
The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command-line program, a TCP-socket-driven server and a CGI.
Yours
jo
anyone playing with http://nutch.org/ ?
Well, Nutch makes no sense for you guys, since it uses a web crawler you don't need: you can simply index your database content. Lucene is definitely what you need!
If you one day decide to use Java, let me know; I can contribute a search engine for you. Anyway, we never really got the wiki syntax parser done (we tried regex, neko and javacc), but the results were all too bad and slow. Indexing the en.wikipedia content takes 4 hours on a dual OS X machine. My prototype search engine is able to answer more than 10 queries per second with less than 50% CPU usage on a dual G5.
So, discuss the java issue and let me know. :-)
Stefan
Stefan Groschupf wrote:
Well, Nutch makes no sense for you guys, since it uses a web crawler you don't need: you can simply index your database content. Lucene is definitely what you need!
If you one day decide to use Java, let me know; I can contribute a search engine for you. Anyway, we never really got the wiki syntax parser done (we tried regex, neko and javacc), but the results were all too bad and slow. Indexing the en.wikipedia content takes 4 hours on a dual OS X machine. My prototype search engine is able to answer more than 10 queries per second with less than 50% CPU usage on a dual G5.
So, discuss the java issue and let me know. :-)
How does Google do it? :P
Maybe we'll just have to deal with slow index times as a tradeoff for a fast, working (albeit slightly out-of-date) search.
-- Edward Z. Yang
Why don't we use pylucene?
http://pylucene.osafoundation.org/
This was recommended to me by Mitch Kapor when I met him in San Francisco. It has the benefits of Lucene, but since it runs a GCJ-compiled version of the Lucene engine, it doesn't have the dependency on the non-free Sun stuff.
--Jimbo
On Tue, 2005-03-29 at 13:02 -0800, Jimmy (Jimbo) Wales wrote:
Why don't we use pylucene?
It should be pretty simple to use something like this. Now that we have several servers, we can dedicate one or more to text searches only (perhaps "search.wikimedia.org" on round-robin DNS or something) and not have any impact on the wiki servers at all. I'll experiment a bit and let you know.
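Roughly what such a dedicated search box could run - a tiny daemon that owns the Lucene index and answers one query per line over TCP, keeping all search load off the wiki servers (the port and field names are made up; "cur_title" matches the earlier sketch):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchDaemon {
        public static void main(String[] args) throws Exception {
            // one long-lived searcher over the shared index
            IndexSearcher searcher = new IndexSearcher("/var/search/index");
            ServerSocket server = new ServerSocket(8123);
            while (true) {
                Socket client = server.accept();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream()));
                PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                String line = in.readLine();
                if (line != null) {
                    try {
                        Query q = QueryParser.parse(line, "text", new StandardAnalyzer());
                        Hits hits = searcher.search(q);
                        // return the top 20 titles with their scores
                        for (int i = 0; i < hits.length() && i < 20; i++)
                            out.println(hits.score(i) + "\t" + hits.doc(i).get("cur_title"));
                    } catch (Exception e) {
                        out.println("ERROR " + e.getMessage());
                    }
                }
                client.close();
            }
        }
    }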
On Tue, 29 Mar 2005 20:09:17 -0800, Lee Daniel Crocker lee@piclab.com wrote:
It should be pretty simple to use something like this. Now that we have several servers, we can dedicate one or more to text searches
Also, you can use one or two servers to build the index, then mirror the index files out to multiple query-handling servers.
I don't know where things got to wrt Google donating servers, but:
Would they be able to donate one of their Search-Appliance thingies? Even if as a short term measure?
+ It'd allow for daily updates
+ It'd mean no coding/dev work for WP
+ No extra software to install on your servers, no extra hardware to look after
+ Look and feel could be customised, and I imagine it could even live under its own wikipedia subdomain

- Would involve Google crawling the site again
- Might not deal well with older revisions / separation of meta pages etc.
If there is a philosophical objection to the "non free" Java, I can at least see the potential for concern about doing this, but equally, if we are getting hardware from them for free anyway, it seems logical to get search-customised hardware+service rather than just pure hosting hardware.
And in any event, there is no "lock in" if you ever wanted to revert to a different implementation.
Those things aren't cheap (the Minis run five grand, and the Google Search Appliance costs tens of thousands), but they do have the ability to index MySQL databases [1].
Interestingly they have special pricing for nonprofits[2]
Anyway, I'm surprised, with this whole Google shenanigan, that no one has mentioned that Google runs proprietary software - free as in "it used to be free until we modified it and only released an insignificantly small portion of it" [3].
[1] http://www.google.com/enterprise/gsa/features.html
[2] http://www.google.com/support/gsa/bin/answer.py?answer=16213&topic=-1
[3] http://code.google.com/
On Wed, 30 Mar 2005 08:53:31 +0100, Minty mintywalker@gmail.com wrote:
I don't know where things got to wrt Google donating servers, but:
Would they be able to donate one of their Search-Appliance thingies? Even if as a short term measure?
+ It'd allow for daily updates
+ It'd mean no coding/dev work for WP
+ No extra software to install on your servers, no extra hardware to look after
+ Look and feel could be customised, and I imagine it could even live under its own wikipedia subdomain

- Would involve Google crawling the site again
- Might not deal well with older revisions / separation of meta pages etc.
If there is a philosophical objection to the "non free" Java, I can at least see the potential for concern about doing this, but equally, if we are getting hardware from them for free anyway, it seems logical to get search-customised hardware+service rather than just pure hosting hardware.
And in any event, there is no "lock in" if you ever wanted to revert to a different implementation.
Minty wrote in gmane.science.linguistics.wikipedia.technical:
Would they be able to donate one of their Search-Appliance thingies? Even if as a short term measure?
i've thought about this before. i didn't look at the appliances too closely, but my impression was that they could only index a relatively small number of documents (about 10,000). is this wrong?
other than that, this would probably be the easiest way to solve the problem for us; OTOH it's not much use to other MediaWiki users.
[...]
If there is a philosophical objection to the "non free" Java, I can at least see the potential for concern about doing this,
the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices.
kate.
Kate Turner keturner@livejournal.com wrote:
i've thought about this before. i didn't look at the appliances too closely, but my impression was that they could only index a relatively small number of documents (about 10,000). is this wrong?
http://www.google.com/enterprise/ "Searches up to 15 million documents"