Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
Robert Stojnic wrote:
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages.
Looking good! :)
-- brion vibber (brion @ wikimedia.org)
Robert Stojnic wrote:
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
http://en.wikipedia.org/wiki/Commodity_%28Marxism%29
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
P.S. About searching in categories: note that subcategories are not expanded; this needs to be done on the client side (i.e. in the MediaWiki extension).
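To illustrate the idea of preferring original words over their stems, here is a sketch only (not the actual lucene-search code; the field names and boost values are made up): search both an unstemmed field and a stemmed field, and give the exact match a larger boost.

    // Sketch only -- not the actual lucene-search code. Field names and the
    // boost values are made up. The idea: search both an unstemmed field and
    // a stemmed field, and give the exact (unstemmed) match a larger boost.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class ExactVsStemmedQuery {
        public static BooleanQuery build(String word, String stem) {
            BooleanQuery q = new BooleanQuery();

            // Exact form as typed by the user, searched in the unstemmed field.
            TermQuery exact = new TermQuery(new Term("contents_exact", word));
            exact.setBoost(4.0f);   // hypothetical boost value

            // Stemmed form still matches, but contributes a smaller score.
            TermQuery stemmed = new TermQuery(new Term("contents_stemmed", stem));
            stemmed.setBoost(1.0f);

            q.add(exact, BooleanClause.Occur.SHOULD);
            q.add(stemmed, BooleanClause.Occur.SHOULD);
            return q;
        }
    }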
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
P.S. About searching in categories: note that subcategories are not expanded; this needs to be done on the client side (i.e. in the MediaWiki extension).
So you have to search each one separately?
[*] http://bugzilla.wikimedia.org/show_bug.cgi?id=2511
Mohamed Magdy wrote:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
Roughly speaking, stemming is the process of taking inflected forms of words ("category" -> "categories") and extracting a normalized root form (say, "categori") for comparison purposes. This allows you to search on one form and receive results containing the other.
The exact code to do this will vary depending on language. A number of preexisting stemming filters exist for Lucene's indexing engine, some of which are used here.
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
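Roughly, here is what a stemming analyzer does to the token stream; a minimal sketch, assuming Lucene 2.x and the contrib SnowballAnalyzer (it uses the old TokenStream.next()/termText() API):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class StemDemo {
        public static void main(String[] args) throws Exception {
            SnowballAnalyzer analyzer = new SnowballAnalyzer("English");
            TokenStream ts = analyzer.tokenStream("contents",
                    new StringReader("category categories"));
            Token t;
            while ((t = ts.next()) != null) {
                // Both inflected forms come out as the same root, "categori",
                // which is why a search on one form also matches the other.
                System.out.println(t.termText());
            }
        }
    }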
-- brion vibber (brion @ wikimedia.org)
Brion Vibber said:
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
I guess Wikipedia is using SnowballAnalyzer rather than StandardAnalyzer? To my knowledge, the latter does not use any stemmer.
Regards, /Mike "b6s" Jiang/
Brion Vibber wrote:
Mohamed Magdy wrote:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
Roughly speaking, stemming is the process of taking inflected forms of words ("category" -> "categories") and extracting a normalized root form (say, "categori") for comparison purposes. This allows you to search on one form and receive results containing the other.
Thanks for explaining!
The exact code to do this will vary depending on language. A number of preexisting stemming filters exist for Lucene's indexing engine, some of which are used here.
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
That raises the question: when and how will other languages get stemming as well?
Sorry if this is a bit annoying, but what is the difference between basic and more advanced stemming? Or will that be added?
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
-- brion vibber (brion @ wikimedia.org)
That raises the question: when and how will other languages get stemming as well?
The new search engine has stemmers for these languages: English, Danish, Dutch, Finnish, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and Esperanto. It also has a filter for Thai, so that words are properly separated.
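For illustration only (this is not the actual lucene-search configuration; the language codes and the fallback are placeholders), picking a per-language Snowball stemmer could look roughly like this. Esperanto stemming and Thai word segmentation need their own filters and are not shown:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class LanguageAnalyzers {
        public static Analyzer forLanguage(String code) {
            if (code.equals("en")) return new SnowballAnalyzer("English");
            if (code.equals("da")) return new SnowballAnalyzer("Danish");
            if (code.equals("nl")) return new SnowballAnalyzer("Dutch");
            if (code.equals("fi")) return new SnowballAnalyzer("Finnish");
            if (code.equals("de")) return new SnowballAnalyzer("German");
            if (code.equals("it")) return new SnowballAnalyzer("Italian");
            if (code.equals("no")) return new SnowballAnalyzer("Norwegian");
            if (code.equals("pt")) return new SnowballAnalyzer("Portuguese");
            if (code.equals("ru")) return new SnowballAnalyzer("Russian");
            if (code.equals("es")) return new SnowballAnalyzer("Spanish");
            if (code.equals("sv")) return new SnowballAnalyzer("Swedish");
            // Esperanto stemming and Thai word segmentation use custom filters
            // in the real engine; fall back to a plain analyzer in this sketch.
            return new SimpleAnalyzer();
        }
    }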
r.
Robert Stojnic said:
That raises the question: when and how will other languages get stemming as well?
The new search engine has stemmers for these languages: English, Danish, Dutch, Finnish, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and Esperanto. It also has a filter for Thai, so that words are properly separated.
r.
For example: https://svn.apache.org/repos/asf/lucene/java/trunk/contrib/snowball/src/java...
    private Among a_1[] = {
        new Among("la", -1, -1, "", this),
        new Among("sela", 0, -1, "", this),
        new Among("le", -1, -1, "", this),
        new Among("me", -1, -1, "", this),
        new Among("se", -1, -1, "", this),
        new Among("lo", -1, -1, "", this),
        new Among("selo", 5, -1, "", this),
        new Among("las", -1, -1, "", this),
        new Among("selas", 7, -1, "", this),
        new Among("les", -1, -1, "", this),
        new Among("los", -1, -1, "", this),
        new Among("selos", 10, -1, "", this),
        new Among("nos", -1, -1, "", this)
    };
Regards, /Mike/
Mohamed Magdy said:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
The Porter stemmer only cuts inflected forms and known suffixes; it does not convert words back to lemmas. This heuristic algorithm saves the space of dictionaries, and experience shows that it is usually good enough, at least for English. It can certainly hurt, though, if Wikipedia needs exact results.
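A toy illustration of that point (not the real Porter algorithm): the stemmer just strips known endings, so it needs no word list, but the "roots" it produces are not real lemmas.

    // Toy suffix stripper -- not the real Porter algorithm. It needs no
    // dictionary, but its output is not guaranteed to be an actual word.
    public class ToySuffixStripper {
        private static final String[] SUFFIXES = {
            "ization", "ations", "ation", "ies", "ing", "ed", "es", "s"
        };

        public static String stem(String word) {
            for (String suffix : SUFFIXES) {
                if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(stem("categories"));   // "categor" -- not a real word
            System.out.println(stem("searching"));    // "search"
            System.out.println(stem("organization")); // "organ" -- overstemming happens
        }
    }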
Cheers, /Mike "b6s" Jiang/
Hi all,
Right now the search results for "commodity" are:
* Commodities http://en.wikipedia.org/wiki/Commodities Relevance: 100.0%
* Commodity http://en.wikipedia.org/wiki/Commodity Relevance: 95.4%
* Commodate http://en.wikipedia.org/wiki/Commodate Relevance: 94.7%
* Commode http://en.wikipedia.org/wiki/Commode Relevance: 94.6%
I suggest you may want to index "Title" with StandardAnalyzer and "Content" with SnowballAnalyzer, since the title field of Wikipedia consists almost entirely of named entities that should not be modified at all. IMHO, having a mixture of original words and stemmed forms is a good heuristic, but it is only suitable for the content field.
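A sketch of that suggestion, assuming Lucene's PerFieldAnalyzerWrapper and the contrib SnowballAnalyzer (the index path and field names here are placeholders):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class PerFieldIndexing {
        public static void main(String[] args) throws Exception {
            // Default analyzer (used for "title" and anything not listed below).
            PerFieldAnalyzerWrapper analyzer =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            // Only the article text goes through the stemmer.
            analyzer.addAnalyzer("contents", new SnowballAnalyzer("English"));

            IndexWriter writer = new IndexWriter("/tmp/test-index", analyzer, true);
            // ... add Documents with "title" and "contents" fields ...
            writer.close();
        }
    }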
Sincerely, /Mike "b6s" Jiang/
That is exactly how it works in the new engine: only the content field is stemmed, and it is indexed as stemmed/original pairs. What you quoted is the current search engine's output; see the results from the new engine: http://ls2.wikimedia.org/search?dbname=enwiki&query=commodity&ns0=1
r.
On 5/22/07, Tian-Jian Barabbas Jiang@Gmail barabbas@gmail.com wrote:
Hi all,
Right now the search results for "commodity" are:
* Commodities <http://en.wikipedia.org/wiki/Commodities> Relevance: 100.0%
* Commodity <http://en.wikipedia.org/wiki/Commodity> Relevance: 95.4%
* Commodate <http://en.wikipedia.org/wiki/Commodate> Relevance: 94.7%
* Commode <http://en.wikipedia.org/wiki/Commode> Relevance: 94.6%
I suggest you may want to index "Title" with StandardAnalyzer and "Content" with SnowballAnalyzer, since the title field of Wikipedia consists almost entirely of named entities that should not be modified at all. IMHO, having a mixture of original words and stemmed forms is a good heuristic, but it is only suitable for the content field.
Sincerely,
/Mike "b6s" Jiang/
Robert Stojnic said:
That is exactly how it works in the new engine: only the content field is stemmed, and it is indexed as stemmed/original pairs. What you quoted is the current search engine's output; see the results from the new engine: http://ls2.wikimedia.org/search?dbname=enwiki&query=commodity&ns0=1
r.
I see, my bad. Thank you, Robert.
Cheers, /Mike/
This is working really nicely. Are you documenting the implementation? <G>
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Robert Stojnic Sent: Tuesday, May 22, 2007 9:00 To: wikitech-l@lists.wikimedia.org Subject: [Wikitech-l] lucene search 2.0 test webinterface
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
http://en.wikipedia.org/wiki/Commodity_%28Marxism%29
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
Is it also possible that there is currently no restriction on the namespace, even if you indicate that you only want results from the main namespace?
Otherwise, the search engine is fast and the results are overall promising. Are you considering adding snippets of the search results?
Mathias
On Tue, May 22, 2007 at 11:24:21PM +0200, Mathias Schindler wrote:
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
I *often* search (by accident) for exact article titles, and when I discover that they *are* exact article titles, and did not appear at number one in the search, I wander off, muttering to myself...
Cheers, -- jra
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result.
Why is [[Noam Chomsky]] only at #4 in the results?
Hmm, good, this made me rethink the scoring, so I made some adjustments that favor larger articles a bit more and favor exact title matches more. So now Noam Chomsky is in its rightful first place. :)
Otherwise, the search engine is fast and the results are overall
promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
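For reference, the per-result work described above looks roughly like this with the Lucene contrib highlighter (a sketch only; the field name, analyzer, and fragment count are placeholders):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class SnippetSketch {
        public static String snippet(Query query, Analyzer analyzer, String articleText)
                throws Exception {
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            // Re-tokenize the whole article text and keep the 3 best fragments --
            // this is the CPU/memory cost mentioned above, done once per result.
            return highlighter.getBestFragments(
                    analyzer.tokenStream("contents", new StringReader(articleText)),
                    articleText, 3, " ... ");
        }
    }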
r.
Hi Robert,
2007/5/23, Robert Stojnic rainmansr@gmail.com:
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result.
Why is [[Noam Chomsky]] only at #4 in the results?
Hmm, good, this made me rethink the scoring, so I made some adjustments that favor larger articles a bit more and favor exact title matches more. So now Noam Chomsky is in its rightful first place. :)
I suggest you test this by MRR (Mean Reciprocal Rate):
1. Use all titles as both queries and answers.
2. Evaluate each query result by its Reciprocal Rate, like this: if the answer shows up as the Nth result, score = 1/N.
3. Calculate the average of the RRs to get the MRR.
For queries containing common phrases, you may want to use a more complicated RR based on similarity to the title, or just annotate an answer set by hand.
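A small sketch of that evaluation loop (the SearchEngine interface below is only a placeholder for whatever returns the ranked page titles):

    import java.util.List;

    public class MrrEvaluation {
        public static double meanReciprocalRank(List<String> titles, SearchEngine engine) {
            double sum = 0.0;
            for (String title : titles) {
                List<String> results = engine.runQuery(title); // hypothetical call
                int rank = results.indexOf(title);             // 0-based, -1 if missing
                if (rank >= 0) {
                    sum += 1.0 / (rank + 1);                   // RR = 1/N for position N
                }
            }
            return sum / titles.size();
        }

        /** Placeholder standing in for the actual search engine. */
        public interface SearchEngine {
            List<String> runQuery(String query);
        }
    }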
Otherwise, the search engine is fast and the results are overall
promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
BTW, I'm pretty interested in Lucene-related tasks. If it's OK with you, I would like to help. :)
Sincerely, /Mike "b6s" Jiang/
Hi all,
Tian-Jian "Barabbas" Jiang said:
I suggest you test this by MRR (Mean Reciprocal Rate):
s/Rate/Rank/g. Sorry about the typo. You may also want to check MAP (Mean Average Precision).
Although I bet you have already done it, here's my 2 cents: I usually apply one principle to my IR system: precision first, recall next. For example, my system may do an exact match first, get the results from
searcher.doc(topDocs.scoreDocs[i].doc)
and save them externally. That allows me to merge in some more partially-matched results later. Apparently this could be done with something like parallel queries, but I like to merge them sequentially myself.
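A sketch of that "precision first, recall next" merge (Lucene 2.x-style calls; the queries and the result limit are placeholders): run the exact query first, then append results of a looser query, skipping documents already seen.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class PrecisionFirstMerge {
        public static List<Document> search(IndexSearcher searcher, Query exact,
                Query partial, int n) throws Exception {
            List<Document> merged = new ArrayList<Document>();
            Set<Integer> seen = new HashSet<Integer>();
            for (Query q : new Query[] { exact, partial }) {  // exact matches first
                TopDocs topDocs = searcher.search(q, null, n);
                for (int i = 0; i < topDocs.scoreDocs.length && merged.size() < n; i++) {
                    int docId = topDocs.scoreDocs[i].doc;
                    if (seen.add(docId)) {
                        merged.add(searcher.doc(docId));      // as in the snippet above
                    }
                }
            }
            return merged;
        }
    }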
For queries containing common phrases, you may want to use a more complicated RR based on similarity to the title, or just annotate an answer set by hand.
Otherwise, the search engine is fast and the results are overall promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
BTW, I'm pretty interesting in lucene-related tasks. If it's OK to you, I would like to help. :)
Sincerely,
/Mike "b6s" Jiang/
On May 22, 2007, at 5:24 PM, Mathias Schindler wrote:
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
I got similar results for [[Thomas Jefferson]].
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Normally the exact title match should come first, though I suppose there might be exceptions.
On 23/05/07, Mohamed Magdy mohamed.m.k@gmail.com wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
Now, *that* would be a useful thing.
- d.
Mohamed Magdy wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
-- Tim Starling
Tim Starling wrote:
Mohamed Magdy wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
-- Tim Starling
Yes, he is doing fine.. I meant something automated, like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
Yes, he is doing fine ... I meant something automated like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
Nick Jenkins wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
Yes, he is doing fine ... I meant something automated like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
You know, that looks really cool now that I've clicked on it ;)
The Add box is what I meant originally, but the arrows are a good idea too.. but one problem: what if the title is on the second or third page? Would you have to keep kicking it upwards with all those clicks? Or would there be a super kick, like pointing at the title and telling it to go to the top?
You mentioned abuse; commercial benefit or not, there will always be those who like to abuse it..
Last thing, could you add the ability to search within the results too?
The last few e-mails made me think that the reason the results still kinda sucked is that there was no relevance mechanism; measuring the score based only on the number of word matches does only a fair job. This is why I included a page-rank-like term in the score. I take the number of articles that refer to a certain article (i.e. the number of entries in "what links here") and use a formula to calculate a boost for the title field (if the query matches the title, it also matches links to the article, and that score should be boosted).
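For illustration, the boost could be attached to the title field at indexing time roughly like this (a sketch only; the actual formula used by lucene-search is not shown in this thread, and the one below is hypothetical):

    import org.apache.lucene.document.Field;

    public class TitleBoost {
        public static Field boostedTitle(String title, int incomingLinks) {
            Field titleField = new Field("title", title,
                    Field.Store.YES, Field.Index.TOKENIZED);
            // Hypothetical formula: logarithmic, so very popular articles do
            // not completely drown out everything else.
            float boost = 1.0f + (float) Math.log10(1 + incomingLinks);
            titleField.setBoost(boost);
            return titleField;
        }
    }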
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
You can see the difference in scoring by looking at Nick's screenshot, and this link: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
r.
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for:
http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
On 5/24/07, Robert Stojnic rainmansr@gmail.com wrote:
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example http://ls2.wikimedia.org/search?dbname=enwiki&query=Germany&ns0=1
(who needs camelcase redirects these days?)
It is my impression that the overall search result quality has improved since I last checked.
As a side note: this search is still incredibly fast. Do you have more information about the number of queries and the load on the machines involved?
Mathias
Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example
Marking redirects as such would be helpful though.
Mathias Schindler wrote:
On 5/24/07, Robert Stojnic rainmansr@gmail.com wrote:
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example http://ls2.wikimedia.org/search?dbname=enwiki&query=Germany&ns0=1
(who needs camelcase redirects these days?)
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Deleting redirects is a big "FUCK YOU" to the public.
http://www.w3.org/Provider/Style/URI
-- brion vibber (brion @ wikimedia.org)
On 5/24/07, Brion Vibber brion@wikimedia.org wrote:
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Using a proper citation style with an old-id will keep you unaffected when redirects are deleted, even the stupid ones.
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
[[GerMany]] used to be a "proper" Wikipedia article. In 2002.
Mathias
On 24/05/07, Mathias Schindler mathias.schindler@gmail.com wrote:
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
Why on earth shouldn't we? Are the disks going to fill?
- d.
On 5/25/07, David Gerard dgerard@gmail.com wrote:
On 24/05/07, Mathias Schindler mathias.schindler@gmail.com wrote:
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
Why on earth shouldn't we? Are the disks going to fill?
Even a redirect usually carries a meaning. There are proper uses for redirects, the same way there are proper non-encyclopedic uses of disambiguation pages, for example when dealing with computational-linguistics problems.
Heaven will not fall down, and god will not kill a kitten or a sysadmin every time there is an improper redirect. Nor will it fill our disks more than images do.
Mathias
Brion Vibber wrote:
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Deleting redirects is a big "FUCK YOU" to the public.
http://www.w3.org/Provider/Style/URI
-- brion vibber (brion @ wikimedia.org)
We certainly wouldn't want to keep the ON WHEELS! redirects, that's for sure. :P
although for the Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results). At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Sorry, but I disagree. There are bad people out there. If this system goes public, expect hundreds of votes (from the same IP) on "Cat" for "John's cat shop". Also, scoring opens an easy way to "Google bombing". At the very least, it shouldn't be available to anonymous users. And even better if it also checks that the user is autoconfirmed, not blocked, and has more than 10 edits :P
Very good work, Robert, and it (at last!) ignores accents when searching :) :)
On 24/05/07, Platonides Platonides@gmail.com wrote:
although for the Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results). At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Sorry, but I disagree. There are bad people out there. If this system goes public, expect hundreds of votes (from the same IP) on "Cat" for "John's cat shop". Also, scoring opens an easy way to "Google bombing". At the very least, it shouldn't be available to anonymous users. And even better if it also checks that the user is autoconfirmed, not blocked, and has more than 10 edits :P
Nah. Try it first and see how stupid the outside world is. There's no point protecting against an effect you haven't measured, and there's no reason to *use* outside votes just because you've *gathered* them.
- d.
On 22/05/07, Robert Stojnic rainmansr@gmail.com wrote:
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Has any attention been paid to searching for images on Commons? A vexed problem indeed ...
(BTW: has any progress been made toward a tagging implementation that wouldn't cripple the Wikimedia servers if complex tag queries are run?)
- d.