Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
Robert Stojnic wrote:
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages.
Looking good! :)
-- brion vibber (brion @ wikimedia.org)
Robert Stojnic wrote:
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
http://en.wikipedia.org/wiki/Commodity_%28Marxism%29
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
P.S. About searching in categories: note that subcategories are not expanded; this needs to be done on the client side (i.e. in the MediaWiki extension).
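To illustrate the idea of preferring original words over their stems, here is a sketch only (not the actual lucene-search code; the field names and boost values are made up): search both an unstemmed field and a stemmed field, and give the exact match a larger boost.

    // Sketch only -- not the actual lucene-search code. Field names and the
    // boost values are made up. The idea: search both an unstemmed field and
    // a stemmed field, and give the exact (unstemmed) match a larger boost.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class ExactVsStemmedQuery {
        public static BooleanQuery build(String word, String stem) {
            BooleanQuery q = new BooleanQuery();

            // Exact form as typed by the user, searched in the unstemmed field.
            TermQuery exact = new TermQuery(new Term("contents_exact", word));
            exact.setBoost(4.0f);   // hypothetical boost value

            // Stemmed form still matches, but contributes a smaller score.
            TermQuery stemmed = new TermQuery(new Term("contents_stemmed", stem));
            stemmed.setBoost(1.0f);

            q.add(exact, BooleanClause.Occur.SHOULD);
            q.add(stemmed, BooleanClause.Occur.SHOULD);
            return q;
        }
    }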
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
P.S. About searching in categories: note that subcategories are not expanded; this needs to be done on the client side (i.e. in the MediaWiki extension).
So you have to search each one separately?
[*] http://bugzilla.wikimedia.org/show_bug.cgi?id=2511
Mohamed Magdy wrote:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
Roughly speaking, stemming is the process of taking inflected forms of words ("category" -> "categories") and extracting a normalized root form (say, "categori") for comparison purposes. This allows you to search on one form and receive results containing the other.
The exact code to do this will vary depending on language. A number of preexisting stemming filters exist for Lucene's indexing engine, some of which are used here.
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
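Roughly, here is what a stemming analyzer does to the token stream; a minimal sketch, assuming Lucene 2.x and the contrib SnowballAnalyzer (it uses the old TokenStream.next()/termText() API):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class StemDemo {
        public static void main(String[] args) throws Exception {
            SnowballAnalyzer analyzer = new SnowballAnalyzer("English");
            TokenStream ts = analyzer.tokenStream("contents",
                    new StringReader("category categories"));
            Token t;
            while ((t = ts.next()) != null) {
                // Both inflected forms come out as the same root, "categori",
                // which is why a search on one form also matches the other.
                System.out.println(t.termText());
            }
        }
    }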
-- brion vibber (brion @ wikimedia.org)
Brion Vibber said:
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
I guess Wikipedia is using SnowballAnalyzer rather than StandardAnalyzer? To my knowledge, the latter does not use any stemmer.
Regards, /Mike "b6s" Jiang/
Brion Vibber wrote:
Mohamed Magdy wrote:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
Roughly speaking, stemming is the process of taking inflected forms of words ("category" -> "categories") and extracting a normalized root form (say, "categori") for comparison purposes. This allows you to search on one form and receive results containing the other.
Thanks for explaining!
The exact code to do this will vary depending on language. A number of preexisting stemming filters exist for Lucene's indexing engine, some of which are used here.
Our currently-live search does basic stemming for English, German, Russian, and Esperanto, but not for other languages.
That raises the question: when and how will other languages get stemming as well?
Sorry if this is a bit annoying, but what is the difference between basic and more advanced stemming? Or will that be added?
The issue Robert mentioned was that the old way would return results with any inflected form unconditionally, which can be annoying when you really did want an exact match. The new code has a preference for exact matches in the ranking, but will return related inflected forms as well.
-- brion vibber (brion @ wikimedia.org)
That raises the question: when and how will other languages get stemming as well?
The new search engine has stemmers for these languages: English, Danish, Dutch, Finnish, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and Esperanto. It also has a filter for Thai, so that words are properly separated.
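For illustration only (this is not the actual lucene-search configuration; the language codes and the fallback are placeholders), picking a per-language Snowball stemmer could look roughly like this. Esperanto stemming and Thai word segmentation need their own filters and are not shown:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class LanguageAnalyzers {
        public static Analyzer forLanguage(String code) {
            if (code.equals("en")) return new SnowballAnalyzer("English");
            if (code.equals("da")) return new SnowballAnalyzer("Danish");
            if (code.equals("nl")) return new SnowballAnalyzer("Dutch");
            if (code.equals("fi")) return new SnowballAnalyzer("Finnish");
            if (code.equals("de")) return new SnowballAnalyzer("German");
            if (code.equals("it")) return new SnowballAnalyzer("Italian");
            if (code.equals("no")) return new SnowballAnalyzer("Norwegian");
            if (code.equals("pt")) return new SnowballAnalyzer("Portuguese");
            if (code.equals("ru")) return new SnowballAnalyzer("Russian");
            if (code.equals("es")) return new SnowballAnalyzer("Spanish");
            if (code.equals("sv")) return new SnowballAnalyzer("Swedish");
            // Esperanto stemming and Thai word segmentation use custom filters
            // in the real engine; fall back to a plain analyzer in this sketch.
            return new SimpleAnalyzer();
        }
    }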
r.
Robert Stojnic said:
That raises the question: when and how will other languages get stemming as well?
The new search engine has stemmers for these languages: English, Danish, Dutch, Finnish, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and Esperanto. It also has a filter for Thai, so that words are properly separated.
r.
For example: https://svn.apache.org/repos/asf/lucene/java/trunk/contrib/snowball/src/java...
    private Among a_1[] = {
        new Among("la", -1, -1, "", this),
        new Among("sela", 0, -1, "", this),
        new Among("le", -1, -1, "", this),
        new Among("me", -1, -1, "", this),
        new Among("se", -1, -1, "", this),
        new Among("lo", -1, -1, "", this),
        new Among("selo", 5, -1, "", this),
        new Among("las", -1, -1, "", this),
        new Among("selas", 7, -1, "", this),
        new Among("les", -1, -1, "", this),
        new Among("los", -1, -1, "", this),
        new Among("selos", 10, -1, "", this),
        new Among("nos", -1, -1, "", this)
    };
Regards, /Mike/
Mohamed Magdy said:
Robert Stojnic wrote:
It sounds nice to be able to limit the search to a certain category.. nice work! But what does this mean: "and stemmed words are penalized"?
The stemming issue is reported in bug 2511 [*]. The bug was caused by the indexer not indexing the original word, but only its root (i.e. the stemmed form). Now both are indexed, and original words are preferred, i.e. they get larger scores.
I may be wrong.. but isn't it right that before the program can get the root of a word, it has to know it? I mean, it should have a big list of words and their roots? And that is not for English only; you would have to have lists for each language? Or how else will the program strip the words?
The Porter stemmer only cuts inflected forms and known suffixes; it does not convert words back to lemmas. This heuristic algorithm saves the space of dictionaries, and experience shows that it is usually good enough, at least for English. It can certainly hurt, though, if Wikipedia needs exact results.
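A toy illustration of that point (not the real Porter algorithm): the stemmer just strips known endings, so it needs no word list, but the "roots" it produces are not real lemmas.

    // Toy suffix stripper -- not the real Porter algorithm. It needs no
    // dictionary, but its output is not guaranteed to be an actual word.
    public class ToySuffixStripper {
        private static final String[] SUFFIXES = {
            "ization", "ations", "ation", "ies", "ing", "ed", "es", "s"
        };

        public static String stem(String word) {
            for (String suffix : SUFFIXES) {
                if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(stem("categories"));   // "categor" -- not a real word
            System.out.println(stem("searching"));    // "search"
            System.out.println(stem("organization")); // "organ" -- overstemming happens
        }
    }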
Cheers, /Mike "b6s" Jiang/
Hi all,
Right now the search results for "commodity" are:
* Commodities http://en.wikipedia.org/wiki/Commodities Relevance: 100.0%
* Commodity http://en.wikipedia.org/wiki/Commodity Relevance: 95.4%
* Commodate http://en.wikipedia.org/wiki/Commodate Relevance: 94.7%
* Commode http://en.wikipedia.org/wiki/Commode Relevance: 94.6%
I suggest you may want to index "Title" with StandardAnalyzer and "Content" with SnowballAnalyzer, since the title field of Wikipedia consists almost entirely of named entities that should not be modified at all. IMHO, having a mixture of original words and stemmed forms is a good heuristic, but it is only suitable for the content field.
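A sketch of that suggestion, assuming Lucene's PerFieldAnalyzerWrapper and the contrib SnowballAnalyzer (the index path and field names here are placeholders):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class PerFieldIndexing {
        public static void main(String[] args) throws Exception {
            // Default analyzer (used for "title" and anything not listed below).
            PerFieldAnalyzerWrapper analyzer =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            // Only the article text goes through the stemmer.
            analyzer.addAnalyzer("contents", new SnowballAnalyzer("English"));

            IndexWriter writer = new IndexWriter("/tmp/test-index", analyzer, true);
            // ... add Documents with "title" and "contents" fields ...
            writer.close();
        }
    }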
Sincerely, /Mike "b6s" Jiang/
That is exactly how it works in the new engine: only the content field is stemmed, and it is indexed as stemmed/original pairs. What you quoted is the current search engine's output; see the results from the new engine: http://ls2.wikimedia.org/search?dbname=enwiki&query=commodity&ns0=1
r.
On 5/22/07, Tian-Jian Barabbas Jiang@Gmail barabbas@gmail.com wrote:
Hi all,
Right now the search results for "commodity" are:
* Commodities <http://en.wikipedia.org/wiki/Commodities> Relevance: 100.0%
* Commodity <http://en.wikipedia.org/wiki/Commodity> Relevance: 95.4%
* Commodate <http://en.wikipedia.org/wiki/Commodate> Relevance: 94.7%
* Commode <http://en.wikipedia.org/wiki/Commode> Relevance: 94.6%
I suggest you may want to index "Title" with StandardAnalyzer and "Content" with SnowballAnalyzer, since the title field of Wikipedia consists almost entirely of named entities that should not be modified at all. IMHO, having a mixture of original words and stemmed forms is a good heuristic, but it is only suitable for the content field.
Sincerely,
/Mike "b6s" Jiang/
Robert Stojnic said:
That is exactly how it works in the new engine: only the content field is stemmed, and it is indexed as stemmed/original pairs. What you quoted is the current search engine's output; see the results from the new engine: http://ls2.wikimedia.org/search?dbname=enwiki&query=commodity&ns0=1
r.
I see, my bad. Thank you, Robert.
Cheers, /Mike/
This is working really nicely. Are you documenting the implementation? <G>
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Robert Stojnic Sent: Tuesday, May 22, 2007 9:00 To: wikitech-l@lists.wikimedia.org Subject: [Wikitech-l] lucene search 2.0 test webinterface
Hi all,
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Most changes are in the internals (i.e. making searching/indexing distributed, incremental updates, ...), but I also tried to improve the scoring, added some new search syntax, and enabled stemming for another ten or so major languages. Highlights:
- Prefix searches. E.g. entering help:images in the search box will search only the Help namespace.
- Category searches. You can limit a search by category, e.g. clarinet incategory:"woodwind instruments".
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search. The top two articles have almost the same score: the first, Commodity (Marxism), is a long article about the usage of the word in Marxism, while the other, Commodity, is much shorter but its title fits more accurately.
The test index is based on the latest dumps for the 15 largest wikis, with updates from the last 4-5 days.
Any feedback will be appreciated :)
Robert
http://en.wikipedia.org/wiki/Commodity_%28Marxism%29
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
Is it also possible that there is currently no restriction on the namespace, even if you indicate that you only want results from the main namespace?
Otherwise, the search engine is fast and the results are overall promising. Are you considering adding snippets of the search results?
Mathias
On Tue, May 22, 2007 at 11:24:21PM +0200, Mathias Schindler wrote:
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
I *often* search (by accident) for exact article titles, and when I discover that they *are* exact article titles, and did not appear at number one in the search, I wander off, muttering to myself...
Cheers, -- jra
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result.
Why is [[Noam Chomsky]] only at #4 in the results?
Hmm, good, this made me rethink the scoring, so I made some adjustments that favor larger articles a bit more and favor exact title matches more. So now Noam Chomsky is in its rightful first place. :)
Otherwise, the search engine is fast and the results are overall
promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
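For reference, the per-result work described above looks roughly like this with the Lucene contrib highlighter (a sketch only; the field name, analyzer, and fragment count are placeholders):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class SnippetSketch {
        public static String snippet(Query query, Analyzer analyzer, String articleText)
                throws Exception {
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            // Re-tokenize the whole article text and keep the 3 best fragments --
            // this is the CPU/memory cost mentioned above, done once per result.
            return highlighter.getBestFragments(
                    analyzer.tokenStream("contents", new StringReader(articleText)),
                    articleText, 3, " ... ");
        }
    }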
r.
Hi Robert,
2007/5/23, Robert Stojnic rainmansr@gmail.com:
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result.
Why is [[Noam Chomsky]] only at #4 in the results?
Hmm, good, this made me rethink the scoring, so I made some adjustments that favor larger articles a bit more and favor exact title matches more. So now Noam Chomsky is in its rightful first place. :)
I suggest you test this by MRR (Mean Reciprocal Rate):
1. Use all titles as both queries and answers.
2. Evaluate each query result by its Reciprocal Rate, like this: if the answer shows up as the Nth result, score = 1/N.
3. Calculate the average of the RRs to get the MRR.
For queries containing common phrases, you may want to use a more complicated RR based on similarity to the title, or just annotate an answer set by hand.
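A small sketch of that evaluation loop (the SearchEngine interface below is only a placeholder for whatever returns the ranked page titles):

    import java.util.List;

    public class MrrEvaluation {
        public static double meanReciprocalRank(List<String> titles, SearchEngine engine) {
            double sum = 0.0;
            for (String title : titles) {
                List<String> results = engine.runQuery(title); // hypothetical call
                int rank = results.indexOf(title);             // 0-based, -1 if missing
                if (rank >= 0) {
                    sum += 1.0 / (rank + 1);                   // RR = 1/N for position N
                }
            }
            return sum / titles.size();
        }

        /** Placeholder standing in for the actual search engine. */
        public interface SearchEngine {
            List<String> runQuery(String query);
        }
    }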
Otherwise, the search engine is fast and the results are overall
promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
BTW, I'm pretty interested in Lucene-related tasks. If it's OK with you, I would like to help. :)
Sincerely, /Mike "b6s" Jiang/
Hi all,
Tian-Jian "Barabbas" Jiang said:
I suggest you test this by MRR (Mean Reciprocal Rate):
s/Rate/Rank/g. Sorry about the typo. You may also want to check MAP (Mean Average Precision).
Although I bet you have already done it, here's my 2 cents: I usually apply one principle to my IR system: precision first, recall next. For example, my system may do an exact match first, get the results from
searcher.doc(topDocs.scoreDocs[i].doc)
and save them externally. That allows me to merge in some more partially-matched results later. Apparently this could be done with something like parallel queries, but I like to merge them sequentially myself.
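A sketch of that "precision first, recall next" merge (Lucene 2.x-style calls; the queries and the result limit are placeholders): run the exact query first, then append results of a looser query, skipping documents already seen.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class PrecisionFirstMerge {
        public static List<Document> search(IndexSearcher searcher, Query exact,
                Query partial, int n) throws Exception {
            List<Document> merged = new ArrayList<Document>();
            Set<Integer> seen = new HashSet<Integer>();
            for (Query q : new Query[] { exact, partial }) {  // exact matches first
                TopDocs topDocs = searcher.search(q, null, n);
                for (int i = 0; i < topDocs.scoreDocs.length && merged.size() < n; i++) {
                    int docId = topDocs.scoreDocs[i].doc;
                    if (seen.add(docId)) {
                        merged.add(searcher.doc(docId));      // as in the snippet above
                    }
                }
            }
            return merged;
        }
    }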
For queries containing common phrases, you may want to use a more complicated RR based on similarity to the title, or just annotate an answer set by hand.
Otherwise, the search engine is fast and the results are overall promising. Are you considering adding snippets of the search results?
Highlighting is a very CPU- and memory-consuming thing. You need to fetch all the articles in the search results (i.e. 20 per page), retokenize them, fragment them into snippets, and score each snippet so you can show the best ones. I'm currently working on a distributed implementation for this, but it might still put too heavy a load on the cluster.
Apache Solr may be an alternative solution for this.
BTW, I'm pretty interesting in lucene-related tasks. If it's OK to you, I would like to help. :)
Sincerely,
/Mike "b6s" Jiang/
On May 22, 2007, at 5:24 PM, Mathias Schindler wrote:
On 5/22/07, Robert Stojnic rainmansr@gmail.com wrote:
- Improved scoring. Default Lucene scoring favors short articles; I tried to make the scoring as relevant to Wikipedia as possible. A good test is entering "commodity" into the search.
Hmmm. Searching for "Noam Chomsky" gives me a rather strange result. Why is [[Noam Chomsky]] only at #4 in the results?
I got similar results for [[Thomas Jefferson]].
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Normally the exact title match should come first, though I suppose there might be exceptions.
On 23/05/07, Mohamed Magdy mohamed.m.k@gmail.com wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
Now, *that* would be a useful thing.
- d.
Mohamed Magdy wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
-- Tim Starling
Tim Starling wrote:
Mohamed Magdy wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
-- Tim Starling
Yes, he is doing fine.. I meant something automated, like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
Yes, he is doing fine ... I meant something automated like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
Nick Jenkins wrote:
Seems easy to fix by giving a bit of a scoring boost to exact title matches?
Maybe add a suggestion box, telling the engine to fix it?
I think Robert already fixed it, see his post dated 10:03 UTC. So the mailing list seems to be doing a sufficient job so far.
Yes, he is doing fine ... I meant something automated like this for example: http://www.google.com/quality_form?q=orphan+cats&hl=en
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
You know, that looks really cool now that I've clicked on it ;)
The Add box is what I meant originally, but the arrows are a good idea too.. but one problem: what if the title is on the second or third page? Would you have to keep kicking it upwards with all those clicks? Or would there be a super kick, like pointing at the title and telling it to go to the top?
You mentioned abuse; commercial benefit or not, there will always be those who like to abuse it..
Last thing, could you add the ability to search within the results too?
The last few e-mails made me think that the reason the results still kinda sucked is that there was no relevance mechanism; measuring the score based only on the number of word matches does only a fair job. This is why I included a page-rank-like term in the score. I take the number of articles that refer to a certain article (i.e. the number of entries in "what links here") and use a formula to calculate a boost for the title field (if the query matches the title, it also matches links to the article, and that score should be boosted).
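For illustration, the boost could be attached to the title field at indexing time roughly like this (a sketch only; the actual formula used by lucene-search is not shown in this thread, and the one below is hypothetical):

    import org.apache.lucene.document.Field;

    public class TitleBoost {
        public static Field boostedTitle(String title, int incomingLinks) {
            Field titleField = new Field("title", title,
                    Field.Store.YES, Field.Index.TOKENIZED);
            // Hypothetical formula: logarithmic, so very popular articles do
            // not completely drown out everything else.
            float boost = 1.0f + (float) Math.log10(1 + incomingLinks);
            titleField.setBoost(boost);
            return titleField;
        }
    }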
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
You can see the difference in scoring by looking at Nick's screenshot, and this link: http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
r.
I _think_ maybe what you're saying is that you want something like this: http://files.nickj.org/MediaWiki/lucene-search-2-feedback-mockup.png
I.e. if you click the up arrow, it visually bumps something up one line in the search results (and sends some kind of AJAX message to the server to indicate what the user has done); the down arrow likewise bumps something down a line; and if you enter a page title in the input box and click the "Add" button, it AJAX-checks that the page actually exists, and if so adds it to the search results as the first item.
And you probably want some kind of mechanism like Google has, where clicking on a link acts as a "+1" vote for that link. (They use JavaScript to do this, so that the link looks right when you mouse over it but bounces the HTTP request off their servers when you click it, and they only do it on some requests, although for Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results.)
At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Oh, and [[French Revolution]] should maybe be the first result for:
http://ls2.wikimedia.org/search?dbname=enwiki&query=french+revolution&am...
-- All the best, Nick.
On 5/24/07, Robert Stojnic rainmansr@gmail.com wrote:
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example http://ls2.wikimedia.org/search?dbname=enwiki&query=Germany&ns0=1
(who needs camelcase redirects these days?)
It is my impression that the overall search result quality has improved since I last checked.
As a side note: this search is still incredibly fast. Do you have more information about the number of queries and the load on the machines involved?
Mathias
Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example
Marking redirects as such would be helpful though.
Mathias Schindler wrote:
On 5/24/07, Robert Stojnic rainmansr@gmail.com wrote:
I've installed the rebuilt en.wiki index at http://ls2.wikimedia.org/ Note however, there is a bug with redirects showing up in search results, but this is just a glitch I'm fixing later today, so you can safely ignore them.
Please don't! This search is simply perfect for finding old unusable redirects with no purpose. For example http://ls2.wikimedia.org/search?dbname=enwiki&query=Germany&ns0=1
(who needs camelcase redirects these days?)
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Deleting redirects is a big "FUCK YOU" to the public.
http://www.w3.org/Provider/Style/URI
-- brion vibber (brion @ wikimedia.org)
On 5/24/07, Brion Vibber brion@wikimedia.org wrote:
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Using a proper citation style with an old-id will keep you unaffected when redirects are deleted, even the stupid ones.
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
[[GerMany]] used to be a "proper" Wikipedia article. In 2002.
Mathias
On 24/05/07, Mathias Schindler mathias.schindler@gmail.com wrote:
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
Why on earth shouldn't we? Are the disks going to fill?
- d.
On 5/25/07, David Gerard dgerard@gmail.com wrote:
On 24/05/07, Mathias Schindler mathias.schindler@gmail.com wrote:
Why on earth should we keep a page [[Jimmy wales]] that was never a legitimate redirect (as opposed to [[Jimmy Donal Wales]])? Just because a blogger was unable to set the proper (and still-working) link, should a bunch of redirects with no value or sense clutter the article lists?
Why on earth shouldn't we? Are the disks going to fill?
Even a redirect usually carries a meaning. There are proper uses for redirects, the same way there are proper non-encyclopedic uses of disambiguation pages, for example when dealing with computational-linguistics problems.
Heaven will not fall down, and god will not kill a kitten or a sysadmin every time there is an improper redirect. Nor will it fill our disks more than images do.
Mathias
Brion Vibber wrote:
Redirects should almost never be deleted -- if there was previously a link to that page, the link should remain working. That link might be in a static page, archived post, or published paper, and breaking it doesn't benefit anyone.
Deleting redirects is a big "FUCK YOU" to the public.
http://www.w3.org/Provider/Style/URI
-- brion vibber (brion @ wikimedia.org)
We certainly wouldn't want to keep the ON WHEELS! redirects, that's for sure. :P
although for the Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results). At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Sorry, but I disagree. There are bad people out there. If this system goes public, expect hundreds of votes (from the same IP) on "Cat" for "John's cat shop". Also, scoring opens an easy way to "Google bombing". At the very least, it shouldn't be available to anonymous users. And even better if it also checks that the user is autoconfirmed, not blocked, and has more than 10 edits :P
Very good work, Robert, and it (at last!) ignores accents when searching :) :)
On 24/05/07, Platonides Platonides@gmail.com wrote:
although for the Wikipedia it could be done on every request, since there's very limited commercial benefit to gaming our search results). At least, the above is my best guess as to what some kind of automated feedback mechanism would look like :-)
Sorry, but I disagree. There are bad people out there. If this system goes public, expect hundreds of votes (from the same IP) on "Cat" for "John's cat shop". Also, scoring opens an easy way to "Google bombing". At the very least, it shouldn't be available to anonymous users. And even better if it also checks that the user is autoconfirmed, not blocked, and has more than 10 edits :P
Nah. Try it first and see how stupid the outside world is. There's no point protecting against an effect you haven't measured, and there's no reason to *use* outside votes just because you've *gathered* them.
- d.
On 22/05/07, Robert Stojnic rainmansr@gmail.com wrote:
I'm working on the internal Lucene search engine. I've set up a web interface (with the kind help of Tim :) for the new engine; visit it here: http://ls2.wikimedia.org/
Has any attention been paid to searching for images on Commons? A vexed problem indeed ...
(BTW: has any progress been made toward a tagging implementation that wouldn't cripple the Wikimedia servers if complex tag queries are run?)
- d.