In the search engine, I am currently "smashing down" all 'ing' words. This does wonders in _most_ cases, but fails miserably in some other cases. It seemed to help, on balance, when I did it -- but the wikipedia was smaller then.
In the current case, we are looking for the page [[Conditioning]]. My technique of chopping off the 'ing' performs poorly here.
So, I'm eliminating the 'ing' trick now. I'm still keeping the 's' trick. So 'horse' and 'horses' return exactly the same results. Someday, if we have lots and lots of cases where that doesn't work, I'll switch back.
'ing' is a less obviously good idea, after all. It was nice to return the same results for 'network' and 'networking'... when there wasn't much in the database, this ensured that something marginally useful would show up.
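For readers who want to see the trade-off concretely, here is a minimal sketch of suffix squashing. It is not the actual search engine code, which is not shown in this thread; the function name `squash` and the `strip_ing` switch are hypothetical.

```python
def squash(word, strip_ing=True):
    """Crude suffix squashing (illustrative only, not the real engine).

    Drops a trailing 'ing' when strip_ing is true, otherwise a trailing 's'.
    """
    w = word.lower()
    if strip_ing and w.endswith("ing") and len(w) > 3:
        return w[:-3]
    if w.endswith("s") and len(w) > 1:
        return w[:-1]
    return w


# 'network' and 'networking' collapse to the same term, which helps:
print(squash("networking"), squash("network"))
# ...but 'Conditioning' collapses to 'condition', which hurts:
print(squash("Conditioning"))
# With the 'ing' trick switched off, only the 's' trick remains:
print(squash("Conditioning", strip_ing=False), squash("horses"))
```

With `strip_ing=False` the page title survives intact, while 'horse' and 'horses' still map to the same term.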
Further clever tweaks are always possible -- but soon we will be upgrading to Magnus's software, and the search will -- at first -- just be whatever default behavior comes from MySQL. Perhaps that can be improved upon.
--Jimbo
Hi Jimmy, hi all!
I don't like to be paranoid either, it sucks. However, I see that the search engine behaves differently again on other words: Zero hits for reason but 2 for reasoning. Zero hits for most *ings. Is there manual editing involved, or doesn't the search engine do a complete re-indexing every week or so?
Grasso
At 01:58 PM 1/25/02 +0100, Ulrich Grassberger wrote:
I don't like to be paranoid either, it sucks. However, I see that the search engine behaves differently again on other words: Zero hits for reason but 2 for reasoning. Zero hits for most *ings. Is there manual editing involved, or doesn't the search engine do a complete re-indexing every week or so?
Words ending in 's' get weird responses, too. For example, I searched on 'loris' and got lots of things with 'lori' in them. Even when it found the whole word, it bold-faced the 'lori', not the whole word.
Vicki Rosenzweig wrote:
Words ending in 's' get weird responses, too. For example, I searched on 'loris' and got lots of things with 'lori' in them. Even when it found the whole word, it bold-faced the 'lori', not the whole word.
This is an excellent example of a case where my "trick" fails. Even so, I think that the results are qualitatively improved _on average_ by squashing 's'. The best example would be 'horse' and 'horses'. There are many similar cases. People may search for 'president' or 'presidents'. Since our database is not _huge_, squashing the 's' isn't bad.
There's a better way, a better version of this trick, and I could program that.
What should happen is that the _exact_ term gets a "score" boost, so that those results tend to show up higher. But we should still return results from the 'squashed' version. This is a nice balance. So, for 'Loris' you should find the _exact matches_ at or near the top, but then 'Lori' results further down.
This means that if you search for 'horses', you'll get 'horses' results first, in case that's EXACTLY what you meant. But if you're just searching for every article that mentions 'horse', you'll also get a good result.
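A rough sketch of that scoring idea, under stated assumptions: the scoring weights, the whitespace tokenizer, and the names `squash_s` and `search` are all made up for illustration, and none of this is the real engine.

```python
def squash_s(word):
    """Drop a single trailing 's' (the 's' trick only)."""
    w = word.lower()
    return w[:-1] if len(w) > 1 and w.endswith("s") else w


def search(query, docs, exact_boost=2.0):
    """Return titles ordered by score; exact hits count more than squashed hits.

    docs maps title -> text. exact_boost is an arbitrary illustrative weight.
    """
    q = query.lower()
    q_squashed = squash_s(q)
    scored = []
    for title, text in docs.items():
        score = 0.0
        for token in text.lower().split():
            if token == q:
                score += exact_boost      # exactly what the user typed
            elif squash_s(token) == q_squashed:
                score += 1.0              # matches only after squashing
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


docs = {
    "Lori": "lori is a given name",
    "Loris": "the slow loris is a primate",
    "Horse": "a horse eats grass",
}
print(search("loris", docs))   # exact 'loris' page first, then 'lori'
print(search("horses", docs))  # still finds the 'horse' page
```

The exact matches float to the top, but the squashed matches are not thrown away, so a small database still returns something useful.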
I'll think about how to make this change, but I think I'll hold off until we have some experience with the php/mysql search engine. I think it'll be sort of o.k., but I think with some creativity, I can do something that will beat it.
--Jimbo
I'll think about how to make this change, but I think I'll hold off until we have some experience with the php/mysql search engine. I think it'll be sort of o.k., but I think with some creativity, I can do something that will beat it.
I bet you can ;) Currently, I am just putting wildcards around the search request ("%foobar%") and looking for it in titles and fulltext. Pitiful, yes.
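To illustrate what such a wildcard search amounts to, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table name `article`, its columns, and the sample rows are assumptions for the example, not the schema of the new software.

```python
import sqlite3

# In-memory database with a hypothetical 'article' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [
        ("Conditioning", "Classical conditioning is a learning process."),
        ("Horse", "Horses are domesticated mammals."),
        ("Loris", "The slow loris is a primate."),
    ],
)


def like_search(conn, term):
    """Substring search: wrap the term in % wildcards, check title and body.

    Note: % or _ inside the term are not escaped in this sketch.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title FROM article "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


print(like_search(conn, "condition"))
print(like_search(conn, "horse"))
```

Because this is plain substring matching, 'condition' finds 'Conditioning' and 'horse' finds 'Horses' without any suffix trick, but there is no ranking at all and every search scans the full text.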
I'd also suggest having some Google-like mechanism that uses your "trick" and suggests links with "Did you mean to search for *foobaring*?" or similar. That wouldn't mess up the search results and would still make it easy to try variations.
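One possible shape for such a suggestion mechanism, as a sketch only: the function name `suggest_variants` and the idea of checking candidates against a list of known titles are assumptions, not a description of any existing code.

```python
def suggest_variants(query, known_titles):
    """Return 'did you mean' candidates that actually exist as titles.

    Candidates are built by adding or removing the suffixes 'ing' and 's'.
    """
    q = query.lower()
    candidates = set()
    for suffix in ("ing", "s"):
        candidates.add(q + suffix)                  # reason -> reasoning
        if q.endswith(suffix) and len(q) > len(suffix):
            candidates.add(q[: -len(suffix)])       # horses -> horse
    candidates.discard(q)
    known = {title.lower() for title in known_titles}
    return sorted(c for c in candidates if c in known)


titles = ["Reasoning", "Horse", "Conditioning"]
print(suggest_variants("reason", titles))
print(suggest_variants("horses", titles))
```

The main results stay exact, and the variants are offered only as links when a matching title exists.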
Magnus
Ulrich Grassberger wrote:
I don't like to be paranoid either, it sucks. However, I see that the search engine behaves differently again on other words: Zero hits for reason but 2 for reasoning. Zero hits for most *ings. Is there manual editing involved, or doesn't the search engine do a complete re-indexing every week or so?
There is no manual editing involved. I don't have time for that.
The search engine does a complete re-indexing whenever I run it by hand. I hesitate to put it on a cron job because I like to monitor the machine while I'm re-indexing... due to potential load problems. I'm sure I could do a better job.
In any event, you'll find that 'Conditioning' now returns '[Conditioning]' as the first result, as it should. :-)
If I'm visited by any corporate/government agents who desire to keep this incredibly powerful information suppressed, I'll let you know... if they don't lock me up first.
--Jimbo
wikipedia-l@lists.wikimedia.org