Julien Lemoine wrote:
With this solution, you will not be able to handle a lot of queries per second (it will probably take more than 1 second to query 2.3 million entries). If you use a trie (yes, I know this word now :)), everything is pre-computed, so a query uses very little CPU and you can handle a lot of queries per second.
:-)
I'm sorry I hadn't taken the time to read about how you are doing it. I just looked up what a "trie" is, and I see why it's a hurdle: the whole structure is based on the first x characters. It sounds like you are storing the trie data structure in a file and not using SQL at all? I have to admit I only know basic SQL, PHP, and some C, and I don't have the background to speak intelligently about optimising code for zillions of hits, but I would approach this by taking all the article titles and creating an index of every significant word in each title. After that, I'm afraid I'd have to use SQL, and likely fall back on "LIKE" somewhere if I wanted to return partial-word matches (an "obvious and naive" solution). For whole-word matches (including multiple words, or words that are not the first word in the page title), you could query the index directly. I suppose one could also build an index of partial words... would that be faster than using "LIKE"? Hmm...
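Roughly what I have in mind is the following; it's only a sketch, and the table and column names are made up:

    -- Hypothetical word index: one row per (word, page) pair,
    -- built by splitting each article title on whitespace.
    CREATE TABLE title_words (
      word    VARCHAR(255) NOT NULL,
      page_id INT NOT NULL,
      INDEX (word)
    );

    -- Whole-word match: hits the index directly, no matter where
    -- the word appears in the title.
    SELECT p.page_id, p.page_title
    FROM title_words w
    JOIN page p ON p.page_id = w.page_id
    WHERE w.word = 'france';

    -- Partial-word match: falls back on LIKE; the leading wildcard
    -- means the index on word cannot be used, so every row is scanned.
    SELECT p.page_id, p.page_title
    FROM title_words w
    JOIN page p ON p.page_id = w.page_id
    WHERE w.word LIKE '%franc%';

(If the pattern were a prefix like 'franc%' the index on word could still be used; it's the leading wildcard that forces a scan, which I suspect is the crux of the speed question.)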
I'm throwing queries at my category_links table that match category_a OR category_b, then grouping by page and counting the results to perform category math / category intersections. It seems pretty fast, but I'm not using Ajax; it's just a posted form. It would be fun to write an Ajax implementation though :-)
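For what it's worth, the query I'm running looks roughly like this (simplified, and the column names here are just illustrative):

    -- Pages that belong to BOTH categories: match either category,
    -- group by page, and keep the pages that matched as many times
    -- as there are categories in the query (here, 2).
    SELECT page_id, COUNT(*) AS matches
    FROM category_links
    WHERE category IN ('Category_A', 'Category_B')
    GROUP BY page_id
    HAVING COUNT(*) = 2;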
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
Best Regards and good luck with your project.
And Timwi wrote:
I set mine up to require at least 4 characters and to break on whitespace, so the SQL (pseudocode) is WHERE LIKE(wordone) AND LIKE(wordtwo) and so on.
This is, of course, the most obvious and most naive solution. For a data set as large as Wikipedia's list of article titles, this is far too slow and inefficient and would kill the site for everyone if it was used on the live database.
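(For anyone following along, the pseudocode Timwi is quoting expands to roughly the query below; the table and column names are just placeholders. With a leading wildcard on every term, MySQL cannot use an index on the title column, so each query scans every row, which is why it won't scale to millions of titles.)

    -- The "obvious and naive" version: split the input on whitespace,
    -- require terms of at least 4 characters, and LIKE-match each one.
    -- The leading '%' defeats any index on the title column.
    SELECT page_title
    FROM page
    WHERE page_title LIKE '%wordone%'
      AND page_title LIKE '%wordtwo%';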
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
Regards, Aerik