Aerik Sylvan wrote:
I'm sorry I hadn't taken the time to read about how you are doing it - I just looked up what a "trie" is, and I see why it's a hurdle - the whole structure is based on the first x characters. It sounds like you are using a file to store the trie datastructure, and not using SQL at all? I've got to say that I only know basic SQL, PHP, and some C - I don't have a background to speak intelligently about optimising code for zillions of hits, but I would approach this by taking all the article titles, and creating an index with every significant word in the title. After that, I'm afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere if I was going to return partial-word matches (an "obvious and naive" solution). For whole word matches (but matching multiple words, or words that are not the first word in the page title) you could query the index directly. I suppose one could build an index of partial words... would that be faster than using "LIKE"? Hmm...
I'm throwing queries at my category_links table that match category_a OR category_b, then group by page and count results to perform a category math / category intersections function. It seems pretty fast, but I'm not using ajax, it's just a posted form. It would be fun to write an ajax implementation though :-)
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
Best Regards and good luck with your project.
Thank you.
I will try to give you a first draft of idea : You will need a tokenizer to split the title in words and probably a stop list to remove some word (a, of, the, ...)
Compilation: create a trie with all the words of your titles (without words in stop list), each word will be associated to a number. This number will give the access position in the vector V. this vector store at a position, all the documents that contains this word.
Search : you receive one word, you can do a search with regular expression in you trie (as for exemple '*word*') and store results in a vector of matchs (for speed, you will need to have a already allocated vector, and fill it with correct matchs : for example keep the 10 best matchs).
The problem: you algorithm will visit a lot of nodes in your trie. The advantage: you will be really faster than the LIKE sql command, the like command will be applied on each word of list.
Best Regards. Julien Lemoine