Julien Lemoine wrote:
With this solution, you will not be able to handle a lot of queries per second (it will probably take more than a second to query 2.3 million entries). If you use a trie (yes, I know this word now :)), everything is pre-computed, a query uses very little CPU, and you will be able to handle a lot of queries per second.
:-)
I'm sorry I hadn't taken the time to read about how you are doing it - I just looked up what a "trie" is, and I see why it's a hurdle - the whole structure is based on the first x characters. It sounds like you are using a file to store the trie datastructure, and not using SQL at all? I've got to say that I only know basic SQL, PHP, and some C - I don't have a background to speak intelligently about optimising code for zillions of hits, but I would approach this by taking all the article titles, and creating an index with every significant word in the title. After that, I'm afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere if I was going to return partial-word matches (an "obvious and naive" solution). For whole word matches (but matching multiple words, or words that are not the first word in the page title) you could query the index directly. I suppose one could build an index of partial words... would that be faster than using "LIKE"? Hmm...
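A minimal sketch of that word-index idea (using sqlite3 for brevity; the pages/title_words schema here is made up for illustration, not MediaWiki's real one):

# Sketch of the title-word-index idea. Table and column names
# (pages, title_words, word, page_id) are invented for this example.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE title_words (word TEXT, page_id INTEGER);
    CREATE INDEX idx_word ON title_words (word);
""")

def index_title(page_id, title):
    """Split a title into words and add one index row per word."""
    for word in re.findall(r"\w+", title.lower()):
        conn.execute("INSERT INTO title_words VALUES (?, ?)", (word, page_id))

# Whole-word match: hits the index directly, no LIKE needed.
def pages_with_word(word):
    return conn.execute(
        "SELECT page_id FROM title_words WHERE word = ?", (word.lower(),)
    ).fetchall()

# Partial-word match: falls back on LIKE '%...%'. The leading wildcard
# defeats the index, so every row is scanned -- the slow case at issue here.
def pages_with_fragment(fragment):
    return conn.execute(
        "SELECT page_id FROM title_words WHERE word LIKE ?",
        ("%" + fragment.lower() + "%",),
    ).fetchall()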
I'm throwing queries at my category_links table that match category_a OR category_b, then group by page and count results to perform a category math / category intersections function. It seems pretty fast, but I'm not using ajax, it's just a posted form. It would be fun to write an ajax implementation though :-)
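For illustration, the category-intersection part could look something like this (the category_links schema here is invented to mirror the description, not the real table): rows matching either category are grouped by page, and a page in both categories yields a count of 2.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_links (page_id INTEGER, category TEXT)")

def pages_in_both(cat_a, cat_b):
    # Group matching rows by page; only pages present under both
    # categories reach a distinct-category count of 2.
    return conn.execute(
        """
        SELECT page_id
        FROM category_links
        WHERE category IN (?, ?)
        GROUP BY page_id
        HAVING COUNT(DISTINCT category) = 2
        """,
        (cat_a, cat_b),
    ).fetchall()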
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
Best Regards and good luck with your project.
And Timwi wrote:
I set mine up to require at least 4 characters and to break on whitespace, so the SQL (pseudocode) is WHERE LIKE(wordone) AND LIKE(wordtwo), and so on.
This is, of course, the most obvious and most naive solution. For a data set as large as Wikipedia's list of article titles, this is far too slow and inefficient and would kill the site for everyone if it were used on the live database.
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
Regards, Aerik
On 8/4/06, Aerik Sylvan aerik@thesylvans.com wrote:
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
MySQL can only be as optimized as the algorithm that uses it. LIKE may be optimized, but it's impossible to make it very efficient given what it does.
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
I don't think he meant it as an insult. In computer science, "naive" basically means "the most obvious way to accomplish something, which is probably not a good idea performance-wise".
Well, considering the index will not be compiled in real time and will probably be updated, let's say for the sake of argument, every other day, the resources required would be minimal after memcaching and then the MySQL query cache. Since these are read-often, write-seldom records, the query cache would really excel here. And that's my 2¢.
Aerik Sylvan wrote:
Best Regards and good luck with your project.
Thank you.
I will try to give you a first draft of the idea. You will need a tokenizer to split the title into words, and probably a stop list to remove some words (a, of, the, ...).
Compilation: create a trie with all the words of your titles (excluding words in the stop list); each word is associated with a number. This number gives the access position in a vector V, which stores, at each position, the list of all documents that contain that word.
Search: when you receive a word, you can do a regular-expression search in your trie (for example '*word*') and store the results in a vector of matches (for speed, you will want a pre-allocated vector that you fill with the correct matches: for example, keep the 10 best matches).
The problem: your algorithm will visit a lot of nodes in the trie. The advantage: it will be much faster than the SQL LIKE command, which would have to be applied to every word in the list.
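A rough sketch of how this compile-and-search scheme could look (in Python for brevity; the node layout, the example stop list, and the first-10 cutoff are illustrative assumptions, not the actual implementation):

# Illustrative sketch of the compile/search scheme described above. The node
# layout, the example stop list, and the "first 10" cutoff are assumptions
# made for this sketch, not the actual implementation.

STOP_WORDS = {"a", "an", "of", "the", "in", "on"}  # example stop list

class TrieNode:
    __slots__ = ("children", "word_id")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word_id = None  # set when a complete word ends at this node

trie_root = TrieNode()
postings = []  # postings[word_id] = ids of documents containing that word

def tokenize(title):
    """Split a title into words, dropping stop words."""
    return [w for w in title.lower().split() if w not in STOP_WORDS]

# Compilation: one pass over all titles, done offline.
def add_title(doc_id, title):
    for word in tokenize(title):
        node = trie_root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.word_id is None:        # first occurrence of this word
            node.word_id = len(postings)
            postings.append([])
        postings[node.word_id].append(doc_id)

# Search: the '*word*' case. Because the fragment can occur anywhere, the
# walk may visit many nodes (the cost noted above), but no SQL scan of the
# title list is needed, and the result vector is bounded up front.
def search(fragment, limit=10):
    matches = []

    def walk(node, prefix):
        if len(matches) >= limit:
            return
        if node.word_id is not None and fragment in prefix:
            matches.append((prefix, postings[node.word_id]))
        for ch, child in node.children.items():
            walk(child, prefix + ch)

    walk(trie_root, "")
    return matches

The full walk is what makes the '*word*' case expensive; a prefix-only search ('word*') could descend directly from the root instead of visiting every node.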
Best Regards. Julien Lemoine
Aerik Sylvan wrote:
I would approach this by taking all the article titles, and creating an index with every significant word in the title. After that, I'm afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere if I was going to return partial-word matches (an "obvious and naive" solution).
Creating an index of words is not that naive anymore. :-) Also, if you use "LIKE 'word%'" instead of "LIKE '%word%'", the index can be used quite efficiently. However, you still run into the problem: how do you list the top 10 most linked-to pages given just a single letter? You'd still have to query all article titles that have a word beginning with that letter. The trie gives you a much more direct way of looking that up.
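A small sketch of the difference (the schema and the link_count column are made up for illustration; whether the prefix form actually uses the index depends on the engine and collation, but on an indexed MySQL column it generally can):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title_words (word TEXT, page_id INTEGER, link_count INTEGER);
    CREATE INDEX idx_word ON title_words (word);
""")

# Prefix-anchored: 'a%' is a contiguous range, so an index can seek
# straight to it -- but every title word starting with 'a' still has
# to be fetched and sorted to find the top 10.
top_for_letter = conn.execute(
    "SELECT word, page_id FROM title_words "
    "WHERE word LIKE 'a%' ORDER BY link_count DESC LIMIT 10"
).fetchall()

# Leading wildcard: no contiguous range exists, so every row is examined.
anywhere = conn.execute(
    "SELECT word, page_id FROM title_words WHERE word LIKE '%a%'"
).fetchall()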
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance
No matter how well something is optimised, you would still expect some things to be more efficient than others. By relying on the efficient things, Wikipedia can run on fewer and/or cheaper servers. In other words, your queries may be efficient enough to run within a fraction of a second on your server, which is not used by anyone else; but if the server is already as busy as Wikipedia's, and you then add another few thousand queries per second, you can't expect "internal optimisation" to do all the magic for you.
This is, of course, the most obvious and most naive solution. [...]
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
Please read http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith .
In the same way that you hope to be able to offer your opinion, so do I. The "getting slapped" is in your head, not in my intentions. If my opinion, given in good faith, is offensive to you, then quite frankly you need to lighten up. You can't expect it to be censored or suppressed just so that you feel better.
Timwi