Julien Lemoine wrote:
With this solution, you will not be able to handle a lot of queries per second (it will probably take more than a second to query 2.3 million entries). If you use a trie (yes, I know this word now :)), everything is pre-computed, a query uses very little CPU, and you will be able to handle a lot of queries per second.
:-)
I'm sorry I hadn't taken the time to read about how you are doing it - I just looked up what a "trie" is, and I see why it's a hurdle - the whole structure is based on the first x characters. It sounds like you are using a file to store the trie datastructure, and not using SQL at all? I've got to say that I only know basic SQL, PHP, and some C - I don't have a background to speak intelligently about optimising code for zillions of hits, but I would approach this by taking all the article titles, and creating an index with every significant word in the title. After that, I'm afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere if I was going to return partial-word matches (an "obvious and naive" solution). For whole word matches (but matching multiple words, or words that are not the first word in the page title) you could query the index directly. I suppose one could build an index of partial words... would that be faster than using "LIKE"? Hmm...
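A minimal sketch of that word-index idea (using sqlite3 for brevity; the pages/title_words schema here is made up for illustration, not MediaWiki's real one):

# Sketch of the title-word-index idea. Table and column names
# (pages, title_words, word, page_id) are invented for this example.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE title_words (word TEXT, page_id INTEGER);
    CREATE INDEX idx_word ON title_words (word);
""")

def index_title(page_id, title):
    """Split a title into words and add one index row per word."""
    for word in re.findall(r"\w+", title.lower()):
        conn.execute("INSERT INTO title_words VALUES (?, ?)", (word, page_id))

# Whole-word match: hits the index directly, no LIKE needed.
def pages_with_word(word):
    return conn.execute(
        "SELECT page_id FROM title_words WHERE word = ?", (word.lower(),)
    ).fetchall()

# Partial-word match: falls back on LIKE '%...%'. The leading wildcard
# defeats the index, so every row is scanned -- the slow case at issue here.
def pages_with_fragment(fragment):
    return conn.execute(
        "SELECT page_id FROM title_words WHERE word LIKE ?",
        ("%" + fragment.lower() + "%",),
    ).fetchall()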
I'm throwing queries at my category_links table that match category_a OR category_b, then group by page and count results to perform a category math / category intersections function. It seems pretty fast, but I'm not using ajax, it's just a posted form. It would be fun to write an ajax implementation though :-)
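For illustration, the category-intersection part could look something like this (the category_links schema here is invented to mirror the description, not the real table): rows matching either category are grouped by page, and a page in both categories yields a count of 2.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_links (page_id INTEGER, category TEXT)")

def pages_in_both(cat_a, cat_b):
    # Group matching rows by page; only pages present under both
    # categories reach a distinct-category count of 2.
    return conn.execute(
        """
        SELECT page_id
        FROM category_links
        WHERE category IN (?, ?)
        GROUP BY page_id
        HAVING COUNT(DISTINCT category) = 2
        """,
        (cat_a, cat_b),
    ).fetchall()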
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
Best Regards and good luck with your project.
And Timwi wrote:
I set mine up to require at least 4 characters and to break on whitespace, so the SQL (pseudocode) is WHERE LIKE(wordone) AND LIKE(wordtwo), and so on.
This is, of course, the most obvious and most naive solution. For a data set as large as Wikipedia's list of article titles, this is far too slow and inefficient and would kill the site for everyone if it were used on the live database.
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
Regards, Aerik
On 8/4/06, Aerik Sylvan aerik@thesylvans.com wrote:
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance - but that's an assumption on my part. I'm very interested in that because I'd like to know more about optimising search queries, but of course it's not your problem to teach me.
MySQL can only be as optimized as the algorithm that uses it. LIKE may be optimized, but it's impossible to make it very efficient given what it does.
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
I don't think he meant it as an insult. In computer science, "naive" basically means "the most obvious way to accomplish something, which is probably not a good idea performance-wise".
Well, considering the index will not be compiled in real time and will probably be updated, let's say for the sake of argument, every other day, the resources required would be minimal after memcaching and then the MySQL query cache. Since these are read-often, write-seldom records, the query cache would really excel here. And that's my 2¢.
Aerik Sylvan wrote:
Best Regards and good luck with your project.
Thank you.
I will try to give you a first draft of the idea. You will need a tokenizer to split the title into words, and probably a stop list to remove some words (a, of, the, ...).
Compilation: create a trie with all the words of your titles (excluding words in the stop list); each word is associated with a number. This number gives the access position in a vector V, which stores, at each position, the list of all documents that contain that word.
Search: when you receive a word, you can do a regular-expression search in your trie (for example '*word*') and store the results in a vector of matches (for speed, you will want a pre-allocated vector that you fill with the correct matches: for example, keep the 10 best matches).
The problem: your algorithm will visit a lot of nodes in the trie. The advantage: it will be much faster than the SQL LIKE command, which would have to be applied to every word in the list.
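A rough sketch of how this compile-and-search scheme could look (in Python for brevity; the node layout, the example stop list, and the first-10 cutoff are illustrative assumptions, not the actual implementation):

# Illustrative sketch of the compile/search scheme described above. The node
# layout, the example stop list, and the "first 10" cutoff are assumptions
# made for this sketch, not the actual implementation.

STOP_WORDS = {"a", "an", "of", "the", "in", "on"}  # example stop list

class TrieNode:
    __slots__ = ("children", "word_id")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word_id = None  # set when a complete word ends at this node

trie_root = TrieNode()
postings = []  # postings[word_id] = ids of documents containing that word

def tokenize(title):
    """Split a title into words, dropping stop words."""
    return [w for w in title.lower().split() if w not in STOP_WORDS]

# Compilation: one pass over all titles, done offline.
def add_title(doc_id, title):
    for word in tokenize(title):
        node = trie_root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.word_id is None:        # first occurrence of this word
            node.word_id = len(postings)
            postings.append([])
        postings[node.word_id].append(doc_id)

# Search: the '*word*' case. Because the fragment can occur anywhere, the
# walk may visit many nodes (the cost noted above), but no SQL scan of the
# title list is needed, and the result vector is bounded up front.
def search(fragment, limit=10):
    matches = []

    def walk(node, prefix):
        if len(matches) >= limit:
            return
        if node.word_id is not None and fragment in prefix:
            matches.append((prefix, postings[node.word_id]))
        for ch, child in node.children.items():
            walk(child, prefix + ch)

    walk(trie_root, "")
    return matches

The full walk is what makes the '*word*' case expensive; a prefix-only search ('word*') could descend directly from the root instead of visiting every node.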
Best Regards. Julien Lemoine
Aerik Sylvan wrote:
I would approach this by taking all the article titles, and creating an index with every significant word in the title. After that, I'm afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere if I was going to return partial-word matches (an "obvious and naive" solution).
Creating an index of words is not that naive anymore. :-) Also, if you use "LIKE 'word%'" instead of "LIKE '%word%'", the index can be used quite efficiently. However, you still run into the problem: how do you list the top 10 most linked-to pages given just a single letter? You'd still have to query all article titles that have a word beginning with that letter. The trie gives you a much more direct way of looking that up.
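A small sketch of the difference (the schema and the link_count column are made up for illustration; whether the prefix form actually uses the index depends on the engine and collation, but on an indexed MySQL column it generally can):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title_words (word TEXT, page_id INTEGER, link_count INTEGER);
    CREATE INDEX idx_word ON title_words (word);
""")

# Prefix-anchored: 'a%' is a contiguous range, so an index can seek
# straight to it -- but every title word starting with 'a' still has
# to be fetched and sorted to find the top 10.
top_for_letter = conn.execute(
    "SELECT word, page_id FROM title_words "
    "WHERE word LIKE 'a%' ORDER BY link_count DESC LIMIT 10"
).fetchall()

# Leading wildcard: no contiguous range exists, so every row is examined.
anywhere = conn.execute(
    "SELECT word, page_id FROM title_words WHERE word LIKE '%a%'"
).fetchall()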
I have always assumed that MySQL is internally optimised enough that if one sticks to simple queries and whole word matches, you get pretty good performance
No matter how well something is optimised, you would still expect some things to be more efficient than others. By relying on the efficient things, Wikipedia can run on fewer and/or cheaper servers. In other words, your queries may be efficient enough to run within a fraction of a second on your server, which is not used by anyone else; but if the server is already as busy as Wikipedia's, and you then add another few thousand queries per second, you can't expect "internal optimisation" to do all the magic for you.
This is, of course, the most obvious and most naive solution. [...]
Timwi, from your other posts I can see that you know a lot more about computer programming than I do, but I would hope to be able to offer some opinion without getting slapped with "obvious and naive".
Please read http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith .
In the same way that you hope to be able to offer your opinion, so do I. The "getting slapped" is in your head, not in my intentions. If my opinion, given in good faith, is offensive to you, then quite frankly you need to lighten up. You can't expect it to be censored or suppressed just so that you feel better.
Timwi