Re: [Wikitech-l] Wikipedia Suggest

5 Aug 2006


      Aerik Sylvan wrote:
...
I'm sorry I hadn't taken the time to read about how you are doing it - I
just looked up what a "trie" is, and I see why it's a hurdle - the whole
structure is based on the first x characters.  It sounds like you are using
a file to store the trie datastructure, and not using SQL at all?  I've got
to say that I only know basic SQL, PHP, and some C - I don't have a
background to speak intelligently about optimising code for zillions of
hits, but I would approach this by taking all the article titles, and
creating an index with every significant word in the title.  After that, I'm
afraid I'd have to use SQL, and likely fall back onto using "LIKE" somewhere
if I was going to return partial-word matches (an "obvious and naive"
solution).  For whole word matches (but matching multiple words, or words
that are not the first word in the page title) you could query the index
directly.  I suppose one could build an index of partial words... would that
be faster than using "LIKE"?  Hmm...
I'm throwing queries at my category_links table that match category_a OR
category_b, then group by page and count results to perform a category math
/ category intersections function.  It seems pretty fast, but I'm not using
ajax, it's just a posted form.  It would be fun to write an ajax
implementation though :-)
I have always assumed that MySQL is internally optimised enough that if one
sticks to simple queries and whole word matches, you get pretty good
performance - but that's an assumption on my part.  I'm very interested in
that because I'd like to know more about optimising search queries, but of
course it's not your problem to teach me.
Best Regards and good luck with your project.
Thank you.
I will try to give you a first draft of idea :
You will need a tokenizer to split the title in words and probably a 
stop list to remove some word (a, of, the, ...)
Compilation: create a trie with all the words of your titles (without 
words in stop list), each word will be associated to a number. This 
number will give the access position in the vector V. this vector store 
at a position, all the documents that contains this word.
Search : you receive one word,  you can do a search with regular 
expression in you trie (as for exemple '*word*') and store results in a 
vector of matchs (for speed, you will need to have a already allocated 
vector, and fill it with correct matchs : for example keep the 10 best 
matchs).
The problem: you algorithm will visit a lot of nodes in your trie.
The advantage: you will be really faster than the LIKE sql command, the 
like command will be applied on each word of list.
Best Regards.
Julien Lemoine

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Wikipedia Suggest