Hi All,
So, I'm still compiler impaired (I guess I need to actually learn about compiling and linking and stuff, and not just writing code), but I managed to get a CLucene based (Lucene in C++) search set up which you can test here:
http://aerik.com/cl_search.php
Uses underscores instead of spaces in a category name.
It passes the query to the CLucene executable, then displays the raw results and calculates an overall processing time. Eventually I want to evolve this into a daemon where you send commands and queries to the indexing daemon (I'm thinking of doing it simply in terms of POSTs and GETs).
ANYWAY, try it out. It doesn't return the actual page; it returns the cl_from key from the categorylinks table. I also have a fulltext MySQL index set up, and if anyone is interested in doing comparisons I'll throw up a page for querying that too.
Just a note: it looks like the query parser breaks up dates, but I'm guessing it did that when it built the index too... try a search for '1969_Births', for example.
Best Regards, Aerik
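For anyone who wants to try the same arrangement, here is a minimal sketch of the "pass the query to an executable, show the raw output, report the elapsed time" flow described above. It is not the code behind cl_search.php (which is PHP and was not posted). The function name run_search and the ./cl_search binary in the usage note are hypothetical; the only assumption is that the search executable takes the query as a command-line argument and prints one hit per line.

```python
import subprocess
import time


def run_search(command, query):
    """Run an external search executable and time the call.

    `command` is the executable plus any fixed arguments, as a list.
    The query is passed as a single argv entry, so no shell quoting is needed.
    Returns (raw output lines, elapsed seconds).
    """
    start = time.perf_counter()  # monotonic clock, so it never runs backwards
    completed = subprocess.run(
        [*command, query],
        capture_output=True,
        text=True,
        check=True,  # raise if the executable exits with an error
    )
    elapsed = time.perf_counter() - start
    return completed.stdout.splitlines(), elapsed
```

Usage would look something like run_search(["./cl_search"], "+Living_People +Americans"). Note that the measured time includes process start-up and opening the index on every request, which is the overhead a long-running daemon would avoid.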
On Sat, Mar 8, 2008 at 11:27 AM, Aerik Sylvan aerik@thesylvans.com wrote:
So, I'm still compiler impaired (I guess I need to actually learn about compiling and linking and stuff, and not just writing code), but I managed to get a CLucene based (Lucene in C++) search set up which you can test here:
http://aerik.com/cl_search.php
Uses underscores instead of spaces in a category name.
"Completed in -0.799299 seconds"? I think something is wrong with your logic for that bit. :) What's the second column? Relevancy? I seem to mostly be getting either 0.0 or something like "-9223372036854775808.-922337203685477580".
One fairly important question as far as the backend goes is how to do sorting and limiting. "Relevancy" sorts aren't necessarily very, well, relevant here. Ideally we'd like to mimic the category page view as far as sorting and pagination goes, I assume. Does that look possible?
Simetrical <Simetrical+wikilist@...> writes:
On Sat, Mar 8, 2008 at 11:27 AM, Aerik Sylvan <aerik@...> wrote:
So, I'm still compiler impaired (I guess I need to actually learn about compiling and linking and stuff, and not just writing code), but I managed to get a CLucene based (Lucene in C++) search set up which you can test here:
http://aerik.com/cl_search.php
Uses underscores instead of spaces in a category name.
"Completed in -0.799299 seconds"? I think something is wrong with your logic for that bit. :) What's the second column? Relevancy? I seem to mostly be getting either 0.0 or something like "-9223372036854775808.-922337203685477580".
One fairly important question as far as the backend goes is how to do sorting and limiting. "Relevancy" sorts aren't necessarily very, well, relevant here. Ideally we'd like to mimic the category page view as far as sorting and pagination goes, I assume. Does that look possible?
What query did you run that got "Completed in -0.799299 seconds"? So far, I don't have any results like that...
Yes, the second number is the Lucene "score", which I agree probably isn't relevant for our purposes. Let me quickly state that I did not publish this as any kind of final product, but merely as a demonstration of the speed of one possible approach, using real data. I can already see we'd need to tweak the query parser for our purposes too.
Aerik
On Sun, Mar 9, 2008 at 1:02 PM, Aerik aerik@thesylvans.com wrote:
What query did you run that got "Completed in -0.799299 seconds"? So far, I don't have any results like that...
When I click "Submit Query" for the default query "+Living_People +Americans", I get a negative "Completed in" time the first one or two times I try it. Then it starts giving positive times for all the queries I try. I don't seem to be able to reproduce it.
Yes, the second number is the Lucene "score", which I agree probably isn't relevant for our purposes. Let me quickly state that I did not publish this as any kind of final product, but merely as a demonstration of the speed of one possible approach, using real data. I can already see we'd need to tweak the query parser for our purposes too.
Of course. I was wondering what ideas you had in the direction of improving this backend with pagination. It's not actually something I had thought of before, but it's pretty important. And it will hurt MySQL, I suspect, since it always has to do a filesort when retrieving by fulltext index and sorting by anything other than relevance.
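To make the sorting and limiting question concrete, here is a rough sketch of category-page-style ordering done on the application side. It assumes the backend hands back the complete hit list for a query and that each hit carries a sort key alongside cl_from. The Hit type, the paginate function, and the page size are all hypothetical, chosen only for illustration; the sortkey field stands in for something like cl_sortkey from the categorylinks table.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    page_id: int   # stands in for cl_from
    sortkey: str   # hypothetical: stands in for a category sort key


def paginate(hits: List[Hit], page: int, per_page: int = 200) -> List[Hit]:
    """Return one page of hits ordered by sort key instead of relevance.

    Pages are numbered from 1. Ties on the sort key are broken by page_id
    so that the ordering is stable from one request to the next.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    ordered = sorted(hits, key=lambda h: (h.sortkey, h.page_id))
    start = (page - 1) * per_page
    return ordered[start:start + per_page]
```

The catch is the same one as the filesort: the whole result set gets sorted on every request, so the cost grows with the number of matches rather than with the page size.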
Simetrical <Simetrical+wikilist@...> writes:
When I click "Submit Query" for the default query "+Living_People +Americans", I get a negative "Completed in" time the first one or two times I try it. Then it starts giving positive times for all the queries I try. I don't seem to be able to reproduce it.
<snip>
Of course. I was wondering what ideas you had in the direction of improving this backend with pagination. It's not actually something I had thought of before, but it's pretty important. And it will hurt MySQL, I suspect, since it always has to do a filesort when retrieving by fulltext index and sorting by anything other than relevance.
Okay, I was doing something stupid with PHP's microtime - that's why the weird values. Fixed it now. As far as I can tell, Lucene doesn't really do pagination: you get the whole result set, then just display the pages you care about. I haven't messed with sort at all. Frankly, I think my next efforts will be to a) learn more about compiling and b) set this up as a daemon so the executable is already in memory... maybe go ahead and open the index too, I'm not sure. And if I get that far, I'll add some pagination while I'm at it.
If anybody is even vaguely interested, I'd love to share code, etc. I really think CLucene as a daemon could be a cool thing in its own right, and could also be a nice unencumbered search solution for use on Wikipedia.
Aerik
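The message above doesn't say what the microtime mistake was, so the following is only a guess at one well-known way to get negative timings. Called without arguments, PHP's microtime() returns a string of the form "msec sec" (fraction first, whole seconds second). If two such strings are subtracted directly, only the leading fraction is used as the number, and the result goes negative whenever the clock crosses a second boundary in between. The Python sketch below imitates that behavior. The helper names are hypothetical and the timestamps are made up.

```python
def php_microtime_string(now: float) -> str:
    """Format a Unix timestamp like PHP's microtime() with no argument."""
    sec = int(now)
    return f"{now - sec:.8f} {sec}"


def buggy_elapsed(start: str, end: str) -> float:
    """Subtract using only the leading number of each string (the pitfall)."""
    return float(end.split()[0]) - float(start.split()[0])


def correct_elapsed(start: str, end: str) -> float:
    """Add the whole seconds and the fraction together before subtracting."""
    def to_seconds(stamp: str) -> float:
        frac, sec = stamp.split()
        return float(sec) + float(frac)
    return to_seconds(end) - to_seconds(start)


# Made-up timestamps 0.2 s apart that straddle a second boundary.
start = php_microtime_string(1205000000.9)
end = php_microtime_string(1205000001.1)
print(buggy_elapsed(start, end))    # negative, although time moved forward
print(correct_elapsed(start, end))  # roughly 0.2
```

In PHP itself the usual way around this is microtime(true), which returns a single float.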
On Sun, Mar 9, 2008 at 1:45 PM, Aerik aerik@thesylvans.com wrote:
If anybody is even vaguely interested, I'd love to share code, etc. I really think CLucene as a daemon could be a cool thing in its own right, and could also be a nice unencumbered search solution for use on Wikipedia.
I'd be more interested in doing the MediaWiki front-end. I might get around to that sometime.
wikitech-l@lists.wikimedia.org