Hello,
I have written a "Google Suggest"-like service for Wikipedia, under the GPL licence [1]. I am not sure whether this project will interest you; I am open to all comments from your community.
I wrote a simple web site for English/French at http://suggest.speedblue.org; all the software is available in the download section.
Best Regards. Julien Lemoine
[1] If you want me to change the licence to something other than GPL, please let me know.
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
Hello,
I have written a "Google Suggest"-like service for Wikipedia, under the GPL licence [1]. I am not sure whether this project will interest you; I am open to all comments from your community.
This is absolutely fantastic, and would greatly benefit Wikipedia if cleaned up a little and integrated. What data is it running on?
Also, what is the number on the right side, the number of incoming links to the page?
Again, brilliant - would make the search feature far, far more useful.
Steve
Hello,
Steve Bennett wrote:
This is absolutely fantastic, and would greatly benefit Wikipedia if cleaned up a little and integrated. What data is it running on?
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
Also, what is the number on the right side, the number of incoming links to the page?
Yes, you guessed right :)
Again, brilliant - would make the search feature far, far more useful.
Thank you. Best Regards. Julien Lemoine
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
How long does the compiling take? How close to a real-time update could it be? This seems to be a recurring problem with search etc at Wikipedia - the fact that it's always at best a few days out of date, and sometimes months. Then with the problems with toolserver...
Steve
Hello Steve,
Steve Bennett wrote:
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
How long does the compiling take? How close to a real-time update could it be? This seems to be a recurring problem with search etc at Wikipedia - the fact that it's always at best a few days out of date, and sometimes months. Then with the problems with toolserver...
On my computer (Pentium D930, SATA HD, 1 GB RAM), it took about half an hour for the 1.2M English articles (I can give you the exact processing time if you want). This time includes reading all the files from disk.
Best Regards. Julien Lemoine
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
On my computer (Pentium D930, SATA HD, 1 GB RAM), it took about half an hour for the 1.2M English articles (I can give you the exact processing time if you want). This time includes reading all the files from disk.
I would be interested to hear from the server admins on the feasibility of integrating this tool in Wikipedia - particularly that compilation time.
Steve
On 8/2/06, Steve Bennett stevage@gmail.com wrote:
I would be interested to hear from the server admins on the feasibility of integrating this tool in Wikipedia - particularly that compilation time.
I strongly suspect that the number of runtime cycles this could take is a much greater concern. Half an hour to compile the list isn't far from the length some special-page indices take to build, IIRC.
Hello,
Simetrical wrote:
On 8/2/06, Steve Bennett stevage@gmail.com wrote:
I would be interested to hear from the server admins on the feasibility of integrating this tool in Wikipedia - particularly that compilation time.
I strongly suspect that the number of runtime cycles this could take is a much greater concern. Half an hour to compile the list isn't far from the length some special-page indices take to build, IIRC.
Here is the time output of the compiler for the analysis of the English Wikipedia articles: 622.65s user, 53.69s system, 70% cpu, 15:58.91 total. 95% of this time (16 min) was spent reading the files from disk (after a fresh reboot, so nothing was in the cache).
The algorithm uses very little CPU power. Imagine, for example, that all the needed data were in memory: the compiled automaton could be obtained in 1 minute.
Best Regards. Julien Lemoine
On 8/3/06, Julien Lemoine speedblue@happycoders.org wrote:
The algorithm uses very little CPU power. Imagine, for example, that all the needed data were in memory: the compiled automaton could be obtained in 1 minute.
Okay, that's good. But what I meant was, how much CPU power (and other resources) will the suggestion feature eat on a day-to-day basis? I start typing "Metasyntactic variable" into the search box, it helpfully provides some suggestions based on its index — how much resources did that just use? When thousands of people are doing it every minute, how much will that slow down the servers compared to the normal load of page views and edits and so on?
Hello,
Simetrical wrote:
On 8/3/06, Julien Lemoine speedblue@happycoders.org wrote:
The algorithm uses very little CPU power. Imagine, for example, that all the needed data were in memory: the compiled automaton could be obtained in 1 minute.
Okay, that's good. But what I meant was, how much CPU power (and other resources) will the suggestion feature eat on a day-to-day basis? I start typing "Metasyntactic variable" into the search box, it helpfully provides some suggestions based on its index — how much resources did that just use? When thousands of people are doing it every minute, how much will that slow down the servers compared to the normal load of page views and edits and so on?
The search uses very few resources (only the RAM to store the index). The complexity in the worst case is O(log(A) * P), where A is the number of letters in the alphabet and P is the length of the word to search.
I wrote two implementations:
- single-threaded, with the index stored on disk (I implemented it for my server because I have little RAM available)
- multithreaded, with the whole index in RAM
With the second one, you will be able to handle a lot of queries per second with very little CPU power.
Best Regards. Julien Lemoine
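To illustrate the lookup cost Julien describes, here is a minimal Python sketch of a prefix search over an automaton whose nodes carry a precomputed "best articles" table. This is purely illustrative (the names and node layout are my assumptions, not his code); it uses a dict per node, so each step is O(1) rather than the O(log A) of a sorted edge list:

```python
# Hypothetical sketch (not Julien's actual code): prefix lookup in a
# deterministic automaton whose nodes carry precomputed best completions.

class Node:
    def __init__(self):
        self.children = {}  # character -> Node (one transition per character)
        self.best = []      # precomputed indices of the top articles below this node

def suggest(root, prefix, articles):
    """Walk the automaton along `prefix` and return the precomputed suggestions."""
    node = root
    for ch in prefix:                  # O(P) steps for a prefix of length P
        node = node.children.get(ch)   # dict lookup; a sorted edge list would be O(log A)
        if node is None:
            return []                  # no article title starts with this prefix
    return [articles[i] for i in node.best]
```

The key point is that ranking costs nothing at query time: the per-node `best` table is filled once at compile time, so a keystroke only pays for the walk.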
Julien Lemoine wrote:
Hello,
Steve Bennett wrote:
This is absolutely fantastic, and would greatly benefit Wikipedia if cleaned up a little and integrated. What data is it running on?
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
Timwi
Hello,
Timwi wrote:
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
Yes, this is a directed graph; each transition from one node to another is labeled with one UTF-8 character. The graph is deterministic: you cannot have two identical transitions leaving the same node. For each article, I added the article name to this automaton, with the final node containing a pointer to the article details (url, name, number of links). After the article names were added to the automaton, I computed, for each node, a table of the best completions, containing pointers to the articles with the highest number of links.
When I say pointers, these are indices into a vector storing all the details about the articles.
You can have a small example (figure) in my FAQ : http://suggest.speedblue.org/FAQ.php
I can give you more details about the number of nodes/transitions in the automaton if you want. I hope this gives you a more precise idea of how it works.
Please, don't hesitate to ask me more details. Best Regards. Julien Lemoine
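For the curious, the construction described above can be sketched in a few lines of Python. This is an illustration under assumed names (`Node`, `TOP_K`, etc.), not Julien's implementation: a deterministic trie over titles, final nodes indexing into a vector of article details, and a per-node table of the most-linked articles computed bottom-up:

```python
# Rough sketch of the construction described above (assumed names and layout,
# not the actual code): a deterministic trie over article titles, with each
# node keeping the indices of the most-linked articles below it.

TOP_K = 10  # assumed cutoff for the per-node "best" table

class Node:
    def __init__(self):
        self.children = {}   # one outgoing transition per character
        self.article = None  # index into `articles` if a title ends here
        self.best = []       # indices of the TOP_K most-linked articles below

def build(articles):
    """articles: list of (title, url, nb_links) tuples."""
    root = Node()
    for idx, (title, _url, _links) in enumerate(articles):
        node = root
        for ch in title:
            node = node.children.setdefault(ch, Node())
        node.article = idx
    fill_best(root, articles)
    return root

def fill_best(node, articles):
    # Post-order: merge the children's tables, then keep the TOP_K by link count.
    candidates = [] if node.article is None else [node.article]
    for child in node.children.values():
        fill_best(child, articles)
        candidates.extend(child.best)
    candidates.sort(key=lambda i: -articles[i][2])
    node.best = candidates[:TOP_K]
```

The bottom-up merge works because sibling subtrees hold disjoint sets of titles, so no deduplication is needed.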
Julien Lemoine wrote:
Hello,
Timwi wrote:
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
Yes this is a directed graph [... etc. ...]
I'm very impressed! You've clearly thought a lot about this, and implemented it the right way.
I don't have any further questions, my curiosity is satisfied :)
On 8/3/06, Timwi timwi@gmx.net wrote:
Julien Lemoine wrote:
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
The right data structure for this sort of thing is called a "trie", and Googling it will turn up lots of information. Though I'm not sure if the original poster knows this term.
On 8/3/06, Evan Martin evanm@google.com wrote:
On 8/3/06, Timwi timwi@gmx.net wrote:
Julien Lemoine wrote:
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
The right data structure for this sort of thing is called a "trie", and Googling it will turn up lots of information. Though I'm not sure if the original poster knows this term.
*grumble reply-to*
Sorry, was meant to be an off-list reply.
Evan Martin wrote:
On 8/3/06, Timwi timwi@gmx.net wrote:
Julien Lemoine wrote:
First of all, I grabbed the whole content of the English/French Wikipedia in June 2006. Then everything is compiled into a finite state machine (about 200 MB for the English Wikipedia).
Can you explain in greater detail what your finite-state machine looks like? Is it like a directed graph?
The right data structure for this sort of thing is called a "trie", and Googling it will turn up lots of information. Though I'm not sure if the original poster knows this term.
I know, and his is indeed a trie. But that word is rarely used. Look at his work: he has perfectly implemented a trie, with a lot of thought, but without knowing the word "trie" :-p
Timwi
On 8/3/06, Timwi timwi@gmx.net wrote:
I know, and his is indeed a trie. But that word is rarely used. Look at his work: He has perfectly implemented a trie with a lot of thought, but without knowing the word "trie" :-p
Oh, you're bringing back memories of computer science... I half expect someone to mention BSTs or radix trees or something next...
(but please don't :))
Steve
Just a thought: if you want to *really* impress everyone, incorporate special handling for disambiguation pages. This could be tricky, but one of these two methods might work, for a given search string such as "Friends":
First, check if "Friends (disambiguation)" exists.
1) If it does, simply display it in a panel to the right and let users click on the links in it.
2) If it does, load it and attempt to parse it, showing its links in the list, differentiated somehow.
On second thought, 1) is probably better, as you really need the line of disambiguating text to know whether you want the 1971 film or the 2001 film, or even "Friend (film)". What to do about entries like the last two (no real link, just synonyms) is tricky...
Steve
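Option 1) above could start as a simple existence check against the set of titles. A hypothetical Python sketch (`existing_titles` is an assumed in-memory set, not part of the actual tool):

```python
def disambiguation_panel(search, existing_titles):
    """If 'X (disambiguation)' exists, return its title for display in a side panel.

    `existing_titles` is an assumed in-memory set of article titles.
    """
    candidate = search + " (disambiguation)"
    return candidate if candidate in existing_titles else None
```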
Oh crap. This is seriously ****ing awesome Julien!
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
Hello,
I have written a "Google Suggest"-like service for Wikipedia, under the GPL licence [1]. I am not sure whether this project will interest you; I am open to all comments from your community.
I wrote a simple web site for English/French at http://suggest.speedblue.org; all the software is available in the download section.
Best Regards. Julien Lemoine
[1] If you want me to change the licence to something other than GPL, please let me know.
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Just one more request: what if the search string matches *within* the phrase, but not necessarily at the start? Like "Hardy" matching "Laurel and Hardy". Is that possible with this data structure?
Steve
On 8/2/06, Julien Lemoine speedblue@happycoders.org wrote:
Hello,
I have written a "Google Suggest"-like service for Wikipedia, under the GPL licence [1]. I am not sure whether this project will interest you; I am open to all comments from your community.
I wrote a simple web site for English/French at http://suggest.speedblue.org; all the software is available in the download section.
Best Regards. Julien Lemoine
[1] If you want me to change the licence to something other than GPL, please let me know.
Steve Bennett wrote:
Just one more request: what if the search string matches *within* the phrase, but not necessarily at the start? Like "Hardy" matching "Laurel and Hardy". Is that possible with this data structure?
No, it will not. But I don't think this would be really useful: imagine a query with a single letter ("a" or "e", for example); it would match at least half the articles, and the results would be hard to interpret.
Best Regards. Julien Lemoine
On Fri, Aug 04, 2006 at 08:03:22AM +0200, Julien Lemoine wrote:
Steve Bennett wrote:
Just one more request: what if the search string matches *within* the phrase, but not necessarily at the start? Like "Hardy" matching "Laurel and Hardy". Is that possible with this data structure?
No, it will not. But I don't think this would be really useful: imagine a query with a single letter ("a" or "e", for example); it would match at least half the articles, and the results would be hard to interpret.
It would require you to apply some heuristics to determine what *not* to display, certainly...
but let me note that the incremental search feature in the phone book of my Blackberry -- which displays first the entries whose titles *start* with the substring, and then the entries which contain it elsewhere -- is pretty useful. It interacts with the height of the pulldown, of course, but it is pretty useful.
Cheers, -- jra
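The Blackberry-style ordering Jay describes (prefix matches first, then other substring matches) is easy to emulate with a naive linear scan over the titles; a hypothetical sketch, not something the automaton currently supports:

```python
def phone_book_suggest(query, titles, limit=10):
    """Prefix matches first, then other substring matches, as on the Blackberry."""
    q = query.lower()
    prefix = [t for t in titles if t.lower().startswith(q)]
    inside = [t for t in titles if q in t.lower() and not t.lower().startswith(q)]
    return (prefix + inside)[:limit]
```

A real deployment would of course need a secondary index rather than a scan over 1.2M titles per keystroke.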
"Hardy" is probably already redirected to "Laurel and Hardy", I guess... Better to keep it easy.
"Jay R. Ashworth" wrote:
On Fri, Aug 04, 2006 at 08:03:22AM +0200, Julien Lemoine wrote:
Steve Bennett wrote:
Just one more request: what if the search string matches *within* the phrase, but not necessarily at the start? Like "Hardy" matching "Laurel and Hardy". Is that possible with this data structure?
No, it will not. But I don't think this would be really useful: imagine a query with a single letter ("a" or "e", for example); it would match at least half the articles, and the results would be hard to interpret.
It would require you to apply some heuristics to determine what *not* to display, certainly...
but let me note that the incremental search feature in the phone book of my Blackberry -- which displays first the entries whose titles *start* with the substring, and then the entries which contain it elsewhere -- is pretty useful. It interacts with the height of the pulldown, of course, but it is pretty useful.
Cheers, -- jra
Julien Lemoine wrote:
Hi,
I have written a "Google Suggest"-like service for Wikipedia, under the GPL licence [1]. I am not sure whether this project will interest you; I am open to all comments from your community.
I wrote a simple web site for English/French at http://suggest.speedblue.org; all the software is available in the download section.
That reminds me of http://wiki.lumrix.net/en/. Don't know how that works, but maybe you'll have a look.
Jürgen