Hi Julien,
I have written a "Google Suggest"-like service for Wikipedia under the GPL licence [1]. I am not sure whether this project will interest you, but I am open to all comments from your community.
Yes!! Good stuff!
== UI stuff ==
The only two constructive suggestions for the User-Interface that I would make are:
1) That if the user presses 'Enter' in the search textbox whilst typing out a query, that it automatically choose/open/redirect to the first item in the list. That way I can type out what I want, and press enter to open the first link when I've typed enough to specify it well enough to get it to the top of the list, all without using the mouse.
2) Allow the user to press the down/up arrows to select/highlight a specified entry on the list (including but not limited to the first item), and press enter to open it. That way again the user can be lazy and can select a link without using the mouse, and without typing out the full title.
== More Technical stuff ==
1) How do you handle pages with the same title, but different capitalization? They're rare, but they do occur. My suspicion from scanning Analyzer.cpp is that you just take the most popular. However, if it's for search, it would be best to include everything (I think).
2) Doesn't seem to include redirects. For example, when I search for "Formula weight", it's not listed, but on the EN Wikipedia "Formula weight" is a redirect to "Atomic mass". It would definitely be better to include redirects (in my personal opinion).
However the downside of including these two things is that the amount of data that you need to store goes up. I've actually had a go at a very similar problem (storing a memory index of all article names, but meeting the two conditions specified above; plus, for redirects, I would also store the name of the article it redirected to [something which could potentially also be useful for your suggest service if you wanted to show this information too]). However this was in PHP, and it was a complete memory hog (think > 1 GB for the memory index). My solution (since it was only for me) was to just "buy more RAM", but I like your approach of getting more efficient. By the way, the reason I was doing this was for suggesting links that could be made in wiki text - just so you know I'm not in competition with you, but that the problems we face are similar in some ways, and could maybe benefit from a common solution.
How big would a memory index be that had these properties (i.e. including all NS:0 articles/redirects, and maybe the targets for redirects)?
All the best, Nick.
Hi Nick,
* Nick Jenkins nickpj@gmail.com [2006-08-03 16:16:23 +1000]:
== UI stuff ==
The only two constructive suggestions for the User-Interface that I would make are:
- That if the user presses 'Enter' in the search textbox whilst
typing out a query, that it automatically choose/open/redirect to the first item in the list. That way I can type out what I want, and press enter to open the first link when I've typed enough to specify it well enough to get it to the top of the list, all without using the mouse.
- Allow the user to press the down/up arrows to select/highlight a
specified entry on the list (including but not limited to the first item), and press enter to open it. That way again the user can be lazy and can select a link without using the mouse, and without typing out the full title.
Your ideas are good, I added them to my TODO list :)
== More Technical stuff ==
- How do you handle pages with the same title, but different
capitalization? They're rare, but they do occur. My suspicion from scanning Analyzer.cpp is that you just take the most popular. However, if it's for search, it would be best to include everything (I think).
Yes, you are right: the most popular one is kept. I wanted a case-insensitive search, and for the moment I have only one article per final node. But it would be better to include everything; I have added that to my TODO list as well.
- Doesn't seem to include redirects. For example, when I search
for "Formula weight", it's not listed, but on the EN Wikipedia "Formula weight" is a redirect to "Atomic mass". It would definitely be better to include redirects (in my personal opinion).
I will generate a new index with redirects and give you the before/after sizes this evening.
However the downside of including these two things is that the amount of data that you need to store goes up. I've actually had a go at a very similar problem (storing a memory index of all article names, but meeting the two conditions specified above; plus, for redirects, I would also store the name of the article it redirected to [something which could potentially also be useful for your suggest service if you wanted to show this information too]). However this was in PHP, and it was a complete memory hog (think > 1 GB for the memory index). My solution (since it was only for me) was to just "buy more RAM", but I like your approach of getting more efficient. By the way, the reason I was doing this was for suggesting links that could be made in wiki text - just so you know I'm not in competition with you, but that the problems we face are similar in some ways, and could maybe benefit from a common solution. How big would a memory index be that had these properties (i.e. including all NS:0 articles/redirects, and maybe the targets for redirects)?
The current automaton (index) of all Wikipedia articles (without redirects) needs 127 MB; I don't think adding redirects will increase its size much. I will give you the exact size with redirects this evening :) (Paris time).
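As an aside, for anyone curious how such an index answers the queries at all, here is a toy C++ sketch using a plain sorted std::map with made-up titles and link counts. The project's minimized automaton is of course far more compact than this, but the prefix-lookup idea is the same:

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Title -> number of incoming links (toy data, not real counts).
        std::map<std::string, unsigned> index = {
            {"Atom", 5000},
            {"Atomic mass", 950},
            {"Atomic number", 1200},
            {"Atomic orbital", 640},
        };

        const std::string prefix = "Atomic";

        // All titles sharing the prefix form one contiguous range in the sorted map.
        std::vector<std::pair<std::string, unsigned>> hits;
        for (auto it = index.lower_bound(prefix);
             it != index.end() && it->first.compare(0, prefix.size(), prefix) == 0;
             ++it) {
            hits.push_back(*it);
        }

        // Rank suggestions by popularity (incoming-link count), most linked first.
        std::sort(hits.begin(), hits.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& h : hits) {
            std::cout << h.first << " (" << h.second << " links)\n";
        }
    }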
Best Regards. Julien Lemoine
On 8/3/06, Julien Lemoine speedblue@happycoders.org wrote:
- How do you handle pages with the same title, but different
capitalization? They're rare, but they do occur. My suspicion from scanning Analyzer.cpp is that you just take the most popular. However, if it's for search, it would be best to include everything (I think).
Yes, you are right: the most popular one is kept. I wanted a case-insensitive search, and for the moment I have only one article per final node. But it would be better to include everything; I have added that to my TODO list as well.
I didn't notice this - yes, a case-insensitive search is critical. Problems with casing are one of the most likely reasons a user will fail to find a given page. MediaWiki automatically finds the three most common casing options - all lower, ALL CAPS, Title Caps - but misses the likely "Caps of Important Words".
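One cheap way to keep every casing while still matching case-insensitively would be to index titles under a case-folded key and keep all original spellings behind it. A rough C++ sketch of the idea (ASCII-only folding and a hypothetical title pair, purely for illustration; real titles would need ICU, which the project already links against):

    #include <cctype>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // ASCII-only case folding, for illustration; real titles would need ICU.
    std::string foldKey(const std::string& title) {
        std::string key;
        key.reserve(title.size());
        for (unsigned char c : title) {
            key.push_back(static_cast<char>(std::tolower(c)));
        }
        return key;
    }

    int main() {
        // One case-insensitive key can hold several distinct article titles.
        std::map<std::string, std::vector<std::string>> index;
        for (const std::string& title : {"Red meat", "Red Meat"}) {  // hypothetical pair
            index[foldKey(title)].push_back(title);
        }

        // A query for "red meat" now returns every casing instead of just one.
        for (const std::string& title : index[foldKey("red meat")]) {
            std::cout << title << "\n";
        }
    }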
I will generate a new index with redirects and give you the before/after sizes this evening.
Redirects should be indicated with what they redirect to, like:
... Top-rope climbing -> top roping Top roping ...
Or something - but maybe with the proviso that it doesn't show a redirect if the actual target is currently being displayed.
Steve
Hi Steve,
I will generate a new index with redirects and give you the before/after sizes this evening.
Redirects should be indicated with what they redirect to, like:
... Top-rope climbing -> top roping Top roping ...
Or something - but maybe with the proviso that it doesn't show a redirect if the actual target is currently being displayed.
I updated the analyzer to add the redirects. I have done it on the French Wikipedia: the automaton size is now 42 MB instead of 32 MB. I will do the same on English tomorrow and give you the new automaton size.
Best Regards. Julien Lemoine
* Steve Bennett stevage@gmail.com [2006-08-03 09:17:40 +0200]:
I will generate a new index with redirects and give you the before/after sizes this evening.
Redirects should be indicated with what they redirect to, like:
... Top-rope climbing -> top roping Top roping ...
Or something - but maybe with the proviso that it doesn't show a redirect if the actual target is currently being displayed.
I uploaded the index with redirects (the automaton now has a size of 200 MB instead of 132 MB). But with the query "a", I got 10 aliases of 'United States' starting with "a". I will probably force the top 10 results to contain unique articles; I will change this behavior this weekend.
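One way to force unique articles would be to deduplicate by redirect target while collecting the top results. A small C++ sketch of that idea (the Suggestion struct and its field names are invented for illustration and are not taken from the actual code):

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    struct Suggestion {
        std::string title;   // title shown to the user (possibly a redirect)
        std::string target;  // canonical article the title resolves to
    };

    // Keep at most `limit` suggestions, never listing two entries that resolve
    // to the same article. Input is assumed to be ranked already.
    std::vector<Suggestion> uniqueTop(const std::vector<Suggestion>& ranked,
                                      std::size_t limit) {
        std::vector<Suggestion> out;
        std::set<std::string> seenTargets;
        for (const Suggestion& s : ranked) {
            if (out.size() >= limit) break;
            if (seenTargets.insert(s.target).second) {  // true if target is new
                out.push_back(s);
            }
        }
        return out;
    }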
Best Regards. Julien Lemoine
Hi,
== UI stuff ==
The only two constructive suggestions for the User-Interface that I would make are:
- That if the user presses 'Enter' in the search textbox whilst
typing out a query, that it automatically choose/open/redirect to the first item in the list. That way I can type out what I want, and press enter to open the first link when I've typed enough to specify it well enough to get it to the top of the list, all without using the mouse.
- Allow the user to press the down/up arrows to select/highlight a
specified entry on the list (including but not limited to the first item), and press enter to open it. That way again the user can be lazy and can select a link without using the mouse, and without typing out the full title.
These two features are available now. You can use them :)
Best Regards. Julien Lemoine
Hi Julien,
== UI stuff ==
The only two constructive suggestions for the User-Interface that I would make are:
These two features are available now. You can use them :)
Excellent! Thank you, they work great.
A few more suggestions / comments / ideas after compiling it:
== Small addition for README file ==
Perhaps add this to the end of the README file (about lib requirements) - (these worked for me) :
# If you're on Debian/Ubuntu, you can probably install expat & icu with: apt-get install libicu34 libexpat1 libexpat1-dev libicu34-dev
# Then compile with: ./configure && make
== Pull data from MySQL? ==
Currently the data used by cmd/Analyzer appears to be generated from the XML data file (e.g. the 1.1 gigabyte XmlEnWikipedia.tar.bz2 file).
Could it be possible instead to generate the required data from a MySQL database? That way the whole XML step could be avoided by just connecting directly to the database, and generating the required data directly from the database.
Also this way, you _may_ be able to get away with reading far less data, so it could potentially be quicker. It would also be more up-to-date, because there would be no intermediate step to add to the latency.
For example, this way you could get the number of links to each page by doing something like this: select pl_title, count(*) from pagelinks group by pl_title;
... and you could get the valid page names by doing something like this: select page_title from page where page_namespace = 0;
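For instance, a minimal reader over those two queries might look roughly like this in C++ with the MySQL C client API (hostname, credentials and database name are placeholders, and real MediaWiki tables may also carry a configurable prefix):

    #include <cstdio>
    #include <mysql/mysql.h>

    int main() {
        MYSQL* conn = mysql_init(nullptr);
        // Placeholder credentials -- adjust for the local wiki database.
        if (!mysql_real_connect(conn, "localhost", "wikiuser", "secret",
                                "wikidb", 0, nullptr, 0)) {
            std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }

        // Incoming-link counts per title, as in the query above.
        if (mysql_query(conn,
                "SELECT pl_title, COUNT(*) FROM pagelinks GROUP BY pl_title")) {
            std::fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            return 1;
        }

        MYSQL_RES* res = mysql_store_result(conn);
        MYSQL_ROW row;
        while ((row = mysql_fetch_row(res)) != nullptr) {
            // row[0] = title, row[1] = link count; feed these to the index builder.
            std::printf("%s\t%s\n", row[0], row[1]);
        }
        mysql_free_result(res);
        mysql_close(conn);
    }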
== Maybe include JS / HTML files ==
Maybe bundle the JS / HTML / PHP / Perl files that you're using at http://suggest.speedblue.org/ with the wikipedia-suggest.tar.gz archive? Basically I'd like to try and reproduce what you've got happening on your website so that I can play with it too ;-)
== Documentation for Gumbies ==
Potentially stupid question, but how do I use the resulting executable files in the 'cmd' directory? I couldn't see documentation on this in the wikipedia-suggest.tar.gz archive (but it is version 0.1, so it's not unexpected). Or to put it differently, what series of commands did you use to get your http://suggest.speedblue.org/ site working?
E.g. wget http://www2.speedblue.org/download/XmlEnWikipedia.tar.bz2 # How did you create XmlEnWikipedia.tar.bz2 by the way? E.g. Did it come from doing something to http://download.wikimedia.org/enwiki/20060717/enwiki-20060717-pages-articles... ?
cmd/Analyzer XmlEnWikipedia.tar.bz2 fsa.bin page.bin # (i.e. Not sure what creates fsa.bin & page.bin , or how to persuade it to do so )
Then presumably either cmd/Query or cmd/TcpQuery is invoked on fsa.bin and page.bin, and connected to somehow to query for results. # What's the difference between these two versions? E.g. is one the DiskQuery implementation, and the other the MemoryQuery implementation? Or is it just how you connect to them (e.g. one via TCP/IP, the other via some other method?)
Sorry to ask so many questions!
All the best, Nick.
Hello Nick,
* Nick Jenkins nickpj@gmail.com [2006-08-04 16:54:20 +1000]:
A few more suggestions / comments / ideas after compiling it:
== Small addition for README file ==
Perhaps add this to the end of the README file (about lib requirements) - (these worked for me) :
# If you're on Debian/Ubuntu, you can probably install expat & icu with: apt-get install libicu34 libexpat1 libexpat1-dev libicu34-dev
# Then compile with: ./configure && make
Thank you for your contribution, I will add it to the README file (you will also need to install the cppunit-dev package for make check).
== Pull data from MySQL? ==
Currently the data used by cmd/Analyzer appears to be generated from the XML data file (e.g. the 1.1 gigabyte XmlEnWikipedia.tar.bz2 file).
Could it be possible instead to generate the required data from a MySQL database? That way the whole XML step could be avoided by just connecting directly to the database, and generating the required data directly from the database.
Also this way, you _may_ be able to get away with reading far less data, so it could potentially be quicker. It would also be more up-to-date, because there would be no intermediate step to add to the latency.
For example, this way you could get the number of links to each page by doing something like this: select pl_title, count(*) from pagelinks group by pl_title;
... and you could get the valid page names by doing something like this: select page_title from page where page_namespace = 0;
It is definitely possible, and it will probably be the best way to integrate it into Wikipedia. I preferred working with XML files to produce a proof of concept, because I do not know the details of the Wikipedia database architecture (and I have other proofs of concept in mind :))
== Maybe include JS / HTML files ==
Maybe bundle the JS / HTML / PHP / Perl files that you're using at http://suggest.speedblue.org/ with the wikipedia-suggest.tar.gz archive? Basically I'd like to try and reproduce what you've got happening on your website so that I can play with it too ;-)
No problem, I will put them in the tarball.
== Documentation for Gumbies ==
Potentially stupid question, but how do I use the resulting executable files in the 'cmd' directory? I couldn't see documentation on this in the wikipedia-suggest.tar.gz archive (but it is version 0.1, so it's not unexpected). Or to put it differently, what series of commands did you use to get your http://suggest.speedblue.org/ site working?
The tarball contains only the analyzer/query binaries. The analyzer needs to be called with two arguments: the first is a file containing all the XML filenames to analyze (one filename per line), and the second is the path to the redirect.xml file (more details about how to obtain these XML and redirect.xml files are given below).
The query and tcpquery binaries load the output of the analyzer to perform queries.
E.g. wget http://www2.speedblue.org/download/XmlEnWikipedia.tar.bz2 # How did you create XmlEnWikipedia.tar.bz2 by the way? E.g. Did it come from doing something to http://download.wikimedia.org/enwiki/20060717/enwiki-20060717-pages-articles... ?
Hmm, I did not know about the availability of these XML archives. In fact, I wrote scripts to grab the whole content of Wikipedia. You will find details about the grabbing in the history section (June 2006): http://suggest.speedblue.org/history.php
cmd/Analyzer XmlEnWikipedia.tar.bz2 fsa.bin page.bin # (i.e. Not sure what creates fsa.bin & page.bin , or how to persuade it to do so )
fsa.bin and page.bin are created by the analyzer.
Then presumably either cmd/Query or cmd/TcpQuery is invoked on fsa.bin and page.bin, and connected to somehow to query for results. # What's the difference between these two versions? E.g. is one the DiskQuery implementation, and the other the MemoryQuery implementation? Or is it just how you connect to them (e.g. one via TCP/IP, the other via some other method?)
* cmd/query is a command-line version using DiskQuery
* cmd/TcpQuery is a TCP/IP version using DiskQuery
* DiskQuery does not load the automaton and articles into memory; it keeps only a file descriptor and performs seeks (I implemented it because I have little RAM available on my server).
* MemoryQuery loads the automaton and articles into memory; it is thread-safe and can be used to perform many queries per second.
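Schematically, the trade-off between the two looks like this (only a sketch of the two access patterns, not the project's actual classes):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // DiskQuery style: keep only an open stream, seek to each record on demand.
    // Very cheap on RAM, pays one seek + read per lookup.
    std::string readRecordFromDisk(std::ifstream& file, std::uint64_t offset,
                                   std::size_t length) {
        std::string buf(length, '\0');
        file.seekg(static_cast<std::streamoff>(offset));
        file.read(&buf[0], static_cast<std::streamsize>(length));
        return buf;
    }

    // MemoryQuery style: load the whole file once; afterwards every lookup is
    // served from RAM, which is what allows many queries per second.
    std::vector<char> loadWholeFile(const std::string& path) {
        std::ifstream file(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(file),
                                 std::istreambuf_iterator<char>());
    }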
Sorry for the lack of documentation. Best Regards. Julien Lemoine
Julien Lemoine schrieb:
Hmm, I did not know about the availability of these XML archives. In fact, I wrote scripts to grab the whole content of Wikipedia. You will find details about the grabbing in the history section (June 2006): http://suggest.speedblue.org/history.php
I suggest you remove those scripts from your page as this is definitely bad behaviour against Wikipedia and other people should not even be encouraged to do the same. That's what the database dumps are for.
Ciao, Michael.
Hello Michael,
Michael Keppler wrote:
Julien Lemoine schrieb:
Hmm, I did not know about the availability of these XML archives. In fact, I wrote scripts to grab the whole content of Wikipedia. You will find details about the grabbing in the history section (June 2006): http://suggest.speedblue.org/history.php
I suggest you remove those scripts from your page as this is definitely bad behaviour against Wikipedia and other people should not even be encouraged to do the same. That's what the database dumps are for.
You are right: I removed these scripts from the download/history section and replaced them with a link to download.wikipedia.org. Sorry, I did not know about this dump service before.
Best Regards Julien Lemoine
Hello,
I have made some modifications this weekend, mostly on the compiler side:
- I added a small heuristic inside the compiler so that only unique articles appear among the best records;
- I added a backend in the compiler using slist to use less memory;
- I added all the PHP/HTML/JS pages to the tgz (in the extra directory);
- I also improved the README a little (I added your comments).
You can view the results at http://suggest.speedblue.org (with redirects).
You can download the source from http://suggest.speedblue.org/tgz/wikipedia-suggest-0.2.tar.gz
You can download the compiled resources (little-endian only, because my serialization is endianness-dependent):
for English: http://www2.speedblue.org/download/WikipediaSuggestCompiledEN.tar.bz2
for French: http://www2.speedblue.org/download/WikipediaSuggestCompiledFR.tar.bz2
I will have a look at the SQL dumps available at download.wikipedia.org and try to write a compiler that uses only the SQL resources. Which kind of database is used for Wikipedia, MySQL? I will probably use OTL (otl.sourceforge.net), which will use unix-odbc for MySQL.
I am open to any kind of suggestion/modification. Best Regards. Julien Lemoine
On 8/6/06, Julien Lemoine speedblue@happycoders.org wrote:
You can view the results at http://suggest.speedblue.org (with redirects).
The redirection stuff is really great. I'll try and come up with something to quibble about soon.
Steve
On 8/6/06, Julien Lemoine speedblue@happycoders.org wrote:
I am open to any kind of suggestion/modification.
Ok, here's an example of something which doesn't work quite right: look up "IPA". There are two problems:
1) The most likely interpretation is "International Phonetic Alphabet", which is returned as the first link, but via a bizarre redirect: "IPAlpha". Is it possible to take into account how often a redirect is linked to?
2) The very useful link "IPA (disambiguation)" does not appear at all.
Thus my suggestions for the heuristic:
* Any page that exactly matches the search term (redirect or not) should be #1
* Any page which is the search term + " (disambiguation)" [or equivalent for other languages] should be #2
* Ranks #3-10 should be sorted by links, as at present.
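A quick C++ sketch of that ordering as a comparator (the Entry fields and the English " (disambiguation)" suffix are purely illustrative; other languages would need their own suffix):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Entry {
        std::string title;
        unsigned incomingLinks;
    };

    // Exact match first, then "<term> (disambiguation)", then the rest by links.
    void rankCandidates(std::vector<Entry>& candidates, const std::string& term) {
        const std::string disambig = term + " (disambiguation)";
        auto rankClass = [&](const Entry& e) {
            if (e.title == term) return 0;      // exact match -> #1
            if (e.title == disambig) return 1;  // disambiguation page -> #2
            return 2;                           // everything else
        };
        std::sort(candidates.begin(), candidates.end(),
                  [&](const Entry& a, const Entry& b) {
                      const int ra = rankClass(a);
                      const int rb = rankClass(b);
                      if (ra != rb) return ra < rb;
                      return a.incomingLinks > b.incomingLinks;  // more links first
                  });
    }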
Steve
Hello Steve,
Steve Bennett wrote:
On 8/6/06, Julien Lemoine speedblue@happycoders.org wrote:
I am open to any kind of suggestion/modification.
Ok, here's an example of something which doesn't work quite right: look up "IPA". There are two problems:
1) The most likely interpretation is "International Phonetic Alphabet", which is returned as the first link, but via a bizarre redirect: "IPAlpha". Is it possible to take into account how often a redirect is linked to?
2) The very useful link "IPA (disambiguation)" does not appear at all.
Thus my suggestions for the heuristic:
- Any page that exactly matches the search term (redirect or not) should be #1
- Any page which is the search term + " (disambiguation)" [or
equivalent for other languages] should be #2
- Ranks #3-10 should be sorted by links, as at present.
You are right: I decided to keep only one link per article, but without taking links and exact matching into account. I will add a similarity measure like the one you described above to fix that.
Thank you for your great comment. Best Regards. Julien Lemoine