Hi Julien,
You can view results on http://suggest.speedblue.org (with redirections)
I like it! E.g. "formula we" shows the results as "Formula Weight → Atomic mass", so straight away you can see it's a redirect, plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a little; For example, in the English Wikipedia there is both "Adfa" (a town in Wales), and "ADFA" (a redirect to an Australian military training institute), but only the town in Wales is listed in the suggest search. Ideally, could it maybe list both, with the town first (since it has one incoming link, and the institute has none), perhaps unless the user types "ADFA" in all-caps, in which case maybe "ADFA" could come first because it would be a better match for what the user has typed? In other words, I still think the searching could be case-insensitive, just that maybe the ordering of the results could be altered by capitalization.
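A minimal sketch of the ordering proposed above, assuming a hypothetical in-memory table that maps each title to its incoming-link count (the function name `suggest` and the `TITLES` table are illustrative only, not part of wikipedia-suggest). Matching is case-insensitive; titles whose capitalization matches what the user typed sort first, then titles with more incoming links.

```python
# Hypothetical data: title -> number of incoming links (values taken from the example above).
TITLES = {"Adfa": 1, "ADFA": 0}


def suggest(prefix, titles, limit=10):
    """Case-insensitive prefix search whose ordering is influenced by capitalization."""
    folded = prefix.casefold()
    matches = [t for t in titles if t.casefold().startswith(folded)]
    # Sort key: exact-case prefix matches first, then more incoming links, then title.
    matches.sort(key=lambda t: (not t.startswith(prefix), -titles[t], t))
    return matches[:limit]


print(suggest("adfa", TITLES))  # town first: it has more incoming links
print(suggest("ADFA", TITLES))  # all-caps redirect first: exact-case match
```

The search itself never depends on case; only the sort key does, so both entries are always listed.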
You can download the source from http://suggest.speedblue.org/tgz/wikipedia-suggest-0.2.tar.gz
You can download compiled resources (for little-endian machines only, because my serialization is endianness-dependent):
Cool, I got it to do stuff with the precompiled files:
=========================================
root@bling:~/tmp/wikipedia-suggest/wikipedia-suggest-0.2/cmd# ./Query ../../EN/fsa.bin ../../EN/pages.bin
Loaded position of 2171116 articles
test
Title : [Test cricket] Freq : 5776
Title : [Testament, New → New Testament] Freq : 2017
Title : [Testament, Old → Old Testament] Freq : 1700
Title : [Testudines → Turtle] Freq : 527
Title : [Testosterone] Freq : 355
Title : [Test Pilot] Freq : 309
Title : [Test Messaging → Short message service] Freq : 289
Title : [Testicle] Freq : 276
Title : [Testprog → Development stage] Freq : 268
Title : [Testudinidae → Tortoise] Freq : 252
=========================================
Which kind of database is used for wikipedia, mysql ?
Yes, MySQL. There is a port for Postgres in progress (try saying that fast 10 times), but you probably want to target MySQL first because it's the database that's being used at the moment in most environments.
- MemoryQuery loads the automaton and articles into memory; it is
thread-safe and you can use it to perform a lot of queries per second.
Ok, then MemoryQuery is the one I want, as I am generally happy to trade RAM for speed. How do I tell it to use MemoryQuery mode instead of DiskQuery mode? E.g. do I need to compile cmd/query and cmd/TcpQuery with a special flag? Or is there a way to pass a command line argument to cmd/query and cmd/TcpQuery to tell them to start in MemoryQuery rather than DiskQuery mode? (This would have the advantage of letting the user choose the most suitable mode at runtime, and it could default to disk-based querying so that the default behaviour stays the same as it is at the moment.)
- I added all php/html/js pages in the tgz (in the extra directory) ;
Thank you!
One small thing I've noticed is that it won't find articles/redirects that start with a quote - for example: http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/tcpQuery JavaScript response is not currently escaping quotes. For example, when I ask it for strings starting with " then tcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
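One way to produce a safely quoted response is sketched below, in Python rather than the C++ of tcpQuery. It relies on the fact that a JSON string is also a valid JavaScript string literal, so `json.dumps` takes care of quotes and backslashes. The helper name `build_empty_response` is hypothetical, and the three empty arrays are simply copied from the response shown above.

```python
import json


def js_string_literal(text):
    """Return text as a double-quoted JavaScript string literal with quotes escaped."""
    return json.dumps(text)


def build_empty_response(query):
    """Build a sendRes(...) call for a query with no results (hypothetical helper)."""
    return "sendRes(" + js_string_literal(query) + ", new Array(), new Array(), new Array());"


print(build_empty_response('"'))
```

The same escaping would need to be applied to every title placed inside the arrays, not only to the echoed query.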
----
Also, I tried testing TcpQuery with random binary input, and it seemed sometimes to be much slower on some single-character input. For example, here is a log of it running :
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
=======================================
<?php

error_reporting(E_ALL | E_STRICT);

// function internals taken from extra/query.php
function getResults($content) {
    /* Create a TCP/IP socket. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        // socket_create() returns false on failure, not a negative number
        echo "socket_create() failed: reason: " . socket_strerror(socket_last_error()) . "\n";
        return "";
    }

    $result = socket_connect($socket, "127.0.0.1", 666);
    if ($result === false) {
        echo "socket_connect() failed.\nReason: " . socket_strerror(socket_last_error($socket)) . "\n";
        return "";
    }

    // Send the query followed by a blank line, then read until the server closes.
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));
    $res = "";
    while ($out = socket_read($socket, 2048)) {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}

// Time a one-character query for every possible byte value.
for ($i = 0; $i <= 255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " . ($after - $before) . "\n";
}

?>
=======================================
The slow ones to respond (with 0 chars back) were:
=======================================
i=0 got 0 chars back : Time = 1.01103210449
^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NULL character, and the LF / linefeed character.
No idea if that is something that could or should be fixed or not.
All the best, Nick.
Hello Nick,
Nick Jenkins wrote:
I like it! E.g. "formula we" shows the results as "Formula Weight → Atomic mass", so straight away you can see it's a redirect, plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a little; For example, in the English Wikipedia there is both "Adfa" (a town in Wales), and "ADFA" (a redirect to an Australian military training institute), but only the town in Wales is listed in the suggest search. Ideally, could it maybe list both, with the town first (since it has one incoming link, and the institute has none), perhaps unless the user types "ADFA" in all-caps, in which case maybe "ADFA" could come first because it would be a better match for what the user has typed? In other words, I still think the searching could be case-insensitive, just that maybe the ordering of the results could be altered by capitalization.
I will keep the match case-insensitive but will keep all the different capitalizations; it will be in version 0.3 :)
Which kind of database is used for wikipedia, mysql ?
Yes, MySQL. There is a port for Postgres in progress (try saying that fast 10 times), but you probably want to target MySQL first because it's the database that's being used at the moment in most environments.
Ok, thanks for the clarification. I will work on this next week.
- MemoryQuery loads the automaton and articles into memory; it is
thread-safe and you can use it to perform a lot of queries per second.
Ok, then MemoryQuery is the one I want, as I am generally happy to trade RAM for speed. How do I tell it to use MemoryQuery mode instead of DiskQuery mode? E.g. do I need to compile cmd/query and cmd/TcpQuery with a special flag? Or is there a way to pass a command line argument to cmd/query and cmd/TcpQuery to tell them to start in MemoryQuery rather than DiskQuery mode? (This would have the advantage of letting the user choose the most suitable mode at runtime, and it could default to disk-based querying so that the default behaviour stays the same as it is at the moment.)
You are right, I will add a command line argument to choose between the disk and memory implementations.
- I added all php/html/js pages in the tgz (in the extra directory) ;
Thank you!
One small thing I've noticed is that it won't find articles/redirects that start with a quote - for example: http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/tcpQuery JavaScript response is not currently escaping quotes. For example, when I ask it for strings starting with " then tcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
Oops, I will also fix that :)
Also, I tried testing TcpQuery with random binary input, and it seemed sometimes to be much slower on some single-character input. For example, here is a log of it running :
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
=======================================
<?php

error_reporting(E_ALL | E_STRICT);

// function internals taken from extra/query.php
function getResults($content) {
    /* Create a TCP/IP socket. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        // socket_create() returns false on failure, not a negative number
        echo "socket_create() failed: reason: " . socket_strerror(socket_last_error()) . "\n";
        return "";
    }

    $result = socket_connect($socket, "127.0.0.1", 666);
    if ($result === false) {
        echo "socket_connect() failed.\nReason: " . socket_strerror(socket_last_error($socket)) . "\n";
        return "";
    }

    // Send the query followed by a blank line, then read until the server closes.
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));
    $res = "";
    while ($out = socket_read($socket, 2048)) {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}

// Time a one-character query for every possible byte value.
for ($i = 0; $i <= 255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " . ($after - $before) . "\n";
}

?>
=======================================
The slow ones to respond (with 0 chars back) were:
=======================================
i=0 got 0 chars back : Time = 1.01103210449
^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NULL character, and the LF / linefeed character.
No idea if that is something that could or should be fixed or not.
Strange, thank you for the report. I will have a look at this problem. Thank you very much for your comments; they will help me improve wikipedia-suggest.
Best Regards. Julien Lemoine
Hi Julien,
I will keep the match case-insensitive but will keep all the different capitalizations; it will be in version 0.3 :)
...
Ok, thanks for the clarification. I will work on this next week.
...
You are right, I will add a command line argument to choose between the disk and memory implementations.
...
Oops, I will also fix that :)
That all sounds good to me.
I.e. the NULL character, and the LF / linefeed character.
No idea if that is something that could or should be fixed or not.
Strange, thank you for the report. I will have a look at this problem.
I think this may just be the socket timing out. For example, if I do "telnet localhost <whatever-port-tcpQuery-is-on>", then after 1 second it times out and closes the connection. So no idea if it's a real problem or not to ignore NULL character + LF / linefeed character in the input, but I think that's what it's doing, and then the socket times out.
Also, I've made a few small changes to the web UI (extra directory), and the diff is at: http://files.nickj.org/MediaWiki/wikipedia-suggest-0.2-extra-diff.txt
It changes a few small things, namely:
* Added strict mode (for PHP5), changed from 'var' to 'private' to keep PHP5 strict mode happy, and added a quick accessor method for the $res attribute.
* Removed the leading "/" from paths for the image directory and query.php. This allows the web files to be placed in a subdirectory as well as the root directory, by using relative paths instead.
* When the user made a search (e.g. "fish"), then highlighted all their search terms, then pressed delete (so that the search field was now blank), then pressed arrow up or arrow down, it would show the old results (e.g. Fishing / FishBase / etc). To prevent this, added an "if" clause to only show results when there is something in the query field.
* Added one JS var declaration for the 'i' loop variable.
* Problem with quotes. For example, if I search for: "The devil", then it returns many links, including "The Devil's Rejects", "The Devil's Advocate (film)", and so forth. However when I search for "The devil's", the results become empty, but we know from before that there are suitable matches. The reason is that the query string comes through as "The Devil\'s" (e.g. can see this at http://suggest.speedblue.org/query.php?query=The%20Devil%27s ), so one way to avoid this is just to use stripslashes in query.php, so I added this.
* I also tried experimenting with adding a "changeHighlight" function and HTML IDs to each table row (to save redrawing the table when the user presses the up or down arrow keys), but it doesn't seem to make much difference either way (doesn't seem any slower or faster), so you might not want to use that.
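To illustrate the quotes item: if the query reaches query.php with a backslash inserted before the apostrophe (which would be consistent with PHP's magic_quotes_gpc setting of that era, although that is an assumption here), the lookup is done on a string that matches no title. The sketch below, in Python, approximates what PHP's stripslashes does for the common cases; `strip_slashes` is an illustrative name and not a full reimplementation of the PHP function.

```python
import re


def strip_slashes(text):
    """Drop one level of backslash quoting: \\' becomes ', and \\\\ becomes \\."""
    # Each backslash is removed and the character after it (if any) is kept as-is.
    return re.sub(r"\\(.?)", r"\1", text, flags=re.DOTALL)


raw_query = "The Devil\\'s"      # what the script would receive under this assumption
print(strip_slashes(raw_query))  # the string that should be looked up
```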
All the best, Nick.
* Nick Jenkins nickpj@gmail.com [2006-08-08 17:27:12 +1000]:
I.e. the NULL character, and the LF / linefeed character.
No idea if that is something that could or should be fixed or not.
Strange, thank you for the report. I will have a look at this problem.
I think this may just be the socket timing out. For example, if I do "telnet localhost <whatever-port-tcpQuery-is-on>", then after 1 second it times out and closes the connection. So no idea if it's a real problem or not to ignore NULL character + LF / linefeed character in the input, but I think that's what it's doing, and then the socket times out.
In fact, this is how I read the query from the socket. I use a library that waits for certain delimiters, and these two characters are not in the list. I set a timeout of 1s so that a connection cannot freeze the server, since it is not a multi-threaded application. I don't think it is a problem; I can add 0 and 10 to the list of delimiters.
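The behaviour described above can be sketched as follows, in Python rather than with the library that TcpQuery actually uses. The function `read_query`, its parameters, and the delimiter sets passed to it are all assumptions made for illustration. The reader collects bytes until it sees one of the configured delimiters, and gives up once the timeout expires with no delimiter seen.

```python
import socket


def read_query(conn, delimiters, timeout=1.0, max_len=4096):
    """Read bytes from conn until a delimiter byte arrives.

    Returns the bytes before the delimiter, or None if the timeout expires first.
    """
    conn.settimeout(timeout)
    buf = bytearray()
    try:
        while len(buf) < max_len:
            chunk = conn.recv(1)
            if not chunk:            # peer closed the connection
                break
            if chunk in delimiters:  # single byte found in the delimiter set
                return bytes(buf)
            buf += chunk
    except socket.timeout:
        return None                  # no delimiter seen in time
    return bytes(buf)


# Usage: a local socket pair stands in for the client and the server.
client, server = socket.socketpair()
client.sendall(b"test\n")
print(read_query(server, delimiters=b"\n\x00"))             # LF is a delimiter: returns at once
client.sendall(b"test\n")
print(read_query(server, delimiters=b"\r", timeout=0.2))    # LF is not: waits, then None
```

With a single-threaded server, the timeout is what bounds how long one stalled client can hold up everyone else.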
Also, I've made a few small changes to the web UI (extra directory), and the diff is at: http://files.nickj.org/MediaWiki/wikipedia-suggest-0.2-extra-diff.txt
It changes a few small things, namely:
- Added strict mode (for PHP5), changed from 'var' to 'private' to
keep PHP5 strict mode happy, and added quick accessor method for $res attribute.
- Remove leading "/" from paths for image directory and query.php.
This allows the web files to be placed in a subdirectory as well as the root directory, by using relative paths instead.
- When the user made a search (e.g. "fish"), then highlighted all
their search terms, then pressed delete (so that the search field was now blank), then pressed arrow up or arrow down, then it would show the old results (e.g. Fishing / FishBase / etc). To prevent this added an "if" clause to only show results when there is something in the query field.
- Added one JS var declaration for 'i' loop variable.
- Problem with quotes. For example, if I search for: "The devil", then
it returns many links, including "The Devil's Rejects", "The Devil's Advocate (film)", and so forth. However when I search for "The devil's", the results become empty, but we know from before that there are suitable matches. The reason is that the query string comes through as "The Devil\'s" (e.g. can see this at http://suggest.speedblue.org/query.php?query=The%20Devil%27s ), so one way to avoid this is just to use stripslashes in query.php, so I added this.
- I also tried experimenting with adding a "changeHighlight" function
and HTML IDs to each table row (to save redrawing the table when the user presses the up or down arrow keys), but it doesn't seem to make much difference either way (doesn't seem any slower or faster), so you might not want to use that.
Thank you for all these great changes, I will add them this evening on suggest.speedblue.org and put them in the tarball.
Best Regards. Julien Lemoine.
Hello,
I wrote a new version of Wikipedia suggest (version 0.3) which includes:
- an option to enable MemoryQuery (-m) in the TcpQuery command; you can also specify the number of threads that you want.
- a heuristic to choose the correct redirection to keep (based on similarity with the query)
- handling of articles with different capitalization (all the different capitalizations are kept)
- the patch from Nick Jenkins.
I will regenerate the index for English/French on suggest.speedblue.org tomorrow. You can download the sources now from: http://suggest.speedblue.org/tgz/wikipedia-suggest-0.3.tar.gz
I also looked at the MySQL tables (page and pagelinks), but I have two questions:
- is there a way to get the target of a redirection? There is a page_is_redirect flag on the page table, but I do not see any information about the target article
- is the URL available in a table?
If I can get these two pieces of information from the tables, I will write an SQL version of the analyzer. If they are not available, what would be the best way to write an analyzer for Wikipedia? (Work on the pages-articles.xml file, http://download.wikipedia.org/enwiki/20060717/enwiki-20060717-pages-articles.xml.bz2 ?)
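One possible approach if the XML dump is used: a redirect page's wikitext begins with `#REDIRECT [[Target]]`, so the target can be read from the page text itself. The sketch below, in Python, is only illustrative; `redirect_target` is a hypothetical helper, and it ignores localized redirect keywords and title normalization (underscores, first-letter capitalization).

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of the text, in any letter case.
# The capture stops at "]", "|" or "#", so section anchors and labels are dropped.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[Atomic mass]]"))
print(redirect_target("Adfa is a town in Wales."))
```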
Best Regards. Julien Lemoine
wikitech-l@lists.wikimedia.org