Hi Julien,
You can view results on http://suggest.speedblue.org (with redirections)
I like it! E.g. "formula we" shows the results as "Formula Weight → Atomic mass", so straight away you can see it's a redirect, plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a little. For example, in the English Wikipedia there is both "Adfa" (a town in Wales) and "ADFA" (a redirect to an Australian military training institute), but only the town in Wales is listed in the suggest search. Ideally, could it list both, with the town first (since it has one incoming link, and the institute has none), unless the user types "ADFA" in all caps, in which case "ADFA" could come first because it is a better match for what the user has typed? In other words, I still think the searching could be case-insensitive, with capitalization only affecting the ordering of the results.
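To make that suggestion concrete, here is a minimal Python sketch of case-insensitive prefix matching in which capitalization only affects the ordering. The `ARTICLES` list, the `suggest` function and the "Adelaide" entry with its link count are hypothetical illustrations; the real tool uses a precompiled automaton (fsa.bin) rather than a linear scan, so this shows only the ranking idea, not how it would be implemented in wikipedia-suggest.

```python
from typing import List, Tuple

# Hypothetical sample data: (title, incoming link count).
# "Adfa" and "ADFA" use the counts mentioned above; "Adelaide" is made up.
ARTICLES: List[Tuple[str, int]] = [
    ("Adfa", 1),
    ("ADFA", 0),
    ("Adelaide", 50),
]


def suggest(prefix: str,
            articles: List[Tuple[str, int]] = ARTICLES,
            limit: int = 10) -> List[str]:
    """Return titles matching `prefix` case-insensitively.

    Ordering: titles whose capitalization matches what was typed come
    first, then titles with more incoming links, then alphabetical.
    """
    folded = prefix.casefold()
    matches = [(title, links) for title, links in articles
               if title.casefold().startswith(folded)]
    matches.sort(key=lambda m: (not m[0].startswith(prefix), -m[1], m[0]))
    return [title for title, _ in matches[:limit]]


if __name__ == "__main__":
    print(suggest("adf"))
    print(suggest("ADFA"))
```

With this ordering, typing "adf" lists the town before the redirect, while typing "ADFA" in all caps puts the redirect first.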
You can download the source from http://suggest.speedblue.org/tgz/wikipedia-suggest-0.2.tar.gz
You can download compiled resources (for little-endian only, because my serialization is endianness-dependent):
Cool, I got it to do stuff with the precompiled files:
=========================================
root@bling:~/tmp/wikipedia-suggest/wikipedia-suggest-0.2/cmd# ./Query ../../EN/fsa.bin ../../EN/pages.bin
Loaded position of 2171116 articles
test

Title : [Test cricket] Freq : 5776
Title : [Testament, New → New Testament] Freq : 2017
Title : [Testament, Old → Old Testament] Freq : 1700
Title : [Testudines → Turtle] Freq : 527
Title : [Testosterone] Freq : 355
Title : [Test Pilot] Freq : 309
Title : [Test Messaging → Short message service] Freq : 289
Title : [Testicle] Freq : 276
Title : [Testprog → Development stage] Freq : 268
Title : [Testudinidae → Tortoise] Freq : 252
=========================================
Which kind of database is used for Wikipedia? MySQL?
Yes, MySQL. There is a port for Postgres in progress (try saying that fast 10 times), but you probably want to target MySQL first because it's the database that's being used at the moment in most environments.
- MemoryQuery loads the automaton and articles into memory; it is
thread-safe and you can use it to perform a lot of queries per second.
OK, then MemoryQuery is the one I want, as I am generally happy to trade RAM for speed. How do I tell it to use MemoryQuery mode instead of DiskQuery mode? E.g. do I need to compile cmd/query and cmd/TcpQuery with a special flag? Or is there a way to pass a command-line argument to cmd/query and cmd/TcpQuery to tell them to start in MemoryQuery rather than DiskQuery mode? (This would have the advantage of letting the user choose the most suitable mode at runtime, and it could default to disk-based querying so that the default behaviour stays the same as it is at the moment.)
- I added all the PHP/HTML/JS pages to the tgz (in the extra directory).
Thank you!
One small thing I've noticed is that it won't find articles/redirects that start with a quote - for example: http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/tcpQuery JavaScript response is not currently escaping quotes. For example, when I ask it for strings starting with " then tcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
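As an illustration of the kind of escaping meant here, this is a small Python sketch that turns the query into a safely quoted JavaScript string literal before it is placed in the response. It is only a sketch: `js_string_literal` and `empty_response` are hypothetical helper names, and the server itself is not written in Python. It relies on `json.dumps`, whose default output for a string is also a valid JavaScript string literal (quotes, backslashes and control characters are escaped, and non-ASCII characters are written as \uXXXX).

```python
import json


def js_string_literal(text: str) -> str:
    """Quote `text` so it can be embedded in JavaScript source."""
    return json.dumps(text)


def empty_response(query: str) -> str:
    """Build a no-results reply in the shape shown above.

    Only the empty-array form is sketched, because that is the only
    form quoted in this message.
    """
    return ("sendRes(" + js_string_literal(query)
            + ", new Array(), new Array(), new Array());")


if __name__ == "__main__":
    print(empty_response('"'))
    print(empty_response('"A" Device'))
```

The same escaping would also need to be applied to every title placed in the result arrays, since titles such as "A" Device contain quotes too.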
----
Also, I tried testing TcpQuery with random binary input, and it seemed sometimes to be much slower on some single-character input. For example, here is a log of it running :
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
=======================================
<?php

error_reporting(E_ALL | E_STRICT);

// function internals taken from extra/query.php
function getResults($content) {
    /* Create a TCP/IP socket. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) { // socket_create() returns false on failure
        echo "socket_create() failed: reason: " . socket_strerror(socket_last_error()) . "\n";
        return "";
    }

    $result = socket_connect($socket, "127.0.0.1", 666);
    if ($result === false) { // socket_connect() returns false on failure
        echo "socket_connect() failed.\nReason: " . socket_strerror(socket_last_error($socket)) . "\n";
        socket_close($socket);
        return "";
    }

    // Terminate the query with \r\n\r\n, as extra/query.php does.
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));
    $res = "";
    while ($out = socket_read($socket, 2048)) {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}

// Send every possible single-byte query and time each response.
for ($i = 0; $i <= 255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " . ($after - $before) . "\n";
}

?>
=======================================
The slow ones to respond (with 0 chars back) were:
=======================================
i=0 got 0 chars back : Time = 1.01103210449
^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NUL character (byte 0) and the LF / linefeed character (byte 10).
No idea if that is something that could or should be fixed or not.
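If it does turn out to be worth handling, one possible stop-gap on the client side would be to drop control bytes from the query before it is sent. The Python sketch below assumes, without having checked the server code, that control bytes are never meaningful in a title prefix; `sanitize_query` and `is_queryable` are made-up helper names, and why the server takes about a second on these two bytes is not established here.

```python
def sanitize_query(raw: bytes) -> bytes:
    """Drop ASCII control bytes (0x00-0x1F and 0x7F) from a query.

    Bytes >= 0x80 are kept so that UTF-8 encoded titles survive.
    """
    return bytes(b for b in raw if b >= 0x20 and b != 0x7F)


def is_queryable(raw: bytes) -> bool:
    """Return True if anything is left to send after sanitizing."""
    return len(sanitize_query(raw)) > 0


if __name__ == "__main__":
    for sample in (b"\x00", b"\n", b"te\nst", "é".encode("utf-8")):
        print(sample, sanitize_query(sample), is_queryable(sample))
```

A client could skip the request entirely when `is_queryable` returns False, which would avoid the slow case for a lone NUL or LF.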
All the best, Nick.