New subject: Wikipedia Suggest

7 Aug 2006

      Hi Julien,
...
You can view results on http://suggest.speedblue.org (with redirections)
I like it! E.g. "formula we" shows the results as "Formula Weight
→ Atomic mass", so straight away you can see it's a redirect,
plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a
little. For example, in the English Wikipedia there is both "Adfa" (a
town in Wales) and "ADFA" (a redirect to an Australian military
training institute), but only the town in Wales is listed in the
suggest search. Ideally, could it list both, with the town first
(since it has one incoming link, and the institute has none), unless
the user types "ADFA" in all-caps, in which case "ADFA" could come
first because it is a better match for what the user has typed? In
other words, I still think the searching could be case-insensitive,
with only the ordering of the results affected by capitalization.
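To make the ordering idea concrete, here is a rough sketch in Python. It is not how wikipedia-suggest works internally (that uses a compiled automaton); the function name `suggest` and the list of (title, incoming links) pairs are made up purely for illustration:

```python
def suggest(prefix, titles, limit=10):
    """Case-insensitive prefix search; capitalization only affects ordering.

    `titles` is a hypothetical list of (title, incoming_links) pairs.
    """
    folded = prefix.casefold()
    # Match regardless of case.
    matches = [(t, n) for t, n in titles if t.casefold().startswith(folded)]
    # Order: titles matching the typed capitalization exactly come first,
    # then titles with more incoming links, then alphabetical order.
    matches.sort(key=lambda m: (not m[0].startswith(prefix), -m[1], m[0]))
    return [t for t, _ in matches[:limit]]


titles = [("Adfa", 1), ("ADFA", 0)]
print(suggest("adf", titles))   # ['Adfa', 'ADFA']
print(suggest("ADFA", titles))  # ['ADFA', 'Adfa']
```

With lower-case input both entries are listed and the link count decides the order; typing "ADFA" in all-caps moves the redirect to the top.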
...
You can download the source from
http://suggest.speedblue.org/tgz/wikipedia-suggest-0.2.tar.gz
You can download compiled resources (for little endian only because my
serialization is endianness-dependent):
Cool, I got it to do stuff with the precompiled files:
=========================================
root@bling:~/tmp/wikipedia-suggest/wikipedia-suggest-0.2/cmd# ./Query ../../EN/fsa.bin ../../EN/pages.bin
Loaded position of 2171116 articles
test
Title : [Test cricket] Freq : 5776
Title : [Testament, New &rarr; New Testament] Freq : 2017
Title : [Testament, Old &rarr; Old Testament] Freq : 1700
Title : [Testudines &rarr; Turtle] Freq : 527
Title : [Testosterone] Freq : 355
Title : [Test Pilot] Freq : 309
Title : [Test Messaging &rarr; Short message service] Freq : 289
Title : [Testicle] Freq : 276
Title : [Testprog &rarr; Development stage] Freq : 268
Title : [Testudinidae &rarr; Tortoise] Freq : 252
=========================================
...
Which kind of database is used for Wikipedia, MySQL?
Yes, MySQL. There is a port for Postgres in progress (try saying that
fast 10 times), but you probably want to target MySQL first because
it's the database that's being used at the moment in most
environments.
...

MemoryQuery loads the automaton and articles into memory; it is
thread-safe, and you can use it to perform a lot of queries per second.
OK, then MemoryQuery is the one I want, as I am generally happy to
trade RAM for speed. How do I tell it to use MemoryQuery mode instead
of DiskQuery mode? E.g. do I need to compile cmd/Query and cmd/TcpQuery
with a special flag? Or is there a way to pass a command-line argument
to cmd/Query and cmd/TcpQuery to tell them to start in MemoryQuery
rather than DiskQuery mode? (This would have the advantage of letting
the user choose the most suitable mode at runtime, and it could
default to disk-based querying so that the default behaviour stays
the same as it is at the moment.)
...

I added all PHP/HTML/JS pages in the tgz (in the extra directory);
Thank you!
One small thing I've noticed is that it won't find articles/redirects
that start with a quote - for example:
http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/TcpQuery JavaScript response
is not currently escaping quotes. For example, when I ask it for
strings starting with " then TcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
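As a rough illustration of the escaping I mean, here is a small Python sketch. The helper names `js_string` and `build_response` are invented for this example and are not part of TcpQuery; it only covers the empty-result case shown above and leans on the fact that a JSON string is also a valid JavaScript string literal:

```python
import json


def js_string(text):
    # json.dumps emits a double-quoted literal in which embedded quotes
    # and backslashes are backslash-escaped, and non-ASCII is \u-escaped.
    return json.dumps(text)


def build_response(query):
    # Hypothetical helper: mimics only the empty-result response format.
    return "sendRes(%s, new Array(), new Array(), new Array());" % js_string(query)


print(build_response('"'))
# sendRes("\"", new Array(), new Array(), new Array());
```

The same treatment would presumably be needed for any titles placed inside the result arrays, since those can contain quotes too.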
----
Also, I tried testing TcpQuery with random binary input, and it
sometimes seemed to be much slower on certain single-character inputs.
For example, here is a log of it running:
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
=======================================
<?php
error_reporting (E_ALL | E_STRICT);
// function internals taken from extra/query.php
function getResults($content) {
    /* Create a TCP/IP socket. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        // socket_create() returns false on failure, not a negative number
        echo "socket_create() failed: reason: " .
            socket_strerror(socket_last_error()) . "\n";
        return "";
    }
    /* Connect to the local TcpQuery server. */
    $result = socket_connect($socket, "127.0.0.1", 666);
    if ($result === false) {
        echo "socket_connect() failed.\nReason: " .
            socket_strerror(socket_last_error($socket)) . "\n";
        socket_close($socket);
        return "";
    }
    /* Send the query, followed by "\r\n\r\n". */
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));
    /* Read until the connection is closed or a read fails. */
    $res = "";
    while (($out = socket_read($socket, 2048)) !== false && $out !== "") {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}
/* Time a one-byte query for every possible byte value. */
for ($i=0; $i<=255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " .
        ($after - $before) . "\n";
}
?>
=======================================
The slow ones to respond (with 0 chars back) were:
=======================================
i=0 got 0 chars back : Time = 1.01103210449
^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NUL character (byte 0) and the LF / linefeed character (byte 10).
No idea if that is something that could or should be fixed or not.
All the best,
Nick.

Re: [Wikitech-l] Wikipedia Suggest
