Re: [Wikitech-l] Wikipedia Suggest

7 Aug 2006

Hello Nick,
Nick Jenkins wrote:
...
I like it! E.g. "formula we" shows the results as "Formula Weight
→ Atomic mass", so straight away you can see it's a redirect,
plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a
little; for example, in the English Wikipedia there is both "Adfa" (a
town in Wales), and "ADFA" (a redirect to an Australian military
training institute), but only the town in Wales is listed in the
suggest search. Ideally, could it maybe list both, with the town first
(since it has one incoming link, and the institute has none), perhaps
unless the user types "ADFA" in all-caps, in which case maybe "ADFA"
could come first because it would be a better match for what the user
has typed? In other words, I still think the searching could be
case-insensitive, just that maybe the ordering of the results could be
altered by capitalization.
I will keep the matching case-insensitive but will keep all the different
capitalizations; it will be in version 0.3 :)
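
The ordering discussed above could be sketched roughly as follows. This is a minimal illustration, not the wikipedia-suggest implementation: the function name rankSuggestions, the {title, incomingLinks} record shape, and the sample link counts are assumptions made for the example. Matching stays case-insensitive; a candidate whose prefix matches the typed capitalization exactly is listed first, and remaining ties are broken by incoming link count.

```javascript
// Hypothetical sketch: case-insensitive prefix match with
// capitalization-aware ordering. Names and data shape are assumptions.
function rankSuggestions(query, articles) {
  const lowered = query.toLowerCase();
  return articles
    // Keep every title whose lowercased form starts with the lowercased query.
    .filter((a) => a.title.toLowerCase().startsWith(lowered))
    .sort((a, b) => {
      // Titles that match the typed capitalization exactly come first.
      const aExact = a.title.startsWith(query) ? 1 : 0;
      const bExact = b.title.startsWith(query) ? 1 : 0;
      if (aExact !== bExact) return bExact - aExact;
      // Otherwise prefer the title with more incoming links.
      return b.incomingLinks - a.incomingLinks;
    })
    .map((a) => a.title);
}

const articles = [
  { title: "Adfa", incomingLinks: 1 },
  { title: "ADFA", incomingLinks: 0 },
];
console.log(rankSuggestions("adfa", articles)); // both titles, "Adfa" first
console.log(rankSuggestions("ADFA", articles)); // both titles, "ADFA" first
```

With this ordering both capitalizations are always listed, and typing in all-caps only changes which one is shown first.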
...
...
Which kind of database is used for Wikipedia, MySQL?
Yes, MySQL. There is a port for Postgres in progress (try saying that
fast 10 times), but you probably want to target MySQL first because
it's the database that's being used at the moment in most
environments.
OK, thanks for the clarification. I will work on this next week.
...
...

MemoryQuery loads the automaton and the articles into memory; it is
thread-safe, and you can use it to perform many queries per second.
OK, then MemoryQuery is the one I want, as I am generally happy to
trade RAM for speed. How do I tell it to use MemoryQuery mode instead of
DiskQuery mode? E.g. do I need to compile cmd/query and cmd/TcpQuery
with a special flag? Or is there a way to pass a command line argument
to cmd/query and cmd/TcpQuery to tell it to start in MemoryQuery
rather than DiskQuery mode? (This would have the advantage of letting
the user choose the most suitable mode at runtime, and it could
default to the disk-based querying so that the default behaviour stays
the same as it is at the moment.)
You are right, I will add a command line argument to choose between disk or
memory implementation.
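
As an illustration of the runtime choice described above, here is a minimal sketch of argument handling that defaults to disk-based querying. It is not the wikipedia-suggest code: the flag name --mode and the values "disk" and "memory" are invented for the example, and the real tools may end up with a different syntax.

```javascript
// Hypothetical sketch: pick the query backend from a command-line argument.
// The flag name "--mode" and its values are assumptions, not the real CLI.
function parseQueryMode(argv) {
  let mode = "disk"; // default keeps the current disk-based behaviour
  for (const arg of argv) {
    if (arg.startsWith("--mode=")) {
      mode = arg.slice("--mode=".length);
    }
  }
  if (mode !== "disk" && mode !== "memory") {
    throw new Error(`unknown mode "${mode}", expected "disk" or "memory"`);
  }
  return mode;
}

console.log(parseQueryMode([]));                // disk
console.log(parseQueryMode(["--mode=memory"])); // memory
```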
...
...

I added all the PHP/HTML/JS pages to the tgz (in the extra directory).

Thank you!
One small thing I've noticed is that it won't find articles/redirects
that start with a quote - for example:
http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/tcpQuery JavaScript response
is not currently escaping quotes. For example, when I ask it for
strings starting with " then tcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
Oops, I will also fix that :)
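
For illustration, the escaping could look something like the sketch below, written in JavaScript rather than in the server's own language. The function name buildResponse is hypothetical, and only the empty-result case from the example above is covered. It relies on JSON.stringify, which emits a double-quoted string literal with quotes and backslashes escaped.

```javascript
// Hypothetical sketch: build the sendRes(...) response with the query
// escaped, so that a query containing a quote yields valid JavaScript.
function buildResponse(query) {
  // JSON.stringify('"') yields the four characters "\""
  return "sendRes(" + JSON.stringify(query) +
    ", new Array(), new Array(), new Array());";
}

console.log(buildResponse('"'));
// prints: sendRes("\"", new Array(), new Array(), new Array());
```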
...

Also, I tried testing TcpQuery with random binary input, and it seemed
sometimes to be much slower on some single-character input. For
example, here is a log of it running :
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
<?php

error_reporting(E_ALL | E_STRICT);

// Function internals taken from extra/query.php.
function getResults($content) {
    /* Create a TCP/IP socket; socket_create() returns false on failure. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        echo "socket_create() failed: reason: " .
            socket_strerror(socket_last_error()) . "\n";
        return "";
    }

    /* socket_connect() also returns false on failure. */
    if (socket_connect($socket, "127.0.0.1", 666) === false) {
        echo "socket_connect() failed.\nReason: " .
            socket_strerror(socket_last_error($socket)) . "\n";
        socket_close($socket);
        return "";
    }

    /* The query is terminated by a blank line. */
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));
    $res = "";
    /* Read until the server closes the connection. */
    while (($out = socket_read($socket, 2048)) !== false && $out !== "") {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}

// Send every possible single-byte query and time each response.
for ($i = 0; $i <= 255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " .
        ($after - $before) . "\n";
}

?>
=======================================
The slow ones to respond (with 0 chars back) were:
i=0 got 0 chars back : Time = 1.01103210449
^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NUL character, and the LF / linefeed character.
No idea if that is something that could or should be fixed or not.
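
Whatever the cause on the server side, a client could avoid sending these two bytes in the first place. The sketch below is only a client-side guard under that assumption; the function name hasProblemBytes is hypothetical, and it neither explains nor fixes the slow responses themselves.

```javascript
// Hypothetical client-side guard: returns true if the query contains
// the NUL or LF byte, the two inputs reported as slow above.
function hasProblemBytes(query) {
  return /[\0\n]/.test(query);
}

console.log(hasProblemBytes("formula we")); // false
console.log(hasProblemBytes("\n"));         // true
```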
Strange, thank you for the report. I will have a look at this problem.
Thank you very much for your comments; they will help me to improve
wikipedia-suggest.
Best Regards.
Julien Lemoine
