Hello Nick,
Nick Jenkins wrote:
I like it! E.g. "formula we" shows the results as "Formula Weight → Atomic mass", so straight away you can see it's a redirect, plus what it's a redirect to. That's very good indeed.
Handling of articles with different capitalization still concerns me a little. For example, in the English Wikipedia there is both "Adfa" (a town in Wales) and "ADFA" (a redirect to an Australian military training institute), but only the town in Wales is listed in the suggest search. Ideally, could it list both, with the town first (since it has one incoming link, and the institute has none), unless the user types "ADFA" in all-caps, in which case "ADFA" could come first because it would be a better match for what the user has typed? In other words, I still think the searching could be case-insensitive, just that the ordering of the results could be altered by capitalization.
I will keep the matching case-insensitive but will list all the different capitalizations; this will be in version 0.3 :)
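The ordering Nick describes can be sketched as follows. This is only an illustration in JavaScript, not the wikipedia-suggest implementation: the `suggestTitles` function and the entry layout are assumptions made for the example, and only the two link counts for "Adfa" and "ADFA" come from the discussion above.

```javascript
// Hypothetical title list. The counts for "Adfa" and "ADFA" follow the
// example in the thread; the third entry and its count are made up.
const sampleEntries = [
  { title: "ADFA", incomingLinks: 0 },
  { title: "Adfa", incomingLinks: 1 },
  { title: "Atomic mass", incomingLinks: 5 },
];

// Case-insensitive prefix match that keeps every capitalization.
// Titles whose prefix matches the typed case exactly are listed first;
// remaining ties are broken by the number of incoming links.
function suggestTitles(prefix, entries) {
  const lowered = prefix.toLowerCase();
  return entries
    .filter((e) => e.title.toLowerCase().startsWith(lowered))
    .sort((a, b) => {
      const aExact = a.title.startsWith(prefix) ? 1 : 0;
      const bExact = b.title.startsWith(prefix) ? 1 : 0;
      if (aExact !== bExact) return bExact - aExact;
      return b.incomingLinks - a.incomingLinks;
    })
    .map((e) => e.title);
}

console.log(suggestTitles("adfa", sampleEntries)); // [ 'Adfa', 'ADFA' ]
console.log(suggestTitles("ADFA", sampleEntries)); // [ 'ADFA', 'Adfa' ]
```

With this ordering a lower-case query falls back to the link count, while an all-caps query promotes the all-caps title.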
Which kind of database is used for Wikipedia, MySQL?
Yes, MySQL. There is a port for Postgres in progress (try saying that fast 10 times), but you probably want to target MySQL first because it's the database that's being used at the moment in most environments.
OK, thanks for the clarification. I will work on this next week.
- MemoryQuery loads the automaton and the articles into memory; it is
thread-safe and you can use it to perform a lot of queries per second.
OK, then MemoryQuery is the one I want, as I am generally happy to trade RAM for speed. How do I tell it to use MemoryQuery mode instead of DiskQuery mode? E.g. do I need to compile cmd/query and cmd/TcpQuery with a special flag? Or is there a way to pass a command line argument to cmd/query and cmd/TcpQuery to tell it to start in MemoryQuery rather than DiskQuery mode? (This would have the advantage of letting the user choose the most suitable mode at runtime, and it could default to disk-based querying so that the default behaviour stays the same as it is at the moment.)
You are right, I will add a command line argument to choose between the disk and memory implementations.
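As a rough illustration of the runtime choice discussed here, the selection logic could look like the sketch below. It is written in JavaScript only for brevity; the `--mode=` flag name and the `parseQueryMode` helper are hypothetical and do not exist in cmd/query or cmd/TcpQuery. The default stays on disk so that existing behaviour does not change.

```javascript
// Hypothetical flag "--mode=disk|memory"; the real option may be
// spelled differently once it is added.
const VALID_MODES = ["disk", "memory"];

function parseQueryMode(argv) {
  let mode = "disk"; // default keeps the current disk-based behaviour
  for (const arg of argv) {
    if (arg.startsWith("--mode=")) {
      mode = arg.slice("--mode=".length);
    }
  }
  if (!VALID_MODES.includes(mode)) {
    throw new Error(`unknown mode "${mode}", expected one of: ${VALID_MODES.join(", ")}`);
  }
  return mode;
}

console.log(parseQueryMode([]));                // disk
console.log(parseQueryMode(["--mode=memory"])); // memory
```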
- I added all the php/html/js pages to the tgz (in the extra directory);
Thank you!
One small thing I've noticed is that it won't find articles/redirects that start with a quote - for example: http://en.wikipedia.org/wiki/%22A%22_Device
One reason for this may be that the cmd/tcpQuery JavaScript response is not currently escaping quotes. For example, when I ask it for strings starting with " then tcpQuery returns this:
sendRes(""", new Array(), new Array(), new Array());
Whereas it probably wants to return this:
sendRes("\"", new Array(), new Array(), new Array());
Oops, I will also fix that :)
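One way to produce a safely escaped response is to let a JSON encoder write the string literal. The sketch below shows the idea in JavaScript; `buildSendRes` is a hypothetical helper, not the tcpQuery code, and it only covers the first argument, leaving the three arrays empty as in the example above.

```javascript
// Build the "sendRes(...)" line for a query string.
// JSON.stringify emits a double-quoted literal with quotes,
// backslashes and control characters escaped, and that literal is
// also a valid JavaScript string literal.
function buildSendRes(query) {
  const literal = JSON.stringify(query);
  return `sendRes(${literal}, new Array(), new Array(), new Array());`;
}

console.log(buildSendRes('"'));
// sendRes("\"", new Array(), new Array(), new Array());
```

The same approach also handles backslashes and newlines in the query, which a hand-written replacement of `"` alone would miss.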
Also, I tried testing TcpQuery with random binary input, and it sometimes seemed to be much slower on certain single-character inputs. For example, here is a log of it running:
Query of 2 chars : got 55 chars back : Time = 0.00678396224976
Query of 9 chars : got 62 chars back : Time = 0.00735902786255
Query of 1 chars : got 0 chars back : Time = 1.01217985153
    ^^^^^ slow line ^^^^
Query of 9 chars : got 62 chars back : Time = 0.00748610496521
Query of 10 chars : got 63 chars back : Time = 0.00757598876953
Query of 8 chars : got 61 chars back : Time = 0.00975489616394
Then a simple loop to track down the slow 1-character tests:
<?php
error_reporting(E_ALL | E_STRICT);

// Function internals taken from extra/query.php.
function getResults($content) {
    /* Create a TCP/IP socket. */
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    // socket_create() returns false on failure, not a negative number.
    if ($socket === false) {
        echo "socket_create() failed: reason: " . socket_strerror(socket_last_error()) . "\n";
        return "";
    }

    $result = socket_connect($socket, "127.0.0.1", 666);
    if ($result === false) {
        echo "socket_connect() failed.\nReason: " . socket_strerror(socket_last_error($socket)) . "\n";
        socket_close($socket);
        return "";
    }

    // A query is terminated by an empty line.
    $in = "$content\r\n\r\n";
    socket_write($socket, $in, strlen($in));

    // Read until the server closes the connection.
    $res = "";
    while ($out = socket_read($socket, 2048)) {
        $res .= $out;
    }
    socket_close($socket);
    return $res;
}

// Send every possible single-byte query and time each response.
for ($i = 0; $i <= 255; $i++) {
    $query = chr($i);
    print "i=$i ";
    $before = microtime(true);
    $results = getResults($query);
    $after = microtime(true);
    print "got " . strlen($results) . " chars back : Time = " . ($after - $before) . "\n";
}
?>
=======================================
The slow ones to respond (with 0 chars back) were:
i=0 got 0 chars back : Time = 1.01103210449
    ^^^^
i=1 got 54 chars back : Time = 0.00667715072632
i=2 got 54 chars back : Time = 0.00674390792847
i=3 got 54 chars back : Time = 0.00663805007935
i=4 got 54 chars back : Time = 0.00671792030334
i=5 got 54 chars back : Time = 0.00660610198975
i=6 got 54 chars back : Time = 0.00669598579407
i=7 got 54 chars back : Time = 0.00663185119629
i=8 got 54 chars back : Time = 0.00668096542358
i=9 got 54 chars back : Time = 0.0066249370575
i=10 got 0 chars back : Time = 1.01135993004
    ^^^^
i=11 got 54 chars back : Time = 0.00664806365967
i=12 got 54 chars back : Time = 0.00664210319519
....
=======================================
I.e. the NUL character (i=0) and the LF / linefeed character (i=10).
No idea if that is something that could or should be fixed or not.
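For longer runs, the slow responses can be picked out of such a log mechanically. The following is a minimal JavaScript sketch: the `findSlowQueries` helper and the 0.5 second threshold are assumptions chosen for illustration, and the sample lines are copied from the log above.

```javascript
// A few lines copied from the timing log above.
const sampleLog = [
  "i=0 got 0 chars back : Time = 1.01103210449",
  "i=1 got 54 chars back : Time = 0.00667715072632",
  "i=9 got 54 chars back : Time = 0.0066249370575",
  "i=10 got 0 chars back : Time = 1.01135993004",
  "i=11 got 54 chars back : Time = 0.00664806365967",
];

// Return the byte values whose response took longer than the threshold
// (in seconds). Lines that do not match the log format are ignored.
function findSlowQueries(lines, thresholdSeconds = 0.5) {
  const pattern = /^i=(\d+) got (\d+) chars back : Time = ([\d.]+)$/;
  const slow = [];
  for (const line of lines) {
    const match = pattern.exec(line.trim());
    if (match && parseFloat(match[3]) > thresholdSeconds) {
      slow.push(Number(match[1]));
    }
  }
  return slow;
}

console.log(findSlowQueries(sampleLog)); // [ 0, 10 ]
```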
Strange, thank you for the report. I will have a look at this problem. Thank you very much for your comments, it will help me to improve wikipedia-suggest.
Best Regards.
Julien Lemoine