Hello,
I changed TcpQuery to send its output as JSON (without the URL) and added Nick Jenkins's great autocomplete feature. Results for English and French are available at http://suggest.speedblue.org/ (this includes the heuristic for choosing the correct redirection and the handling of articles whose titles differ only in capitalization).
The format of fsa.bin and articles.bin has changed slightly, so you need to download them again:
English: http://www2.speedblue.org/download/WikipediaSuggestCompiledEN.tar.bz2
French: http://www2.speedblue.org/download/WikipediaSuggestCompiledFR.tar.bz2
The latest version of the sources, with all these modifications, is available at: http://suggest.speedblue.org/tgz/wikipedia-suggest-0.31.tar.gz
I hope you will enjoy all these modifications and I am open to any kind of suggestion/modification.
I ran a benchmark of TcpQuery with the MemoryQuery backend and 5 threads on my computer (Pentium D930), using 10 client threads to simulate queries. It handled 154000 random queries in 24.7 seconds at 100% CPU usage (about 6234 queries per second).
I plan to write an analyzer more specific to Wikipedia, but for the moment I do not know how to get the titles/redirections/links. Do you know how to get the target of redirections from the SQL database? Do you think it is acceptable to take pages-articles.xml.bz2 and update the index every month?
Best regards,
Julien Lemoine
Nick Jenkins wrote:
> But the URL needs to be added since it is different from the title
Yep, but you can work out the url from the title:
function titleToUrl(title) {
    var chr, url = "";
    for (var i = 0; i < title.length; i++) {
        chr = title.charCodeAt(i);
        // spaces become underscores; everything else goes through escape()
        url += (chr == 32 ? "_" : escape(String.fromCharCode(chr)));
    }
    return url;
}
// quick test:
var test_data = ["Roman Catholic Church", "cat (disambig)", "\"!@$^&*))(_--{}"];
for (var i = 0; i < test_data.length; i++) {
    document.write(test_data[i] + " equals: " + titleToUrl(test_data[i]) + " <br>\n");
}
Output is:
Roman Catholic Church equals: Roman_Catholic_Church <br>
cat (disambig) equals: cat_%28disambig%29 <br>
"!@$^&*))(_--{} equals: %22%21@%24%5E%26*%29%29%28_--%7B%7D <br>
(which seems identical to what Wikipedia gives too).
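One caveat for the French index: escape() encodes accented Latin-1 characters as a single byte (é becomes %E9) and anything above that as %uXXXX, which is not UTF-8 percent-encoding. The following is only a sketch of a UTF-8 variant built on the standard encodeURIComponent; the name titleToUrlUtf8 is made up for illustration, and it assumes the target URLs are UTF-8 encoded. Note that encodeURIComponent leaves ! ' ( ) * unescaped and does escape @, so its output for the punctuation test cases differs from the escape() version.

```javascript
// Hypothetical UTF-8 variant of titleToUrl (name and approach are assumptions,
// not part of the original code).
function titleToUrlUtf8(title) {
    // Replace spaces with underscores first, then percent-encode the rest as UTF-8.
    return encodeURIComponent(title.replace(/ /g, "_"));
}

// Example usage with an ASCII title and an accented French word.
var samples = ["Roman Catholic Church", "Électricité"];
for (var i = 0; i < samples.length; i++) {
    console.log(samples[i] + " equals: " + titleToUrlUtf8(samples[i]));
}
```

Whether this or the escape() version is the right one depends on which encoding the wiki expects in its URLs.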
Don't have to do it this way though, and if you'd prefer to do it on the server side, then do that.
I just thought that transmitting less data and potentially storing less data might help.
> Did you use JSON in ECMAScript/JavaScript?
Nah, I just make this stuff up as I go along. ;-)
Should work fine though:
var json_data = eval('["cat",["Catholics", 7505, "Roman Catholic Church"],["Catholic Archibishop", 4484, "Bishop"],["Catholic", 4200, ],["Catholic", 3269, ],["CATV", 2347, "Cable television"],["Catalogue astrographique", 2095, "Star catalogue"],["Catholic Encyclopedia", 1956, ],["Catalonia", 1740, ],["Cattle", 1604, ],["Catholicism", 1527, ]]');
alert("length: " + json_data.length + " data: " + json_data);
I.e. you may have to get rid of the newlines in the data stream.
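As an alternative to eval, here is a rough sketch of reading such a response with the standard JSON.parse, which accepts newlines between tokens. It rests on several assumptions: the field names (title, score, redirect) are guesses at what the three array slots mean, parseSuggestions is a made-up helper, and the response text is cut down to two complete entries from the sample above. Strict JSON also rejects an empty trailing slot such as ["Catholic", 4200, ], so the server would need to emit a value (for example "") there.

```javascript
// Hypothetical response text, reduced to two complete entries from the sample,
// with newlines left in to show that JSON.parse tolerates them between tokens.
var responseText = '["cat",\n' +
    '["Catholics", 7505, "Roman Catholic Church"],\n' +
    '["CATV", 2347, "Cable television"]]';

// Hypothetical helper: the first element is taken to be the query string,
// the remaining elements are taken to be [title, score, redirect] triples.
function parseSuggestions(text) {
    var data = JSON.parse(text);
    var suggestions = [];
    for (var i = 1; i < data.length; i++) {
        suggestions.push({
            title: data[i][0],
            score: data[i][1],            // meaning of this number is an assumption
            redirect: data[i][2] || null  // empty or missing target becomes null
        });
    }
    return { query: data[0], suggestions: suggestions };
}

var parsed = parseSuggestions(responseText);
console.log("query: " + parsed.query + ", suggestions: " + parsed.suggestions.length);
```

Unlike eval, JSON.parse never executes the text as code, at the cost of requiring the server output to be strictly valid JSON.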
All the best, Nick.
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l