Hi Julien,
I totally rewrote the compiler to work with the XML format of download.wikipedia.org.
Sounds good, and I really like it having the top 10 languages.
I think _maybe_ though it could be useful to have two variations of the Analyzer (like you have two variations of TcpQuery - one that uses DiskQuery, and one that uses MemoryQuery). With Analyzer though, it could be good to have one that connects to MySQL and gets the data directly from the database, and one that uses the downloaded XML dumps. This way, people can use whichever one is most appropriate for them. For example, for someone running a big MediaWiki site who wanted to look at the possibility of using suggestion searching, they probably wouldn't want to create an XML dump, then run Analyzer on the XML dump (this would be too slow, and too many steps, and take a lot of disk space). Rather, if possible, in that situation it would be nice to create the compiled files directly from the database.
To try and help with this, I've modified a copy of Analyzer.cpp to add basic importing from MySQL (i.e. it does not use any downloaded files). So far it only imports the article names, not redirects or article counts. The rough file (which still needs work for redirects + article counts) is here: http://files.nickj.org/MediaWiki/MysqlAnalyzerCmd.cpp Please note that I have not used C or C++ in a _very_ long time, so if it looks like I have done something silly then that is almost certainly correct. :-)
To compile and run this on a Debian/Ubuntu system, I did this:
# Install required MySQL libraries
apt-get install libmysqlclient15-dev
cd cmd

# Compile:
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../expat/lib -g -O2 -O3 -MT MysqlAnalyzerCmd.o -MD -MP -MF ".deps/MysqlAnalyzerCmd.Tpo" -c -o MysqlAnalyzerCmd.o MysqlAnalyzerCmd.cpp

# Link: (Note: needs " -lmysqlclient" parameter)
g++ -g -O2 -O3 -o MysqlAnalyzer -L../tools -L../serialization -L../analyzer MysqlAnalyzerCmd.o -lanalyzer -lserialization -ltools -lexpat -lglib-2.0 -lmysqlclient

# Run (change hostname / username / password / database-name params as required) :
./MysqlAnalyzer localhost wikiuser FakePasswd wikidb
If it is working, it should print out something like this:
-----------------------------
Connection success
Found 12345 articles
-----------------------------
Then use the .bin files as per usual on TcpQuery.
Also there is a small diff for WSuggest.js to fix a small problem in my autocomplete stuff. For example, suppose the user typed "Aer", then moved the text cursor back to be between the 'A' and the 'e', typed 'm' (to make "Amer"), then typed 'p' (to try and spell 'Amper'). In between typing the 'm' and the 'p', the cursor position would jump to the end of the text box to try and autocomplete "American", so the result of pressing 'p' would be 'Amerp', not 'Amper'. To prevent this, it will now only try to autocomplete if the cursor position is at the end of the text field. Diff is here: http://files.nickj.org/MediaWiki/WSuggest.js-0.4-autocomplete-update.txt
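In case it helps to see the idea without reading the diff, here is a rough sketch of the check. This is not the actual WSuggest.js code: caretIsAtEnd and applyAutocomplete are hypothetical names made up for illustration, and it assumes the text field exposes the standard selectionStart / selectionEnd properties (older Internet Explorer versions would need a different way of reading the cursor position).

```javascript
// Hypothetical helper (not from WSuggest.js): true only when the caret sits
// at the very end of the field and no text is selected.
function caretIsAtEnd(field) {
  var len = field.value.length;
  return field.selectionStart === len && field.selectionEnd === len;
}

// Hypothetical helper: completes the field with `suggestion`, but only when
// the caret is at the end, so typing in the middle of the text is left alone.
// Returns true if the field was changed.
function applyAutocomplete(field, suggestion) {
  var typed = field.value;
  if (!caretIsAtEnd(field)) {
    return false; // user is editing mid-text: do not move the cursor
  }
  if (suggestion.length <= typed.length) {
    return false; // nothing to add
  }
  if (suggestion.toLowerCase().indexOf(typed.toLowerCase()) !== 0) {
    return false; // suggestion does not start with what was typed
  }
  // Keep what the user typed and append the rest of the suggestion.
  field.value = typed + suggestion.slice(typed.length);
  // Select the appended part so the next keystroke replaces it.
  field.selectionStart = typed.length;
  field.selectionEnd = field.value.length;
  return true;
}
```

With the "Amer" example above, the cursor is at position 2 after typing the 'm', so applyAutocomplete would return false and leave the field alone, and the following 'p' gives "Amper" as intended.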
All the best, Nick.