Hi Nick,
Nick Jenkins wrote:
Sounds good, and I really like it having the top 10 languages.
I think _maybe_ though it could be useful to have two variations of the
Analyzer (like you have two variations of TcpQuery - one that uses
DiskQuery, and one that uses MemoryQuery). With Analyzer though, it could
be good to have one that connects to MySQL and gets the data directly from
the database, and one that uses the downloaded XML dumps. This way, people
can use whichever one is most appropriate for them. For example, someone
running a big MediaWiki site who wanted to look at the possibility of
using suggestion searching probably wouldn't want to create an XML dump
and then run Analyzer on it (this would be too slow, involve too many
steps, and take a lot of disk space). Rather, if possible, in that
situation it would be nice to create the compiled files directly from the
database.
To try and help with this, I've modified a copy of Analyzer.cpp to add
basic importing (but just of the article names, not redirects or article
counts) from MySQL (i.e. does not use any downloaded files). The rough
file (which still needs work for redirects + article counts) is here:
http://files.nickj.org/MediaWiki/MysqlAnalyzerCmd.cpp
Please note that I have not used C or C++ in a _very_ long time, so if it
looks like I have done something silly then that is almost certainly
correct. :-)
To compile and run this, on a Debian/Ubuntu system, I did this:
# Install required MySQL libraries
apt-get install libmysqlclient15-dev
cd cmd
# Compile:
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../expat/lib -g -O2 -O3 \
    -MT MysqlAnalyzerCmd.o -MD -MP -MF ".deps/MysqlAnalyzerCmd.Tpo" \
    -c -o MysqlAnalyzerCmd.o MysqlAnalyzerCmd.cpp
# Link: (Note: needs " -lmysqlclient" parameter)
g++ -g -O2 -O3 -o MysqlAnalyzer -L../tools -L../serialization -L../analyzer \
    MysqlAnalyzerCmd.o -lanalyzer -lserialization -ltools -lexpat -lglib-2.0 -lmysqlclient
# Run (change hostname / username / password / database-name params as required) :
./MysqlAnalyzer localhost wikiuser FakePasswd wikidb
If it is working, it should print out something like this:
-----------------------------
Connection success
Found 12345 articles
-----------------------------
Then use the .bin files as per usual on TcpQuery.
Thank you very much for your contribution, Nick. It is indeed better to
have the two versions. I preferred working on XML dumps at the beginning
since it was faster and easier for me to update wikipedia-suggest, and it
needs less time/memory/CPU power (I have the analyzer and the SQL server
on the same computer).
But the next step of wikipedia-suggest for me is to write an SQL analyzer
(probably using OTL, otl.sf.net, and unixODBC), but unfortunately I will
not be able to write it before September.
Also there is a small diff for WSuggest.js to fix a small problem in my
autocomplete stuff. For example, suppose the user typed "Aer", then moved
the text cursor back to be between the 'A' and the 'e', typed 'm' (to
make "Amer"), then typed 'p' (to try and spell "Amper"). However,
in-between typing the 'm' and the 'p', the cursor position will jump to
the end of the text box to try and autocomplete "American", so the result
of pressing 'p' will be "Amerp", not "Amper". To prevent this, it will now
only try to autocomplete if the cursor position is at the end of the text
field. Diff is here:
http://files.nickj.org/MediaWiki/WSuggest.js-0.4-autocomplete-update.txt
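The caret check described above can be sketched roughly as follows. This is only an illustration, not the code from the diff: the function names caretIsAtEnd and maybeAutocomplete are hypothetical, and it assumes a text field that exposes the standard value, selectionStart and selectionEnd properties.

```javascript
// Hypothetical helper (not taken from WSuggest.js): true only when the
// caret sits at the very end of the field and nothing is selected.
function caretIsAtEnd(field) {
  var len = field.value.length;
  return field.selectionStart === len && field.selectionEnd === len;
}

// Hypothetical helper: appends the rest of `suggestion` to the typed text,
// but only when the caret is at the end of the field. Returns true if the
// field was changed.
function maybeAutocomplete(field, suggestion) {
  if (!caretIsAtEnd(field)) {
    return false; // user is editing mid-word: leave text and caret alone
  }
  var typed = field.value;
  // Only complete when the suggestion extends what was typed.
  if (suggestion.toLowerCase().indexOf(typed.toLowerCase()) !== 0) {
    return false;
  }
  field.value = typed + suggestion.slice(typed.length);
  // Select the appended part so the next keystroke replaces it.
  field.selectionStart = typed.length;
  field.selectionEnd = field.value.length;
  return true;
}
```

With the caret between the 'm' and the 'e' of "Amer" (selectionStart and selectionEnd both 2), maybeAutocomplete returns false and leaves the field untouched, so a following 'p' gives "Amper" as intended.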
Thank you, I have released wikipedia-suggest 0.41 with your contribution:
http://suggest.speedblue.org/tgz/wikipedia-suggest-0.41.tar.gz
Best Regards.
Julien Lemoine