Hello Nick,
* Nick Jenkins <nickpj(a)gmail.com> [2006-08-04 16:54:20 +1000]:
A few more suggestions / comments / ideas after
compiling it:
== Small addition for README file ==
Perhaps add this to the end of the README file (about library
requirements) - these worked for me:
# If you're on Debian/Ubuntu, you can probably install expat & icu with:
apt-get install libicu34 libexpat1 libexpat1-dev libicu34-dev
# Then compile with:
./configure
make
Thank you for your contribution, I will add it to the README file (you
will also need to install the cppunit-dev package for make check).
== Pull data from MySQL? ==
Currently the data used by cmd/Analyzer appears to be generated from
the XML data file (e.g. the 1.1 gigabyte XmlEnWikipedia.tar.bz2 file).
Could it be possible instead to generate the required data from a
MySQL database? That way the whole XML step could be avoided by just
connecting directly to the database, and generating the required data
directly from the database.
Also, this way you _may_ be able to get away with reading far less
data, so it could potentially be quicker. It would also be more
up-to-date, because there would be no intermediate step adding to the
latency.
For example, this way you could get the number of links to each page
by doing something like this:
select pl_title, count(*) from pagelinks group by pl_title;
... and you could get the valid page names by doing something like this:
select page_title from page where page_namespace = 0;
It is definitely possible, and it would probably be the best way to
integrate it into Wikipedia. I preferred working with XML files to produce
a proof of concept because I do not know the details of the Wikipedia
database architecture (and I have other proofs of concept in mind :))
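For example, assuming a standard MediaWiki schema, the extraction could be
as simple as the following sketch (the database name, user, and output file
are placeholders, not anything from the actual setup):

```shell
# Sketch only - database name, user and output file are placeholders.
cat > extract.sql <<'EOF'
SELECT pl_title, COUNT(*) FROM pagelinks GROUP BY pl_title;
SELECT page_title FROM page WHERE page_namespace = 0;
EOF
# mysql -u wikiuser -p wikidb < extract.sql > rawdata.txt
```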
== Maybe include JS / HTML files ==
Maybe bundle the JS / HTML / PHP / Perl files that you're using at
http://suggest.speedblue.org/ with the wikipedia-suggest.tar.gz
archive? Basically I'd like to try and reproduce what you've got
happening on your website so that I can play with it too ;-)
No problem, I will put them in the tarball.
== Documentation for Gumbies ==
Potentially stupid question, but how do I use the resulting executable
files in the 'cmd' directory? I couldn't see documentation on this in
the wikipedia-suggest.tar.gz archive (but it is version 0.1, so it's
not unexpected). Or to put it differently, what series of commands did
you use to get your
http://suggest.speedblue.org/ site working?
The tarball contains only the analyzer/query binaries. The analyzer needs
to be called with two arguments: the first one is a file containing all
the XML filenames to analyze (one filename per line), and the second is
the path to the redirect.xml file (more details about how to obtain these
XML and redirect.xml files are given below).
The query and tcpquery binaries load the output of the analyzer to perform
queries.
Hmm, I did not know about the availability of these XML archives. In fact,
I wrote scripts to grab the whole content of Wikipedia. You will find
details about the grabbing in the history section (June 2006):
http://suggest.speedblue.org/history.php
cmd/Analyzer XmlEnWikipedia.tar.bz2 fsa.bin page.bin
# (i.e. not sure what creates fsa.bin & page.bin, or how to persuade
# it to do so)
fsa.bin and page.bin are created by analyzer.
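So an invocation would look something like the sketch below (the directory
and file names are only an illustration, not what I actually use):

```shell
# Illustration only - directory and file names are placeholders.
# Build the file list expected as the first argument (one XML file per line):
printf '%s\n' articles/Paris.xml articles/London.xml > filelist.txt
# Then run the analyzer with the file list and the path to redirect.xml;
# it produces fsa.bin and page.bin:
# cmd/Analyzer filelist.txt redirect.xml
```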
Then presumably either cmd/Query or cmd/TcpQuery is invoked on fsa.bin
and page.bin, and somehow connected to in order to query for results.
# What's the difference between these two versions? E.g. is one the
DiskQuery implementation, and the other the MemoryQuery
implementation? Or is it just how you connect to them (e.g. one via
TCP/IP, the other via some other method?)
* cmd/Query is a command-line version using DiskQuery
* cmd/TcpQuery is a TCP/IP version using DiskQuery
* DiskQuery does not load the automaton and articles into memory; it keeps
only a file descriptor and performs seeks (I implemented it because I have
little RAM available on my server).
* MemoryQuery loads the automaton and articles into memory; it is
thread-safe and you can use it to perform a lot of queries per second.
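To give the idea of the difference in miniature (a sketch of the two
strategies, not the project's actual code): DiskQuery seeks into the file
for every lookup, while MemoryQuery reads the whole file once and answers
from RAM afterwards.

```shell
# Miniature sketch of the two lookup strategies (placeholder data).
printf 'ParisLondon' > articles.bin
# DiskQuery style: keep the data on disk and seek per lookup
# (here: skip the first 5 bytes, read the next 6-byte "record").
dd if=articles.bin bs=1 skip=5 count=6 2>/dev/null
# MemoryQuery style: load everything once, then slice in memory.
data=$(cat articles.bin)
echo "${data:0:5}"
```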
Sorry for the lack of documentation.
Best Regards.
Julien Lemoine