Hello Nick,
* Nick Jenkins nickpj@gmail.com [2006-08-04 16:54:20 +1000]:
A few more suggestions / comments / ideas after compiling it:
== Small addition for README file ==
Perhaps add this to the end of the README file (about the library requirements); these worked for me:
# If you're on Debian/Ubuntu, you can probably install expat & icu with: apt-get install libicu34 libexpat1 libexpat1-dev libicu34-dev
# Then compile with: ./configure && make
Thank you for your contribution, I will add it to the README file (you will also need to install the cppunit development package for make check).
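For example, on Debian/Ubuntu the cppunit development package is probably called libcppunit-dev (the exact package name is an assumption and may differ by release):
apt-get install libcppunit-dev # assumed package name, needed only for make check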
== Pull data from MySQL? ==
Currently the data used by cmd/Analyzer appears to be generated from the XML data file (e.g. the 1.1 gigabyte XmlEnWikipedia.tar.bz2 file).
Would it be possible instead to generate the required data from a MySQL database? That way the whole XML step could be avoided by connecting directly to the database and generating the required data from it.
Also, this way you _may_ be able to get away with reading far less data, so it could potentially be quicker. It would also be more up-to-date, because there would be no intermediate step adding latency.
For example, this way you could get the number of links to each page by doing something like this: select pl_title, count(*) from pagelinks group by pl_title;
... and you could get the valid page names by doing something like this: select page_title from page where page_namespace = 0;
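For instance (a sketch only; the host, user, database name, and output file names below are made up), both result sets could be dumped to flat files with the mysql command-line client:
mysql -h dbhost -u wikiuser -p wikidb -B -e "select page_title from page where page_namespace = 0;" > valid_pages.txt
mysql -h dbhost -u wikiuser -p wikidb -B -e "select pl_title, count(*) from pagelinks group by pl_title;" > link_counts.txt # -B gives plain tab-separated output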
It is definitely possible, and it would probably be the best way to integrate it into Wikipedia. I preferred working with XML files for the proof of concept because I do not know the details of the Wikipedia database architecture (and I have other proofs of concept in mind :))
== Maybe include JS / HTML files ==
Maybe bundle the JS / HTML / PHP / Perl files that you're using at http://suggest.speedblue.org/ with the wikipedia-suggest.tar.gz archive? Basically I'd like to try and reproduce what you've got happening on your website so that I can play with it too ;-)
No problem, I will put them in the tarball.
== Documentation for Gumbies ==
Potentially stupid question, but how do I use the resulting executable files in the 'cmd' directory? I couldn't see documentation on this in the wikipedia-suggest.tar.gz archive (but it is version 0.1, so it's not unexpected). Or to put it differently, what series of commands did you use to get your http://suggest.speedblue.org/ site working?
The tarball contains only the analyzer/query binaries; the analyzer needs to be called with two arguments. The first one is a file containing all the XML filenames to analyze (one filename per line) and the second is the path to the redirect.xml file (more details about how to obtain these XML and redirect.xml files are given below).
The query and tcpquery binaries load the output of the analyzer to perform queries.
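For example, an invocation could look like this (the file names are only illustrative, based on the description above):
ls xml/*.xml > xml_list.txt # one XML filename per line
cmd/Analyzer xml_list.txt redirect.xml # produces the data files (fsa.bin and page.bin) that query/tcpquery load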
E.g. wget http://www2.speedblue.org/download/XmlEnWikipedia.tar.bz2 # How did you create XmlEnWikipedia.tar.bz2 by the way? E.g. Did it come from doing something to http://download.wikimedia.org/enwiki/20060717/enwiki-20060717-pages-articles... ?
Hum, I did not know about the availability of these XML archives. In fact, I wrote scripts to grab the whole content of Wikipedia. You will find details about the grabbing in the history section (June 2006): http://suggest.speedblue.org/history.php
cmd/Analyzer XmlEnWikipedia.tar.bz2 fsa.bin page.bin # (i.e. Not sure what creates fsa.bin & page.bin, or how to persuade it to do so)
fsa.bin and page.bin are created by the analyzer.
Then presumably either cmd/Query or cmd/TcpQuery is invoked on fsa.bin and page.bin, and somehow connected to in order to query for results. # What's the difference between these two versions? E.g. is one the DiskQuery implementation, and the other the MemoryQuery implementation? Or is it just how you connect to them (e.g. one via TCP/IP, the other via some other method?)
* cmd/Query is a command-line version using DiskQuery.
* cmd/TcpQuery is a TCP/IP version using DiskQuery.
* DiskQuery does not load the automaton and articles into memory; it keeps only a file descriptor and performs seeks (I implemented it because I have little RAM available on my server).
* MemoryQuery loads the automaton and articles into memory; it is thread-safe and you can use it to perform many queries per second.
Sorry for the lack of documentation.
Best Regards.
Julien Lemoine