Hello Nick,
* Nick Jenkins nickpj@gmail.com [2006-08-04 16:54:20 +1000]:
A few more suggestions / comments / ideas after compiling it:
== Small addition for README file ==
Perhaps add this to the end of the README file (about the library requirements); these worked for me:
# If you're on Debian/Ubuntu, you can probably install expat & icu with: apt-get install libicu34 libexpat1 libexpat1-dev libicu34-dev
# Then compile with: ./configure && make
Thank you for your contribution, I will add it to the README file (you will also need to install the cppunit development package for make check).
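For example, on Debian/Ubuntu the cppunit development package is probably called libcppunit-dev (the exact package name is an assumption and may differ by release):
apt-get install libcppunit-dev # assumed package name, needed only for make check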
== Pull data from MySQL? ==
Currently the data used by cmd/Analyzer appears to be generated from the XML data file (e.g. the 1.1 gigabyte XmlEnWikipedia.tar.bz2 file).
Would it be possible instead to generate the required data from a MySQL database? That way the whole XML step could be avoided by connecting directly to the database and generating the required data from it.
Also, this way you _may_ be able to get away with reading far less data, so it could potentially be quicker. It would also be more up-to-date, because there would be no intermediate step adding latency.
For example, this way you could get the number of links to each page by doing something like this: select pl_title, count(*) from pagelinks group by pl_title;
... and you could get the valid page names by doing something like this: select page_title from page where page_namespace = 0;
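For instance (a sketch only; the host, user, database name, and output file names below are made up), both result sets could be dumped to flat files with the mysql command-line client:
mysql -h dbhost -u wikiuser -p wikidb -B -e "select page_title from page where page_namespace = 0;" > valid_pages.txt
mysql -h dbhost -u wikiuser -p wikidb -B -e "select pl_title, count(*) from pagelinks group by pl_title;" > link_counts.txt # -B gives plain tab-separated output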
It is definitely possible, and it would probably be the best way to integrate it into Wikipedia. I preferred working with XML files for the proof of concept because I do not know the details of the Wikipedia database architecture (and I have other proofs of concept in mind :))
== Maybe include JS / HTML files ==
Maybe bundle the JS / HTML / PHP / Perl files that you're using at http://suggest.speedblue.org/ with the wikipedia-suggest.tar.gz archive? Basically I'd like to try and reproduce what you've got happening on your website so that I can play with it too ;-)
No problem, I will put them in the tarball.
== Documentation for Gumbies ==
Potentially stupid question, but how do I use the resulting executable files in the 'cmd' directory? I couldn't see documentation on this in the wikipedia-suggest.tar.gz archive (but it is version 0.1, so it's not unexpected). Or to put it differently, what series of commands did you use to get your http://suggest.speedblue.org/ site working?
The tarball contains only the analyzer/query binaries; the analyzer needs to be called with two arguments. The first one is a file containing all the XML filenames to analyze (one filename per line) and the second is the path to the redirect.xml file (more details about how to obtain these XML and redirect.xml files are given below).
The query and tcpquery binaries load the output of the analyzer to perform queries.
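For example, an invocation could look like this (the file names are only illustrative, based on the description above):
ls xml/*.xml > xml_list.txt # one XML filename per line
cmd/Analyzer xml_list.txt redirect.xml # produces the data files (fsa.bin and page.bin) that query/tcpquery load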
E.g. wget http://www2.speedblue.org/download/XmlEnWikipedia.tar.bz2 # How did you create XmlEnWikipedia.tar.bz2 by the way? E.g. Did it come from doing something to http://download.wikimedia.org/enwiki/20060717/enwiki-20060717-pages-articles... ?
Hum, I did not know about the availability of these XML archives. In fact, I wrote scripts to grab the whole content of Wikipedia. You will find details about the grabbing in the history section (June 2006): http://suggest.speedblue.org/history.php
cmd/Analyzer XmlEnWikipedia.tar.bz2 fsa.bin page.bin # (i.e. Not sure what creates fsa.bin & page.bin, or how to persuade it to do so)
fsa.bin and page.bin are created by the analyzer.
Then presumably either cmd/Query or cmd/TcpQuery is invoked on fsa.bin and page.bin, and somehow connected to in order to query for results. # What's the difference between these two versions? E.g. is one the DiskQuery implementation, and the other the MemoryQuery implementation? Or is it just how you connect to them (e.g. one via TCP/IP, the other via some other method?)
* cmd/Query is a command-line version using DiskQuery.
* cmd/TcpQuery is a TCP/IP version using DiskQuery.
* DiskQuery does not load the automaton and articles into memory; it keeps only a file descriptor and performs seeks (I implemented it because I have little RAM available on my server).
* MemoryQuery loads the automaton and articles into memory; it is thread-safe and you can use it to perform many queries per second.
Sorry for the lack of documentation.
Best Regards.
Julien Lemoine