On Sun, Jan 10, 2010 at 9:52 PM, William Pietri <william(a)scissor.com> wrote:
On 01/10/2010 06:12 PM, Gregory Maxwell wrote:
Ooh, that looks fun. If I wanted to investigate, I'd start here, yes?
http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/
Is the click data available, too?
It's not— but progress on this subject would probably be a good
justification for making some available.
Without the click data available, I'd suggest simply using the
stats.grok.se page-view data: it won't allow the system to learn how
preferences change as a function of query text, but it would let you
try out all the machinery.
I'd expect that static page popularity would be the obvious fill-in
data to use where click-through information is not available, in any
case. So, for example, if query X returns A, B, C, D, E and you only
know the user clicked B, then you can assume that B > [A, C, D, E], but
by mixing in the static popularity you could also decide that
B > D > E > A > C (because D, E, A, C is the popularity order of the
remaining pages).
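To make the above concrete, here's a hypothetical sketch of turning one
click plus static popularity into pairwise training preferences for a
rank-SVM; the result names and popularity numbers are made up:

```python
# Derive (better, worse) preference pairs from a single click plus
# static page-view popularity. Illustrative only; numbers are invented.

def pairwise_preferences(results, clicked, popularity):
    """The clicked result beats every other result; the unclicked
    results are ordered among themselves by static popularity."""
    pairs = [(clicked, r) for r in results if r != clicked]
    rest = sorted((r for r in results if r != clicked),
                  key=lambda r: popularity[r], reverse=True)
    # Chain adjacent pairs: rest[0] > rest[1] > ...
    pairs += list(zip(rest, rest[1:]))
    return pairs

results = ["A", "B", "C", "D", "E"]
popularity = {"A": 120, "B": 900, "C": 40, "D": 700, "E": 300}
print(pairwise_preferences(results, "B", popularity))
# With these made-up numbers this reproduces B > D > E > A > C.
```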
In order to use this kind of predictive modelling you need to do some
feature extraction. Basically, you take your input and convert it
into a feature vector: a multidimensional value which represents the
input as a finite set of floating-point numbers and which (hopefully)
exposes relevant information and ignores irrelevant information.
I've never used rank-SVM before, but for text classification with SVMs
it is pretty common to use the presence of words to construct a sparse
vector. E.g. after stripping out markup, every input word (or word
pair, or word fragment, or...) gets assigned a dimension. The vector
for a text has the value 1.0 in that dimension if the text contains the
word, 0.0 if it doesn't.
So, "the blue cat" might be [14:1.0 258:1.0 982:1.0], presuming that
"the" was assigned dimension 14, "blue" 258, and "cat" 982. The zillion
other possible dimensions are zero. Typical linear SVM classifiers work
reasonably well on highly sparse data like this, even if there are
hundreds of thousands of dimensions.
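As a tiny sketch of that sparse encoding (the word-to-dimension mapping
here is just the illustrative one from the example above):

```python
# Sparse word-presence features stored as {dimension: value}; all
# unlisted dimensions are implicitly zero.
dims = {"the": 14, "blue": 258, "cat": 982}  # assumed assignments

def features(text):
    """Map a text to a sparse {dimension: 1.0} presence vector."""
    return {dims[w]: 1.0 for w in text.lower().split() if w in dims}

print(features("the blue cat"))  # {14: 1.0, 258: 1.0, 982: 1.0}
```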
Full-text indexers like Lucene also do basically the same kind of thing
internally, usually after some folding/stemming (i.e.
[girls,gals,dames,female,lady,girl,womens] -> women) and elimination
of common words (e.g. "the"), so the Lucene tools may already be doing
most or all of the work you'd need for basic feature extraction.
It looks like for this rank-SVM I'd run the feature extraction on both
the query and the article and combine them into one vector for the
SVM. For example, you could assign a different value to a word's
dimension depending on where it appears (i.e. 2 if the word is in both
the query and the article, -1 if it's in the query but not the article,
0.5 if it's only in the article... etc.), or give query words different
dimension numbers than article words (i.e. if you're tracking 100,000
words, add 100,000 to the query-word dimension numbers). I have no clue
which of the infinite possible ways would work best; there may be some
suggestions in the literature, but there is no replacement for simply
trying a lot of approaches.
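The first scheme above (one shared dimension per word, valued by where
the word appears) could be sketched like this; the specific weights
2 / -1 / 0.5 are just the example values from the paragraph, not a
recommendation:

```python
# Combine query and article word sets into one sparse vector, using a
# different value depending on where each word appears. Weights are
# the illustrative ones from the text, not tuned values.

def combined(query_words, article_words, dims):
    vec = {}
    for w, d in dims.items():
        in_q = w in query_words
        in_a = w in article_words
        if in_q and in_a:
            vec[d] = 2.0      # word in both query and article
        elif in_q:
            vec[d] = -1.0     # word in query only
        elif in_a:
            vec[d] = 0.5      # word in article only
    return vec

dims = {"blue": 258, "cat": 982, "dog": 500}  # assumed assignments
print(combined({"blue", "cat"}, {"cat", "dog"}, dims))
```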
95% of the magic in making machine learning work well is coming up
with good feature extraction. For Wikipedia data, in addition to the
word-existence features often used for free text, the presence of
categories (i.e. each category is mapped to a dimension number) and
link-structure information (perhaps different values for words which
are linked? or only using wikilinked words as the article keys?) are
obvious things which could be added.
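Mapping categories to their own dimensions could look something like
this; the 100,000-word offset echoes the earlier example, and the
category names and numbering are entirely made up:

```python
# Hypothetical extension: append category-membership dimensions after
# the word dimensions. All names and offsets are illustrative.
WORD_DIMS = 100_000
cat_dims = {"Category:Cats": 0, "Category:Mammals": 1}  # assumed

def add_category_features(vec, categories):
    """Set a 1.0 in a category's dimension for each category the
    article belongs to, without disturbing the word dimensions."""
    for c in categories:
        if c in cat_dims:
            vec[WORD_DIMS + cat_dims[c]] = 1.0
    return vec

print(add_category_features({982: 1.0}, ["Category:Cats"]))
```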