On Sun, Jan 10, 2010 at 9:52 PM, William Pietri <william@scissor.com> wrote:
On 01/10/2010 06:12 PM, Gregory Maxwell wrote:
If anyone feels adventurous:
Ooh, that looks fun. If I wanted to investigate, I'd start here, yes?
http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/
Is the click data available, too?
It's not, but progress on this subject would probably be a good justification for making some available.
Without the click data available, I'd suggest simply using the stats.grok.se page view data: It won't allow the system to learn how preferences change as a function of query text, but it would let you try out all the machinery.
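Something like this would pull a static popularity number per page, though it's totally untested and I'm guessing at the endpoint layout: the /json/<lang>/<yyyymm>/<title> path and the "daily_views" key are assumptions, so check the site for the real format.

import json
import urllib.parse
import urllib.request

def monthly_views(title, lang="en", month="201001"):
    # Assumed stats.grok.se layout: /json/<lang>/<yyyymm>/<title>
    url = "http://stats.grok.se/json/%s/%s/%s" % (
        lang, month, urllib.parse.quote(title))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Collapse the per-day counts into one static popularity score.
    return sum(data.get("daily_views", {}).values())

popularity = dict((t, monthly_views(t)) for t in ["Cat", "Dog"])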
I'd expect that static page popularity would be the obvious fill-in data to use where click-through information is not available, in any case. So, for example, if query X returns A, B, C, D, E and you only know the user clicked B, then you can assume that B > [A, C, D, E], but by mixing in the static popularity you could also decide that B > D > E > A > C (because D, E, A, C is the popularity order of the remaining pages).
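In code, that fill-in rule is just: clicked page first, everything else ordered by static popularity. A minimal sketch (the function name and the popularity numbers are made up for illustration):

def pairwise_prefs(results, clicked, popularity):
    """Turn one query's result list plus a single click into an
    ordered preference list usable as rank training data."""
    rest = [r for r in results if r != clicked]
    # Clicked page beats everything; the remainder is ordered by
    # static popularity, as in the B > D > E > A > C example.
    rest.sort(key=lambda r: popularity.get(r, 0), reverse=True)
    return [clicked] + rest

prefs = pairwise_prefs(["A", "B", "C", "D", "E"], "B",
                       {"A": 120, "B": 900, "C": 80, "D": 700, "E": 300})
# -> ["B", "D", "E", "A", "C"]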
In order to use this kind of predictive modelling you need to define some feature extraction. Basically, you take your input and convert it into a feature vector: a multidimensional value which represents the input as a finite set of floating-point numbers that (hopefully) exposes the relevant information and ignores the irrelevant.
I've never used rank-svm before, but for text classification with SVMs it is pretty common to use the presence of words to construct a sparse vector. E.g. after stripping out markup, every input word (or word pair, or word fragment, or...) gets assigned a dimension. The vector for a text has the value 1.0 in that dimension if the text contains the word, 0 if it doesn't.
So, "the blue cat" might be [14:1.0 258:1.0 982:1.0], presuming that the was assigned dimension 14, blue 258, cat 982. The zillion other possible dimensions are zero. Typical linear SVM classifiers work reasonable well on highly sparse data like this, even if there are hundreds of thousands of dimensions.
Full-text indexers like lucene also do basically the same kind of thing internally, usually after some folding/stemming (e.g. [girls,gals,dames,female,lady,girl,womens] -> women) and elimination of common words (e.g. "the"), so the lucene tools may already be doing most or all of the work you'd need for basic feature extraction.
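If you did want to do it by hand rather than reuse lucene's analyzers, the folding and stop-word steps are only a few lines (the tables here are stand-ins, not real stemmer output):

FOLD = {"girls": "women", "gals": "women", "dames": "women",
        "female": "women", "lady": "women", "girl": "women",
        "womens": "women"}
STOPWORDS = {"the", "a", "an", "of", "and"}

def normalize(words):
    # Drop common words, then fold variants onto one key, roughly
    # what lucene's stop filters and stemmers do internally.
    return [FOLD.get(w, w) for w in words if w not in STOPWORDS]

normalize(["the", "blue", "gals"])  # -> ["blue", "women"]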
It looks like for this rank-SVM I'd run the feature extraction on both the query and the article and combine them into one vector for the SVM. For example, you could assign a different value in the word dimension depending on where the word occurs (i.e. 2 if a word is in both, -1 if it's in the query but not the article, 0.5 if it's only in the article, etc.), or give query words different dimension numbers than article words (i.e. if you're tracking 100,000 words, add 100,000 to the query word dimension numbers). I have no clue which of the infinite possible ways would work best; there may be some suggestions in the literature, but there is no replacement for simply trying a lot of approaches.
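To make the two encodings concrete, here are both as sketches; the specific values (2/-1/0.5) and the 100,000 offset are the arbitrary choices from above, not recommendations:

VOCAB_SIZE = 100000  # assumed number of tracked word dimensions

def combine_joint(query_vec, article_vec):
    """One dimension per word; the value encodes where it occurred."""
    combined = {}
    for dim in set(query_vec) | set(article_vec):
        if dim in query_vec and dim in article_vec:
            combined[dim] = 2.0    # word in both query and article
        elif dim in query_vec:
            combined[dim] = -1.0   # query-only word
        else:
            combined[dim] = 0.5    # article-only word
    return combined

def combine_offset(query_vec, article_vec):
    """Separate dimension ranges: article words keep their numbers,
    query words are shifted up by VOCAB_SIZE."""
    combined = dict(article_vec)
    for dim, val in query_vec.items():
        combined[dim + VOCAB_SIZE] = val
    return combined

q = {14: 1.0, 258: 1.0}   # "the blue" as a query
a = {258: 1.0, 982: 1.0}  # an article containing "blue cat"
combine_joint(q, a)   # 258 -> 2.0, 14 -> -1.0, 982 -> 0.5
combine_offset(q, a)  # 258, 982 from the article; 100014, 100258 from the query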
95% of the magic in making machine learning work well is coming up with good feature extraction. For Wikipedia data, in addition to the word-existence features often used for free text, the presence of categories (i.e. each category mapped to a dimension number) and link-structure information (perhaps different values for words which are wikilinked, or only using wikilinked words as the article keys) are obvious things which could be added.
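As a sketch of what adding those features might look like (the "CAT:" prefix and the 1.5 boost for wikilinked words are invented for illustration; the right values would have to come from experiment):

def wiki_features(words, categories, linked_words, dims):
    """Word-existence vector extended with category membership
    and a boost for words that appear wikilinked."""
    vec = {}
    for w in words:
        vec[dims.setdefault(w, len(dims) + 1)] = 1.0
    for cat in categories:
        # Each category gets its own dimension, separate from words.
        vec[dims.setdefault("CAT:" + cat, len(dims) + 1)] = 1.0
    for w in linked_words:
        if w in dims:
            vec[dims[w]] = 1.5  # arbitrary boost for linked words
    return vec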