Late last week, while looking over our existing scoring methods, I was thinking that while counting incoming links is nice, a couple of guys came to dominate search with (among other things) a better way to judge the quality of incoming links, aka PageRank.

PageRank takes a very simple input: it just needs a list of all links between pages. We happen to already store all of these in elasticsearch. I wrote a few scripts to suck out the full enwiki graph (~400M edges), ship it over to stat1002, throw it into hadoop, and crunch it with a few hundred cores. The end result is a score for every NS_MAIN page in enwiki based on the quality of its incoming links.
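
To make the "list of links in, score per page out" shape concrete, here is a minimal single-machine sketch of PageRank by power iteration. It is not the hadoop job described above; the function name, the damping factor of 0.85, and the fixed iteration count are assumptions picked for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over an iterable of (source, target) links.

    Illustrative sketch only: damping and iterations are assumed defaults.
    """
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.update((src, dst))
        if src != dst:  # ignore self-links
            out_links.setdefault(src, set()).add(dst)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every page splits its damped rank equally among its link targets.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = [("A", "C"), ("B", "C"), ("C", "A")]
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 4))
```

In the toy graph, C is linked from two pages and ends up ahead of A, while B has no incoming links and only gets the baseline share.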

I've taken these calculated PageRank scores and used them as the scoring method for search-as-you-type on http://en-suggesty.wmflabs.org.

Overall this seems promising as another scoring metric to integrate into our search results. I'm not sure yet how to answer questions like how much weight PageRank should have in the score. This might be yet another area where building out our relevance lab would enable us to make more informed decisions.
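
As a rough illustration of what the weight question looks like in practice, the sketch below blends a text relevance score with PageRank through a single tunable weight. The function name, the multiplicative log-scaled formula, and the example numbers are all assumptions, not how our scoring works today; the point is only that the weight is a knob that can flip the ordering of results.

```python
import math


def combined_score(text_score, pagerank, n_pages, weight=0.5):
    """Hypothetical blend of text relevance and PageRank.

    pagerank * n_pages rescales the score so an average page sits near 1;
    log1p keeps very popular pages from swamping the text score.
    weight=0 falls back to the text score alone.
    """
    boost = math.log1p(pagerank * n_pages)
    return text_score * (1.0 + weight * boost)


if __name__ == "__main__":
    n_pages = 1000
    # (title, text_score, pagerank): made-up numbers for illustration.
    candidates = [("popular page", 1.0, 0.01), ("obscure page", 1.2, 0.0001)]
    for weight in (0.0, 1.0):
        ranked = sorted(
            candidates,
            key=lambda c: combined_score(c[1], c[2], n_pages, weight),
            reverse=True,
        )
        print(weight, [title for title, _, _ in ranked])
```

Picking the weight is exactly the kind of choice that would need to be checked against real queries rather than guessed.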

More generally, I think some sort of pipeline from hadoop into our scoring system could be quite useful. The initial idea seems to be to crunch data in hadoop, stuff it into a read-only API, and then query it back out at indexing time so the values are held within the ES docs. I'm not sure what the best way will be, but having a simple and repeatable way to calculate scoring info in hadoop and ship it into ES will probably become more and more important.
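
A minimal sketch of the indexing-time half of that idea, under stated assumptions: the hadoop output is taken to be tab-separated "page_id, score" lines, an in-memory dict stands in for the read-only API, and the `pagerank` field name on the document is hypothetical. None of this reflects an existing implementation.

```python
def load_scores(lines):
    """Parse 'page_id<TAB>score' lines (assumed export format) into a dict."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page_id, score = line.split("\t")
        scores[int(page_id)] = float(score)
    return scores


def enrich_document(doc, scores, default=0.0):
    """Return a copy of doc with a hypothetical 'pagerank' field attached.

    Pages missing from the score table get the default value.
    """
    enriched = dict(doc)
    enriched["pagerank"] = scores.get(doc["page_id"], default)
    return enriched


if __name__ == "__main__":
    exported = ["12\t0.004\n", "34\t0.0001\n"]
    scores = load_scores(exported)
    print(enrich_document({"page_id": 12, "title": "Example"}, scores))
```

Keeping the lookup behind a small function like this means the dict could later be swapped for a real read-only service without touching the indexing code.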