Late last week, while looking over our existing scoring methods, I was thinking that while counting incoming links is nice, a couple of guys famously dominated search with (among other things) a better way to judge the quality of incoming links: PageRank.
PageRank takes a very simple input: just a list of all links between pages. We happen to already store all of these in Elasticsearch. I wrote a few scripts to pull out the full enwiki graph (~400M edges), ship it over to stat1002, load it into Hadoop, and crunch it with a few hundred cores. The end result is a score for every NS_MAIN page in enwiki, based on the quality of its incoming links.
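For anyone curious what the computation actually looks like, here's a toy sketch of PageRank via power iteration over an edge list. This is illustrative only: the edge list, damping factor, and iteration count are placeholders, not the actual enwiki data or the parameters used in the Hadoop job.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank from a list of (source, target) page pairs."""
    pages = {p for edge in edges for p in edge}
    out_links = {p: [] for p in pages}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a baseline share, plus rank flowing in over links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling pages redistribute their rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
        rank = new_rank
    return rank

# Tiny example graph: C is linked from both A and B, so it ranks highest.
ranks = pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
```

The real job does the same iteration, just sharded across a few hundred cores instead of a single dict.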
I've taken these calculated PageRank scores and used them as the scoring method for search-as-you-type on http://en-suggesty.wmflabs.org.
Overall this seems promising as another scoring metric to integrate into our search results. I'm not sure yet how to answer questions like how much weight PageRank should have in the final score; this might be yet another case where building out our relevance lab would let us make more informed decisions.
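To make the weighting question concrete, one naive option would be a linear blend of the text-relevance score and a rescaled PageRank. Everything here is a placeholder to be tuned (e.g. via the relevance lab): the weight, the log rescaling, and the 1e6 scale factor are all assumptions, not anything we've decided on.

```python
import math

def blended_score(text_score, pagerank, weight=0.3):
    """Blend a text-relevance score with PageRank; weight is hypothetical."""
    # Log-rescale PageRank so a handful of hugely-linked pages
    # don't completely dominate the text signal.
    return (1 - weight) * text_score + weight * math.log1p(pagerank * 1e6)
```

Even deciding between a linear blend like this and a multiplicative boost is exactly the kind of thing we'd want to A/B test.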
Overall, I think some sort of pipeline from Hadoop into our scoring system could be quite useful. The initial idea is to crunch the data in Hadoop, stuff it into a read-only API, and then query it back out at indexing time so the scores are held within the Elasticsearch docs. I'm not sure what the best approach will be, but having a simple, repeatable way to calculate scoring info in Hadoop and ship it into ES will probably become more and more important.
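As a rough sketch of the last mile of that pipeline, here's how a Hadoop export (page_id<TAB>score lines) could be turned into bulk partial updates against existing ES docs. The index name ("enwiki_content") and field name ("pagerank") are hypothetical; the action format is the standard one the elasticsearch-py bulk helper accepts.

```python
def score_actions(lines, index="enwiki_content"):
    """Turn 'page_id<TAB>score' lines into bulk partial-update actions."""
    for line in lines:
        page_id, score = line.rstrip("\n").split("\t")
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": page_id,
            "doc": {"pagerank": float(score)},
        }

# Actually shipping them requires a running cluster, e.g.:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(), score_actions(open("scores.tsv")))
```

Partial updates mean we wouldn't have to reindex the whole doc just to refresh the score, which matters if we end up recomputing these regularly.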
Thanks for the summary, Erik! It sounds very promising to me, and it's logical that we should use page views to affect the weight of the results. But, of course, we should be careful not to weight page views so highly that we give the system criticality and create a positive feedback loop, where random fluctuations in page views push irrelevant results up in the scoring, which gets them more page views, which pushes them up further, and so on.
Dan
On 21 September 2015 at 08:07, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
Sorry, I got myself confused here. PageRank and page views are different concepts; I was thinking about a related task, https://phabricator.wikimedia.org/T113439, to use page views to improve scoring when I wrote this.
Thanks, Dan
On 24 September 2015 at 20:22, Dan Garry dgarry@wikimedia.org wrote:
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation
We will certainly have to be careful with how the page views factor into things. I'm not sure yet what kind of weighting we'll put on PageRank or on page views; it will be interesting to figure out!
On Thu, Sep 24, 2015 at 8:22 PM, Dan Garry dgarry@wikimedia.org wrote:
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery