Thank you for all your suggestions, very inspiring!
[response inline]

On 10/02/2016 21:33, Justin Ormont wrote:
Good hits on page two:

There are a few cases where good results could exist only on page two.

One case is when incorrectly searching for a homophone or other misspelling. Eg: "their red hot" instead of "they're red hot" (expected result -- wikipedia (pos 22), google (pos 1), bing (pos 2), ddg (pos 2)).

Indeed, we do a pretty bad job on this kind of query, and I still don't know how to address it correctly. We don't use any synonym resources yet. This is usually handled by the list of curated redirects; in this example we only catch "theyre red hot" and we fail for their/there/....
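
To make the redirect idea concrete, here is a minimal sketch of expanding a query through homophone groups and checking each variant against a redirect table. The `HOMOPHONES` groups, the `REDIRECTS` table and both function names are hypothetical and only illustrate the lookup; this is not how our search currently works.

```python
from itertools import product

# Hypothetical homophone groups; a real list would come from a curated resource.
HOMOPHONES = [
    {"their", "there", "they're", "theyre"},
]

# Hypothetical redirect table: normalized query -> target page title.
REDIRECTS = {"theyre red hot": "They're Red Hot"}


def query_variants(query):
    """Return every query obtained by swapping words for their homophones."""
    options = []
    for word in query.lower().split():
        group = next((g for g in HOMOPHONES if word in g), None)
        # Words without a homophone group are kept unchanged.
        options.append(sorted(group) if group else [word])
    return [" ".join(combo) for combo in product(*options)]


def resolve(query):
    """Return the redirect target of the first matching variant, or None."""
    for variant in query_variants(query):
        if variant in REDIRECTS:
            return REDIRECTS[variant]
    return None
```

With this table, `resolve("their red hot")` reaches the existing "theyre red hot" redirect. The number of variants grows multiplicatively with each ambiguous word, so a real version would need a cap.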



Another case is when you get an exact string match on incorrect pages, but only non-exact string match on the correct page. Eg: "Cities in the San Francisco Bay Area" (expected result -- wikipedia (pos 122), google (pos 1), bing (pos 1), ddg (pos 1)).

This pattern occurs mostly for navigational queries (only one correct result). For exploratory queries, odds are that one of the relevant results will be on page 1.

There are a couple of less direct cases. For instance, if/once you integrate a popularity score, freshness score, importance score, page query score, or personalization (e.g. ranking by physical distance from the user or by the user's interests), you'll find some examples where incorrect results are unhelpfully boosted.

You're completely right, and this is exactly the case here. We always rescore the top 8000 documents (per node) with the number of incoming links (which is far from ideal). With all the top-N rescoring features disabled, the expected result is now #2:

https://en.wikipedia.org/w/index.php?search=Cities+in+the+San+Francisco+Bay+Area&title=Special%3ASearch&go=Go&cirrusBoostLinks=no&cirrusPhraseWinwdow=1&cirrusPhraseWindow=1

We don't do anything smart here, it's always the same plan whatever the query is...
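
A rough sketch of how a top-N rescore on incoming links can push an exact match down. The window of 8000 is the figure mentioned above; the scoring formula, the `link_weight` parameter and the sample hits are assumptions made for illustration, not the function we actually use.

```python
import math


def rescore_top_n(hits, window=8000, link_weight=1.0):
    """Rescore the first `window` hits by boosting on incoming links.

    hits: list of (title, text_score, incoming_links), sorted by
    text_score in descending order. Hits beyond the window keep
    their original order.
    """
    head, tail = hits[:window], hits[window:]
    rescored = sorted(
        head,
        # Assumed formula: text score multiplied by a log-scaled link boost.
        key=lambda h: h[1] * (1 + link_weight * math.log1p(h[2])),
        reverse=True,
    )
    return rescored + tail


# Made-up numbers: an exact title match with few links vs. a popular page.
hits = [
    ("Exact match page", 10.0, 5),
    ("Popular loosely matching page", 6.0, 50000),
]
boosted = rescore_top_n(hits)
unboosted = rescore_top_n(hits, link_weight=0.0)
```

With the boost on, the popular page overtakes the exact match; with `link_weight=0.0` the text-score order is preserved, which is roughly what disabling the rescoring does in the URL above.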


Investigating queries which lead to clicks on page two may find interesting things popping out.

--

Knowing the SAT/DSAT-click-rate-vs.-position will tell you if good clicks often occur beyond position 10. Then running an experiment of 10 SERP results vs. 20 SERP results may give interesting insights when watching a session-success-rate metric (and maybe a time-to-success metric). That is, checking whether a click on position 11+ is almost ever useful, or just leads to a requery or abandonment. If you run result size experiments, you can normalize for the query latency effects by generating 20 and displaying 10.
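
As a sketch of the SAT-rate-by-position measurement, assuming a click log of (position, dwell time) pairs and treating a click as SAT when the dwell time reaches a threshold. The 30-second default, the function name and the log format are assumptions; use whatever SAT definition your logging supports.

```python
from collections import defaultdict


def sat_rate_by_position(clicks, sat_dwell_seconds=30):
    """Return {position: fraction of clicks counted as SAT}.

    clicks: iterable of (position, dwell_seconds) pairs.
    A click is counted as SAT when dwell_seconds >= sat_dwell_seconds.
    """
    total = defaultdict(int)
    sat = defaultdict(int)
    for position, dwell in clicks:
        total[position] += 1
        if dwell >= sat_dwell_seconds:
            sat[position] += 1
    return {pos: sat[pos] / total[pos] for pos in sorted(total)}


# Made-up sample log: two clicks on position 1, three on position 11.
sample = [(1, 45), (1, 5), (11, 60), (11, 2), (11, 90)]
rates = sat_rate_by_position(sample)
```

Comparing the rates for positions 11+ against positions 1-10 gives a first indication of whether page-two clicks tend to be useful before running the 10 vs. 20 results experiment.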

The need to scroll can cause a faster fall-off in the click rates listed. On my web browser, as it's currently sized, there are only three results above the fold (my open advanced facet block takes a lot of space; scrolling is required for result 4+). Knowing whether, and by how much, the click rate drops for results below the fold will also help optimize the number of results to display, snippet length, and UI design. You could instrument the number of results above the fold.

--

Side note: possible bug, I can't find the page "List of New York University alumni" when querying "New York University alumni" (screenshot).

Yes... I can usually find params to tweak the query and push good results near the top, but not here...
I'll have to dig into the details to see what's going on; the best I can do is a rank around pos 120 :(

Thank you very much for your help!

--
David