Thanks for the response. Mine inline below.
On Wed, Apr 2, 2014 at 8:11 AM, Nikolas Everett <neverett(a)wikimedia.org> wrote:
On Tue, Apr 1, 2014 at 7:19 PM, Adam Baso <abaso(a)wikimedia.org> wrote:
+mobile-l
mobile-l recipients, if replying, if you would please reply-all in case
any people on the CC: line aren't on mobile-l, it would be appreciated.
Nik,
Thanks for the update. Glad to hear there's even faster performance
coming and also that there's no need to structure too much fallback stuff
depending on whether the reflex time is okay. With any luck, it would be
just fast enough. I don't think there'd be too much hammering on the
suggest term; only if the resultset is insufficient does it seem like it
would make sense to orchestrate the client side (or server side, for that
matter) call. The apps do have a key tap timer thing on them to help avoid
spurious searching, so that should help. I think I understand the
ellipsis-related stuff - parsing the snippet text is no problem, but if there's an
even simpler way to get text condensed to the point where there's no work
to avoid wrapping on most form factors, cool!...and if I misunderstood,
well, we'll get to the bottom of that on Friday.
I suppose this is close to my heart because I just worked on it, but if
you chop the snippet on the client side it defeats the logic used to pick
the "best snippet". That logic isn't that great in Cirrus now but it'll
get a whole lot better when we deploy the new highlighter. Right now the
snippet is always 150 characters or something +/- 20ish characters on each
side to find a word break. We pick the best snippet based on hits in the
150 character window. At minimum we should let you configure it to
something that'll fit better. I suppose the best option would be to
configure up font widths that matter and then use them to chop really
accurately. For the most part that sounds pretty simple and quick to
implement so long as we're ok with estimates that ignore stuff like
ligatures.
Yeah, I think the snippet data is sweet. Two options - short (50 characters
wide) and normal (150 characters) - seem sufficient. That way there are
fewer cached objects using the 3 or 5 char or n-gram thing discussed later.
Given non-monospaced rendered ligatures (that is, as far as I can tell) I
think your hint to ignore ligature byte width is a pretty pragmatic
approach.
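
To make the width-based chopping idea concrete, here is a rough Python sketch. The width table, the units, and the names char_width and chop_to_width are all made up for illustration; real widths would have to be measured for the fonts that matter, and ligatures are ignored as suggested above.

```python
# Hypothetical per-character width estimates in arbitrary units.
# Real values would come from the fonts used on each form factor.
NARROW = set("iljtf.,;:'!| ")
WIDE = set("mwMW@")


def char_width(ch: str) -> float:
    """Estimate the rendered width of one character, ignoring ligatures."""
    if ch in NARROW:
        return 0.5
    if ch in WIDE:
        return 1.5
    return 1.0


def chop_to_width(text: str, max_width: float) -> str:
    """Return the longest prefix of text whose estimated width fits in
    max_width, backing up to the last space when the cut lands mid-word."""
    used = 0.0
    for i, ch in enumerate(text):
        used += char_width(ch)
        if used > max_width:
            cut = text[:i]
            # Prefer a word break if the prefix contains one.
            if not text[i].isspace() and " " in cut:
                cut = cut[:cut.rfind(" ")]
            return cut.rstrip()
    return text
```

Chopping like this would presumably need to happen where the snippet is chosen, so the "best snippet" logic still sees the final window size.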
Do the type timers fire the hit on the leading character or on a slight
hesitation? Prefix search on the site is leading character and then it
<cancels> the request if the user types more. That is silly because I
can't cancel the request on the backend.... If it triggers on a hesitation
then we should just plow ahead, I think. If it triggers on leading
characters then we should totally cache requests shorter than N characters.
3 or 5 or something.
Oops, I may have spoken too soon.
On Android, after keypress it fires if 300ms have elapsed and another
keypress hasn't occurred.
On iOS, it looks like it is actually firing immediately on each keypress
(will double check that, though). The iOS client-side code does try to
cancel fired search events such that the latest search string is used; that
may or may not mean requests go unfired at the server. But, as you say,
once a proper HTTP request has been received at the origin, server-side
processing is likely to occur. From this I think we should probably look
into making the iOS stuff have similar 300ms behavior if I've not
misread the code.
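
For comparison, here is a small Python sketch of the 300ms behavior described for Android. It is a simulation over timestamped keypresses rather than real timer code from either app, and the function name and event format are hypothetical.

```python
def debounced_queries(keypresses, quiet_ms=300):
    """Given (time_ms, text_after_keypress) events in time order, return the
    search strings that would be sent if a search fires only once quiet_ms
    have elapsed after a keypress with no further keypress."""
    fired = []
    for i, (t, text) in enumerate(keypresses):
        is_last = i == len(keypresses) - 1
        # A keypress leads to a request only if nothing follows it in time.
        if is_last or keypresses[i + 1][0] - t >= quiet_ms:
            fired.append(text)
    return fired


# Typing "cat" quickly, pausing, then adding "s".
events = [(0, "c"), (100, "ca"), (180, "cat"), (600, "cats")]
requests = debounced_queries(events)
```

With these events only "cat" and "cats" would reach the server, versus four requests when firing on every keypress.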
Caching on an upper bound of 3 or 5 characters (or some upperbounded n-gram
value?) could be a very solid way to make search screaming fast for the
common cases for these list=search searches, particularly with
CirrusSearch. I really like that idea.
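
A minimal Python sketch of the short-query caching idea follows. Where such a cache would actually live (HTTP cache, application layer, client) isn't settled in this thread; the class name, the backend callable, and the least-recently-used eviction are all assumptions for illustration.

```python
from collections import OrderedDict


class ShortQueryCache:
    """Cache results only for queries of at most max_query_len characters."""

    def __init__(self, backend, max_query_len=3, max_entries=1000):
        self.backend = backend            # hypothetical callable: query -> results
        self.max_query_len = max_query_len
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def search(self, query):
        # Longer queries are too varied to be worth caching; pass them through.
        if len(query) > self.max_query_len:
            return self.backend(query)
        if query in self._entries:
            self._entries.move_to_end(query)   # mark as recently used
            return self._entries[query]
        result = self.backend(query)
        self._entries[query] = result
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return result
```

The bound on query length keeps the number of distinct cached keys small, which is the appeal of the 3 or 5 character cutoff.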
On a related matter, I'll try to remember to bring up title versus
non-title statistical bias in the search ranking. Speaking of such
algorithms...
From the code:
// If we were constraining the namespace set, we would probably use
// 0|1|2|3|4|5|6|7|12|13|14|15|100|101|108|109|446|447
// to keep it related more to article, article talk, policy,
// policy talk, help, and help talk types of resources.
// The odd numbered Talk pages could even be withheld, but that's
// sort of pointless when the number of backlinks to them is
// likely to be small, meaning they won't turn up too much
// unless they're a result(set) of last resort, or the user
// went to the trouble of prefix namespace searching such as
// Talk:Cats.
// But realistically, it's probably easier to just stick to
// not defining a namespace constraint set, and thereby (likely)
// getting more pre-cached responses, due to other consumers leading
// or following suit. There's a school of thought, or there could be,
// that says only namespace 0 should be searched here, as it's
// the core article content. But users may practically want
// categories, too. And such logic spirals out from there.
// If we were instead using the opensearch API and were seeking
// parity with the desktop and mobile web experience, we should
// indeed as of 27-March-2014 only be searching namespace 0.
// But as CirrusSearch will be the norm and server load is expected
// to handle things just fine (no fallback is necessary per Search team),
// higher quality search results now can be obtained anyway.
Cirrus searches all wgContentNamespaces by default and it is optimized to
do so. All non-content namespaces are in another index so we don't have to
pay attention to it during the request. We also don't have to filter by
namespace at all.
Each namespace has a weight factor that influences its position. That
factor often ends up being more important than links. Links are "score *
log(incoming_links + 2)" and the weights vary from "score * 1" (MAIN) to
"score * 0.0025" (TEMPLATE_TALK). Our power users expect these because
lsearchd did it. Mobile users, who knows.
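
As a worked example of how much the namespace weight can outweigh links, here is a small Python sketch. It assumes the two factors simply multiply and that the log is natural; neither is stated above, and the function name is made up. Only the two weights quoted above are included.

```python
import math

# The two weights quoted in this thread; other namespaces fall in between.
NAMESPACE_WEIGHT = {"MAIN": 1.0, "TEMPLATE_TALK": 0.0025}


def weighted_score(score, incoming_links, namespace):
    """Combine text score, link boost, and namespace weight.
    Assumption: the factors multiply and log is the natural log."""
    link_boost = math.log(incoming_links + 2)
    return score * link_boost * NAMESPACE_WEIGHT[namespace]


# Same text score: a lightly linked article vs. a heavily linked template talk page.
article = weighted_score(1.0, 10, "MAIN")
template_talk = weighted_score(1.0, 10000, "TEMPLATE_TALK")
```

Under these assumptions the article with 10 incoming links still scores roughly a hundred times higher than the template talk page with 10000.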
Cool. Understood. If the main namespace is biased higher, I think mobile
users will usually be pretty happy. And for power users on mobile I think
they'll be pleasantly surprised that they can Namespace: search.
// With all of this considered, we want a request of the following format
// //en.m.wikipedia.org/w/api.php?action=query&list=search&srsearch=cats…
// Note that MobileFrontend's use of opensearch has its result
// set limited at 15. Note also that the 'srprop' only keeps 'snippet',
// and 'sectiontitle' plus the 'title' field which is always implicit.
// This buys us some additional features once we're ready for them,
// all the while populating the cache.
// We probably also will want to add 'srinterwiki=1' in some future
// state so that users don't have to change their language to
// search setting. As it is, 'srinterwiki' is not yet in place
// and the format of such results may look a bit different,
// so it's probably best to hold off on 'srinterwiki=1'. We are
// not yet using the snippets and section titles, but let's get the
// cache populated for our sake and everyone else's sake.
Interwiki is coming but I'd give it a few months, I think.
Cool. Something to look at later, then, in terms of whether it's on by
default, how results are biased based on current-wiki primary language,
availability of articles on other wikis, charset, etc.
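
For reference, a request along the lines described in the code comment above could be built like this in Python. The full parameter list is elided in the quoted comment, so srlimit=15 and format=json here are assumptions (the 15 is borrowed from the opensearch limit mentioned there), and build_search_url is a made-up helper name.

```python
from urllib.parse import urlencode


def build_search_url(term, limit=15, host="en.m.wikipedia.org"):
    """Build a list=search API URL that keeps only snippet and sectiontitle."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srprop": "snippet|sectiontitle",  # 'title' is always returned
        "srlimit": limit,                  # assumption: not in the quoted URL
        "format": "json",                  # assumption: not in the quoted URL
    }
    return "https://" + host + "/w/api.php?" + urlencode(params)


url = build_search_url("cats")
```

Keeping the parameter set and ordering identical across clients matters if the goal is to share cached responses.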
// NOTE:
// Although as of 27-March-2014 it seems that suggestions may not be coming
// back for CirrusSearch as frequently as for Lucene, that's probably
// just an artifact of relatively lower training of suggestions.
// In other words, it's likely that the suggestion pairing will grow.
// Currently, we're not examining [@"query"][@"searchinfo"][@"suggestion"],
// but we could. There are two cases for the suggestion.
// 1. When the result set is of length 0, just fire off a search with the suggestion.
//    This is the case where the user probably mis-spelled something.
// 2. When the result set is of short length (less than 5?), fire another search with
//    the suggestion, and then collate those search results /after/ the first result set.
The suggestion is actually better than you give it credit for: even if
lots of results show up, a suggestion might be useful if we provide one. It
comes from redirect and title names and it'll suggest combinations that
work. So if the user searches for "picket's charge" it'll suggest
"pickett's charge" even though there are plenty of results for the first
term. The results for the second term are better.
Okay, let's discuss on Friday!
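
As input for Friday, here is a Python sketch of the two suggestion cases from the code comment. The search callable, its (results, suggestion) return shape, and the de-duplication step are assumptions for illustration; it does not cover Nik's point that a suggestion can be worth using even when there are plenty of results.

```python
def search_with_suggestion(search, term, short_threshold=5):
    """search(term) is a hypothetical callable returning (titles, suggestion),
    where suggestion is None when the backend offers none."""
    results, suggestion = search(term)
    if not suggestion or len(results) >= short_threshold:
        return results
    suggested_results, _ = search(suggestion)
    if not results:
        # Case 1: nothing found, probably a misspelling; use the suggestion.
        return suggested_results
    # Case 2: short result set; collate suggestion results after the originals,
    # skipping titles already present (an added assumption).
    seen = set(results)
    return results + [r for r in suggested_results if r not in seen]
```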
The reason you get different results is because the implementations are
vastly different. The Cirrus implementation has less tuning but is "more
modern". Whatever that is worth.
Magic!
Nik
Thanks.