Thank you Trey!
These are all excellent ideas and I just added my 2 cents inline :)
On 22/07/2015 21:54, Trey Jones wrote:
Hey Wikimedia-search!
I’m Trey Jones, and I’m new to WMF (this is only my third week), and
I started this thread, though David really got it going.
There’s lots to digest here, and I’m sure I’ll retread certain ground
already covered, but below are my initial thoughts. Let me know if you
think any of these notes should end up in a wiki or Phab ticket
somewhere—I'm still trying to grok where to best document things. (And
think about everyone's comments, too, and whether they should be
copied elsewhere—it’s always a shame to lose track of good ideas.)
You're right; I think there are some Phab tickets where you can put
the ideas you described here.
=Meta stuff=
Sorry this message is so long. I didn’t have time to write a short
one. (Alas, this is my greatest weakness, but at least I can admit it.)
I’ve tried to label ideas that could use some additional discussion
with (L)etters at the beginning of the first relevant paragraph.
=Results from other wikis=
I agree with the general consensus that n-grams aren’t great for
language detection on short strings. A quick skim of literature
related to Oliver’s cite (Kolkus and Rehurek 2009) points to Naive
Bayes as a good method on short strings.
I did notice that the slides attached to the old Cybozu lang-detect
project home page mention that short strings are a problem—but the
slides are from 2010. David also mentioned that in his comments on
T104505. Is Cybozu lang-detect still a contender? Has anyone had a
chance to run either the latest version or the ES plugin on anything?
I've never used Cybozu inside the Elasticsearch plugin (but I can
confirm that it works poorly on small texts like tweets), and I don't
know if it's still a contender. But consider this citation (extracted
from http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf):
"This is in accordance with Rehurek and Kolkus (2009), who tried to
prove that dictionary-based methods are more reliable than
character-based systems for language identification with noisy short
texts among similar languages."
My understanding is that the method used by Kolkus and Rehurek is
dictionary-based (word unigrams), and that it will outperform Cybozu
(char n-gram based) on small texts. I think that's true if the text is
tweet-like, with short phrases, but it may not work properly for
names. This certainly deserves some testing on real data.
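For concreteness, here's a minimal sketch of the dictionary-based
(word-unigram) Naive Bayes approach; all the word counts below are
invented toy numbers, and real models would be built from wiki dumps
or query logs:

```python
import math
from collections import Counter

# Toy word-unigram "dictionaries"; the counts are made up for
# illustration, not taken from any real corpus.
MODELS = {
    "en": Counter({"the": 50, "of": 30, "search": 5, "results": 5}),
    "fr": Counter({"le": 40, "de": 35, "recherche": 5, "résultats": 5}),
}

def detect(query, models=MODELS, alpha=0.1):
    """Naive Bayes over word unigrams with add-alpha smoothing."""
    words = query.lower().split()
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts)
        # Sum of smoothed log-probabilities of each query word.
        score = sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab))
            for w in words
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

With single-word queries (like many names) the decision rests on one
lookup, which is where I'd expect this to get shaky.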
(A) I like the idea of running a cross-wiki test, though I can think
of a couple more ways to analyze the results than listed in T104505. I
assume there are plenty of repeats in the top-N “no-results” queries,
and probably a Zipf/power law distribution. (I’m very curious to see
what the distribution actually looks like. What’s the max frequency /
percentage over a day for a given zero-results query?)
So, it would make sense to me to track not only raw numbers, but also
weighted numbers if the distribution in the top-N is very unequal.[1]
And of course, the “zero result” decrease should be weighted. It might
also make sense to look at the distribution of “zero result decrease”
by number of additional wikis searched. For example, what if all 234
results from the French wiki for English queries (in David’s example
table in T104505) are subsumed by the 324 German wiki results. Is it
still worth searching in French?
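As a sketch of the raw-vs-weighted distinction (the query strings,
counts, and outcomes below are all invented):

```python
from collections import Counter

# Hypothetical sample of zero-result queries, one entry per search;
# real data would come from the query logs.
zero_result_queries = ["foo", "foo", "foo", "foo", "bar", "baz"]

# Hypothetical outcome of re-running each distinct query elsewhere.
now_has_results = {"foo": True, "bar": False, "baz": True}

counts = Counter(zero_result_queries)
total_searches = sum(counts.values())

# Raw decrease: fraction of *distinct* queries that now get results.
raw_decrease = sum(now_has_results[q] for q in counts) / len(counts)

# Weighted decrease: fraction of *searches* that now get results,
# which matters more when the head of the distribution is very heavy
# (Zipf-like).
weighted_decrease = sum(
    c for q, c in counts.items() if now_has_results[q]
) / total_searches
```

Here fixing the one frequent query moves the weighted number (5/6)
well past the raw one (2/3), which is exactly the gap worth tracking.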
Yes, you're right, I hadn't thought about that, and it's hard to
tell... I guess it will depend on the idea you described below related
to interwiki links.
This raises another question as we add more fall-back methods to
decrease the zero-result rate: how will we prioritize the fall-back
methods? I mean, if I can re-run a "Did you mean" query, and if I know
that running the original query against another wiki has a good chance
of giving results, which one should I try first?
[1] Caveat: it wouldn’t hurt to review the very top queries in any
sample by hand to look for trending topics that could skew the results
over a small time period. During the Women’s World Cup, I bet there
were more searches for names of various players, for example, than
there normally would be.
I think it's worth running this test regularly and seeing how the results change.
On the other hand—I read French much better than I read German—so I’d
prefer French results even if all the French results are duplicates of
the German results. Are results in a language I can’t read really any
better than no results?
This leads to a few new (to me) ideas:
(B) Make multilingual results configurable—If we know, say, the top
four wikis likely to give good results for queries from the English
wiki are Spanish, French, German, and Japanese, we could have an
expanding section (excuse any UI ugliness—someone with UI smarts can
help us figure out how to make it pretty, right?) to enable
multilingual searching, so on
English Wikipedia I could ask for “back up results” in Spanish and
French, but not German and Japanese. Store those settings in a cookie
for later, too, possibly with some UI indicator that multilingual
backup results are enabled. (Also, if the cookie is available at query
time, we could save unnecessary cross-wiki searches the user couldn’t
possibly use.)
Maybe there are sensible defaults per language?
(C) And/or, multilingual results could be an extra click—“we didn’t
find English wiki results, but we found results that match your query
in Spanish and German, would you like to see them?” with links on
“Spanish” and “German”. I’d click the Spanish link, not the German link.
(D) Another sneakier idea that came to mind—which may not be
technically plausible—would be to find good results in another
language and then check for links back to wiki articles in the wiki
the search came from. I do this manually when I find something Google
translate can’t handle in a confidence-inspiring way: I search on
Russian or Arabic Wikipedia, then look on the nav bar for the
“English” link. There are lots of options here—showing just the
English results with a link back to the language it went through, or
showing summaries for both, etc.
A silly example: search for “Виллальверния” in en wiki gives no
results. But there is a ru wiki page with that exact title. It has a
link to the English wiki page for “Villalvernia”. (Don’t ask why
someone is searching for the Russian name of a tiny Italian commune on
the English Wikipedia. The answer is “because multilingualism”.)
Search: Виллальверния
Results: Villalvernia (crosswiki link from *Виллальверния*)
I don't know if it's technically plausible, but AFAIK we have the
Wikibase ID in the index, so it should be pretty simple to extract it.
Interwiki links are stored in Wikidata; could we use WDQS for that
purpose? With the entity ID it should be easy to request the interwiki
link for a specific language. Is WDQS designed for this usage (a high
number of queries/sec on rather simple queries)?
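If WDQS turns out to be suitable, the lookup itself should be a small
SPARQL query. A rough Python sketch, assuming the schema:about /
schema:isPartOf pattern Wikidata uses to model sitelinks (whether WDQS
can sustain the query rate is the open question):

```python
# Given a Wikibase entity ID from our index, build a SPARQL query
# asking WDQS for that entity's sitelink on one specific wiki.
def sitelink_query(entity_id, lang):
    """Return a SPARQL query string for the interwiki (sitelink)
    target of `entity_id` on the `lang` Wikipedia."""
    return (
        "SELECT ?article WHERE {\n"
        f"  ?article schema:about wd:{entity_id} ;\n"
        f"           schema:isPartOf <https://{lang}.wikipedia.org/> .\n"
        "}"
    )
```

The alternative, if the Wikibase ID is already in our index, might be
to index the sitelinks themselves and skip the extra service call.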
(E) Another simpler idea than language detection would be basic
character set detection. A query in Cyrillic might get better results
from the Russian, Ukrainian, and Bulgarian wikis than the French and
German ones, even if French and German do better overall. Similarly
Arabic script and perhaps the Arabic, Persian, and Urdu wikis.
This might also be a reason why decent language detection is okay if
it is computationally much cheaper than excellent detection—we don’t
have to commit to “the one true answer”; maybe we could search the top
two or three other wikis.
Yes, I think Cybozu can help here to do what you describe, and it
would be relatively "cheap".
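A character-set heuristic along these lines can be very cheap indeed.
Here's a rough Python sketch using Unicode character names; the
script-to-wiki mapping is an illustrative guess, not a tuned choice:

```python
import unicodedata

# Illustrative guess at which wikis to try first for each script;
# the right lists would come out of the cross-wiki test data.
SCRIPT_TO_WIKIS = {
    "CYRILLIC": ["ru", "uk", "bg"],
    "ARABIC": ["ar", "fa", "ur"],
}

def candidate_wikis(query):
    """Pick fallback wikis from the script of the query's letters.
    Unicode character names start with the script name, e.g.
    "CYRILLIC SMALL LETTER A", so the first word is a cheap proxy."""
    for ch in query:
        if ch.isalpha():
            script = unicodedata.name(ch, "?").split()[0]
            if script in SCRIPT_TO_WIKIS:
                return SCRIPT_TO_WIKIS[script]
    return []
```

An empty result (e.g. for Latin-script queries) would mean falling
back to whatever the real language detector says.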
=Misspellings=
(F) I had a good chat with Erik earlier this afternoon, and I just
mentioned his “saerch” example that’s in T104468. Having recently
looked at the ES suggester docs at David’s suggestion, I asked Erik
about the prefix length… he was able to quickly find that it’s set to
2, so only words that start with the two letters “sa” could ever be
suggested. As Erik suggested in T104468, this would be a great
less-performant option to try if we get no results (or crappy
results)—we could loosen the params, for example going back to
prefix=1. For zero results, this may make sense—but the old suggestion
Erik noted, /saeqeh,/ and the current one, /samech,/ both seem kinda
unlikely—we could probably quantify that, esp. with some user feedback.
And we should definitely look at the various params and decide what
are reasonable settings for “cheap and good” and what’s “more
expensive but better”.
Reducing the prefix length to 1 character can hurt performance, and
it's certainly a good idea to do this in two passes, as Erik suggested.
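To make the two-pass idea concrete, here's a sketch of the
term-suggester request body we might send on each pass; the field name
"title" and the exact numbers are placeholders to tune, with
prefix_length and min_doc_freq being the real suggester options from
the ES docs:

```python
def suggest_body(text, cheap=True):
    """Build an ES term-suggester request body. With prefix_length=2,
    only candidates sharing the first two letters ("sa" for "saerch")
    can ever be suggested; the looser second pass drops it to 1."""
    return {
        "suggest": {
            "spelling": {
                "text": text,
                "term": {
                    "field": "title",
                    "prefix_length": 2 if cheap else 1,
                    # Drop very rare index terms (often typos already
                    # present in the wiki) from the candidate pool.
                    "min_doc_freq": 2,
                },
            }
        }
    }
```

The cheap body would be the default; the expensive one would only run
when the first pass comes back with zero (or crappy) results.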
While working on prefixes I tried to analyze data from a Simple
English Wikipedia dump and extracted the distribution of term
frequency by prefix length. I haven't managed to make good use of the
data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here:
https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset but if you find it useful maybe we
could try on a larger one?
David’s idea of a spelling dictionary makes sense, in that it limits
the scope of possibilities to compare against. But it probably won’t
handle names, or, probably, technical terms (e.g., “phonestheme”—or,
in hard mode, its plural).
It would be interesting to see the results of dropping the long tail
from what ES considers a match—min_doc_freq (
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-suggeste…
) would help with that.
(How concerned are we with finding spelling errors in the wiki based
on a properly spelled search term? I used to hunt for and correct
commonly misspelled words in en wiki as a hobby.)
My point here is (in the long term): maybe it's difficult to build
good suggestions from the data directly, so why not build a custom
dictionary/index to handle "Did you mean" suggestions? According to
https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from
search queries to build these suggestions. Is this something worth trying?
=Misc=
(G) Another interesting question: if we end up implementing several
options for improving search results, we will have to figure out how
to stage them and in what order to try/test them.
And of course almost all of these will make more sense once we've
looked at some query data. That's my next task—to get access myself
and start trying to decide what seems most likely to have most impact.
Okay… I’m running out of steam a little, so I’m going to wrap it up
for now. I’ll think more about David’s comments on the three Epics and
maybe some other replies later.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
[removed the old message because it was too big]