This thread started between a few of us, but has some good ideas and thoughts. Forwarding into the search mailing list (where we will endeavour to have these conversations in the future).
Erik B

---------- Forwarded message ----------
From: Oliver Keyes <okeyes@wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse@wikimedia.org>
Cc: Trey Jones <tjones@wikimedia.org>, Erik Bernhardson <ebernhardson@wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes okeyes@wikimedia.org wrote:
On 22 July 2015 at 10:55, David Causse dcausse@wikimedia.org wrote:
On 22/07/2015 15:21, Oliver Keyes wrote:
Thanks; much appreciated. Point 3 directly relates to my work so it's good to be CCd :).
FWIW, this kind of detail on the specific things we're doing is missing from the main search mailing list and could be used very much there to inform people.
I agree, my intent right now is still to learn from each other and build/use a friendly environment where engineers with an NLP background like Trey can work efficiently. When things are clearer it'd be great to share our plan.
Oliver is already handling the executor IDs and distinguishing full and prefix search, so nyah ;p.
Great!
Just to be sure: does this mean that a search count will be reduced to its executorID:
- all requests with the same executorID return zero results -> add 1 to the zero result counter
- if one of the requests returns a result -> do not increment the zero result counter
If yes I think this will be the killer patch for Q1 :)
Executor IDs are stored and if a match is found in executor IDs <=120 seconds after that one, the later outcome is considered "the outcome". If not, we assume no second round-trip was made and so go with whatever happened first.
So if you make a request and it round-trips once and fails, failure. Round-trip once and succeeds, success. Round-trip twice and fail both times, failure. Round-trip twice and fail the first time and succeed the second - one success, zero failures :). Erik wrote it, and I grok the logic.
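The rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (executor ID, timestamp in seconds, result count) and the name `count_outcomes` are hypothetical, and this is not the real log format or Erik's script. Requests that arrive later than the 120-second window are simply ignored here, which is also an assumption.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # a later request within this window replaces the first outcome


def count_outcomes(records):
    """Count one success or one failure per executor ID.

    records: iterable of (executor_id, timestamp_seconds, result_count).
    Returns (successes, failures).
    """
    by_executor = defaultdict(list)
    for executor_id, ts, hits in records:
        by_executor[executor_id].append((ts, hits))

    successes = failures = 0
    for events in by_executor.values():
        events.sort()  # oldest request first
        first_ts, outcome_hits = events[0]
        for ts, hits in events[1:]:
            if ts - first_ts <= WINDOW_SECONDS:
                outcome_hits = hits  # the later outcome is "the outcome"
        if outcome_hits > 0:
            successes += 1
        else:
            failures += 1
    return successes, failures
```

With this counting, a search that fails on the first round-trip and succeeds on the second contributes one success and no failure, matching the cases listed above.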
On the language detection - actually Kolkus and Rehurek published a work in 2009 that handles small amounts of text really really well (n-gram based approaches /suck at this/) and there's a Java implementation I've been playing with. Want me to run it across some search strings and we can look at the results? Or just send the code across.
If you ask I'd say both! ;)
We evaluated this kind of dictionary-based language detection (but not this one specifically). The problem for us was mostly performance: it takes time to tokenize the input string correctly and the dictionary we used was rather big. But we worked mainly on large content (webnews, press articles). In our case input strings should be very small so it makes more sense. We should be able to train the dictionary against the "all title in ns0" dumps though.
This is also a great example to explain why I feel stuck sometimes: How will you be able to test it?
- I'm not allowed to download search logs locally.
- I think I won't be able to install java and play with this kind of tool on fluorine.
Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE, right? If yes to all three, I don't see a problem with me squirting you a sample of logs (and the Java). I figure if we find the methodology works we can look at speedups to the code, which is a lot easier a task than looking at fast code and trying to improve the methodology.
Another point: concerning the tasks described below, I think they overlap with analytics tasks (because they're mainly related to learning from search logs). I don't know how you work today and maybe this is something you've already done or is obviously wrong. I think you're one of the best people today to help us sort this out, so your feedback concerning the following lines will be greatly appreciated :)
Thanks!
Yes! Okay, thoughts on the below:
1. Build a search log parser - we sort of have that through the streaming python script. It depends whether you mean a literal parser or something to pick out all the "important" bits. See point 4.
2. Big machine: I'd love this. But see point 4.
3. Improve search logs for us: when we say improve for us do we mean for analytics/improvements purposes? Because if so we've been talking about having the logs in HDFS which would make things pretty easy for all and sundry and avoid the need for a parser.
One way of neatly handling all of this would be:
1. Get the logs in a format that has the fields we want and stream it into Hadoop. No parser necessary.
2. Stick the big-ass machine in the analytics cluster, where it has default access to Hadoop and can grab data trivially, but doesn't have to break anyone else's stuff.
3. Fin.
What am I missing? Other than "setting up a MediaWiki kafka client is going to be kind of a bit of work".
On 22/07/2015 14:38, David Causse wrote:
It's still not very clear in my mind but things could look like :
- Epic: Build a toolbox to learn from search logs
  - Create a script to run search queries against the production index
  - Build a search log parser that provides all the needed details: time, search type, wiki origin, target search index, search query, search query ID, number of results, offset of the results (search page) (side note: Erik, will it be possible to pass the queryID from page to page when the user clicks "next page"?)
  - Have a decent machine (64g RAM would be great) in the production cluster where we can
    - download production search logs
    - install the tools we want
    - stress it without being afraid to kill it
    - do all the stuff we want to learn from data and search logs
- Epic: Improve search logs for us
  - Add an "incognito parameter" to cirrus that could be used by the toolbox script so that it does not pollute our search logs when running our "search script".
  - Add a log when the user clicks on a search result to have a mapping between the queryID, the result chosen and the offset of the chosen link in the result list.
    - This task is certainly complex and highly depends on the client. I don't know if we will be able to track this down on all clients but it'd be great for us.
  - More things will be added as we learn
- Epic: start to measure and control relevance
  - Create a corpus of search queries for each wiki with their expected results
  - Run these queries weekly/monthly and compute the F1-Score for each wiki
  - Continuously enhance the search queries corpus
  - Provide a weekly/monthly perf score for each wiki

As you can see this is mostly about tools. I propose to start with batch tools and think later about how we could make this more real-time.
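
The weekly or monthly relevance score from the last epic could be computed along these lines. This is a minimal sketch, assuming the corpus is a mapping from query to the set of expected titles and that `search` is any callable returning a list of titles; both names are hypothetical and nothing here talks to a real index.

```python
def f1_score(expected, returned):
    """F1 of one query: harmonic mean of precision and recall."""
    expected, returned = set(expected), set(returned)
    true_positives = len(expected & returned)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(returned)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def wiki_f1(corpus, search):
    """Mean F1 over a corpus {query: expected titles} for one wiki."""
    scores = [f1_score(expected, search(query)) for query, expected in corpus.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per-query F1 is one possible choice; pooling all results before computing precision and recall would weight queries with many expected results more heavily.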
-- Oliver Keyes Research Analyst Wikimedia Foundation
Hey Wikimedia-search!
I’m Trey Jones, and I’m new to WMF (this is only my third week), and I started this thread, though David really got it going.
There’s lots to digest here, and I’m sure I’ll retread certain ground already covered, but below are my initial thoughts. Let me know if you think any of these notes should end up in a wiki or Phab ticket somewhere—I'm still trying to grok where to best document things. (And think about everyone's comments, too, and whether they should be copied elsewhere—it’s always a shame to lose track of good ideas.)
=Meta stuff=
Sorry this message is so long. I didn’t have time to write a short one. (Alas, this is my greatest weakness, but at least I can admit it.)
I’ve tried to label ideas that could use some additional discussion with (L)etters at the beginning of the first relevant paragraph.
=Results from other wikis=
I agree with the general consensus that n-grams aren’t great for language detection on short strings. A quick skim of literature related to Oliver’s cite (Kolkus and Rehurek 2009) points to Naive Bayes as a good method on short strings.
I did notice that the slides attached to the old Cybozu lang-detect project home page mention that short strings are a problem—but the slides are from 2010. David also mentioned that in his comments on T104505. Is Cybozu lang-detect still a contender? Has anyone had a chance to run either the latest version or the ES plugin on anything?
(A) I like the idea of running a cross-wiki test, though I can think of a couple more ways to analyze the results than listed in T104505. I assume there are plenty of repeats in the top-N “no-results” queries, and probably a Zipf/power law distribution. (I’m very curious to see what the distribution actually looks like. What’s the max frequency / percentage over a day for a given zero-results query?)
So, it would make sense to me to track not only raw numbers, but also weighted numbers if the distribution in the top-N is very unequal.[1] And of course, the “zero result” decrease should be weighted. It might also make sense to look at the distribution of “zero result decrease” by number of additional wikis searched. For example, what if all 234 results from the French wiki for English queries (in David’s example table in T104505) are subsumed by the 324 German wiki results? Is it still worth searching in French?
[1] Caveat: it wouldn’t hurt to review the very top queries in any sample by hand to look for trending topics that could skew the results over a small time period. During the Women’s World Cup, I bet there were more searches for names of various players, for example, than there normally would be.
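To make the raw versus weighted distinction concrete, here is a small sketch. The data shapes and function names are made up for illustration: `query_counts` maps each zero-result query to how often it was seen, and `hits_by_wiki` maps a wiki to the set of those queries it can answer.

```python
def coverage_stats(query_counts, hits_by_wiki):
    """Return {wiki: (raw_share, weighted_share)} of zero-result queries rescued."""
    known = set(query_counts)
    total_queries = len(query_counts)
    total_volume = sum(query_counts.values())
    stats = {}
    for wiki, answered in hits_by_wiki.items():
        rescued = answered & known
        raw = len(rescued) / total_queries
        weighted = sum(query_counts[q] for q in rescued) / total_volume
        stats[wiki] = (raw, weighted)
    return stats


def marginal_gain(query_counts, hits_by_wiki, order):
    """Query volume newly rescued by each wiki when wikis are tried in `order`."""
    known = set(query_counts)
    covered = set()
    gains = []
    for wiki in order:
        new = (hits_by_wiki[wiki] & known) - covered
        gains.append((wiki, sum(query_counts[q] for q in new)))
        covered |= new
    return gains
```

A wiki whose marginal gain is zero after the others have been searched is the "French subsumed by German" case: it adds nothing to the zero-result decrease, although it may still be the language the user prefers to read.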
On the other hand—I read French much better than I read German—so I’d prefer French results even if all the French results are duplicates of the German results. Are results in a language I can’t read really any better than no results?
This leads to a few new (to me) ideas:
(B) Make multilingual results configurable—If we know, say, that the top four wikis likely to give good results for queries from the English wiki are the Spanish, French, German, and Japanese ones, we could have an expanding section (excuse any UI ugliness—someone with UI smarts can help us figure out how to make it pretty, right?) to enable multi-lingual searching, so on English Wikipedia I could ask for “back up results” in Spanish and French, but not German and Japanese. Store those settings in a cookie for later, too, possibly with some UI indicator that multilingual backup results are enabled. (Also, if the cookie is available at query time, we could save unnecessary cross-wiki searches the user couldn’t possibly use.)
(C) And/or, multilingual results could be an extra click—“we didn’t find English wiki results, but we found results that match your query in Spanish and German, would you like to see them?” with links on “Spanish” and “German”. I’d click the Spanish link, not the German link.
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from. I do this manually when I find something Google translate can’t handle in a confidence-inspiring way: I search on Russian or Arabic Wikipedia, then look on the nav bar for the “English” link. There are lots of options here—showing just the English results with a link back to the language it went through, or showing summaries for both, etc.
A silly example: search for “Виллальверния” in en wiki gives no results. But there is a ru wiki page with that exact title. It has a link to the English wiki page for “Villalvernia”. (Don’t ask why someone is searching for the Russian name of a tiny Italian commune on the English Wikipedia. The answer is “because multilingualism”.)

Search: Виллальверния
Results: Villalvernia (crosswiki link from *Виллальверния*)
(E) Another simpler idea than language detection would be basic character set detection. A query in Cyrillic might get better results from the Russian, Ukrainian, and Bulgarian wikis than the French and German ones, even if French and German do better overall. Similarly Arabic script and perhaps the Arabic, Persian, and Urdu wikis.
This might also be a reason why decent language detection is okay if it is computationally much cheaper than excellent detection—we don’t have to commit to “the one true answer”; maybe we could search the top two or three other wikis.
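A rough sketch of the character set idea, using only the Python standard library. It relies on the fact that Unicode character names begin with the script name (for example "CYRILLIC SMALL LETTER A"). The script-to-wiki table is a placeholder taken from the examples above, not a measured ranking, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Placeholder mapping; a real list would have to come from query data.
WIKIS_BY_SCRIPT = {
    "CYRILLIC": ["ru", "uk", "bg"],
    "ARABIC": ["ar", "fa", "ur"],
}


def dominant_script(query):
    """Return the most common script among the letters of `query`, or None."""
    scripts = Counter()
    for ch in query:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # First word of the character name is the script.
                scripts[name.split()[0]] += 1
    if not scripts:
        return None
    return scripts.most_common(1)[0][0]


def candidate_wikis(query):
    """Wikis worth trying as a fallback for this query, possibly none."""
    return WIKIS_BY_SCRIPT.get(dominant_script(query), [])
```

Because this returns a short list rather than a single answer, it fits the "search the top two or three other wikis" approach without committing to one language.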
=Misspellings=
(F) I had a good chat with Erik earlier this afternoon, and I mentioned his “saerch” example that’s in T104468. Having recently looked at the ES suggester docs at David’s suggestion, I asked Erik about the prefix length… he was able to quickly find that it’s set to 2, so only words that start with the two letters “sa” could ever be suggested. As Erik suggested in T104468, this would be a great less-performant option to try if we get no results (or crappy results)—we could loosen the params, for example going back to prefix=1. For zero results, this may make sense—but the old suggestion Erik noted, *saeqeh,* and the current one, *samech,* both seem kinda unlikely—we could probably quantify that, esp. with some user feedback.
And we should definitely look at the various params and decide what are reasonable settings for “cheap and good” and what’s “more expensive but better”.
David’s idea of a spelling dictionary makes sense, in that it limits the scope of possibilities to compare against. But it probably won’t handle names, or, probably, technical terms (e.g., “phonestheme”—or, in hard mode, its plural).
It would be interesting to see the results of dropping the long tail from what ES considers a match—min_doc_freq ( https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-suggester... ) would help with that.
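As a sketch of the "cheap first, more expensive second" idea, the two passes could be expressed as two suggest request bodies. `prefix_length` and `min_doc_freq` are options of the Elasticsearch term suggester; everything else here is an assumption for illustration: the field name `suggest`, the suggestion name `did_you_mean`, and the helper names are hypothetical, and this is not the actual CirrusSearch configuration.

```python
def suggest_body(text, prefix_length, min_doc_freq=0.0, field="suggest"):
    """Build a term-suggester request body (hypothetical field and names)."""
    return {
        "suggest": {
            "did_you_mean": {
                "text": text,
                "term": {
                    "field": field,
                    # Number of leading characters that must match exactly.
                    "prefix_length": prefix_length,
                    # Drop candidates that appear in too few documents.
                    "min_doc_freq": min_doc_freq,
                },
            }
        }
    }


def suggestion_passes(text):
    """Yield request bodies from cheapest to most expensive."""
    yield suggest_body(text, prefix_length=2)  # current, cheaper setting
    yield suggest_body(text, prefix_length=1)  # looser, only if pass one finds nothing
```

The caller would send the first body and only fall through to the second when the first returns no usable suggestion.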
(How concerned are we with finding spelling errors in the wiki based on a properly spelled search term? I used to hunt for and correct commonly misspelled words in en wiki as a hobby.)
=Misc=
(G) Another interesting question: if we end up implementing several options for improving search results, we will have to figure out how to stage them and in what order to try/test for them.
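One way to picture the staging question is a chain of fallbacks tried in a configurable order. This is only a sketch with hypothetical names; each strategy is any callable that takes the query and returns a list of results.

```python
def search_with_fallbacks(query, strategies):
    """Try each (name, search_fn) pair in order.

    Returns (winning_name, results, tried_names); winning_name is None
    when every strategy came back empty.
    """
    tried = []
    for name, search_fn in strategies:
        tried.append(name)
        results = search_fn(query)
        if results:
            return name, results, tried
    return None, [], tried
```

Keeping the list of tried strategies makes it possible to log which fallback rescued a query, which is the data needed to decide on the best order.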
And of course almost all of these will make more sense once we've looked at some query data. That's my next task—to get access myself and start trying to decide what seems most likely to have most impact.
Okay.. I’m running out of steam a little, so I’m going to wrap it up for now. I’ll think more about David’s comments on the three Epics and maybe some other replies later.
—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Wed, Jul 22, 2015 at 2:57 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
[quoted copy of the forwarded thread trimmed]
Thank you Trey! These are all excellent ideas and I just added my 2 cents inline :)
On 22/07/2015 21:54, Trey Jones wrote:
Hey Wikimedia-search!
I’m Trey Jones, and I’m new to WMF (this is only my third week), and I started this thread, though David really got it going.
There’s lots to digest here, and I’m sure I’ll retread certain ground already covered, but below are my initial thoughts. Let me know if you think any of these notes should end up in a wiki or Phab ticket somewhere—I'm still trying to grok where to best document things. (And think about everyone's comments, too, and whether they should be copied elsewhere—it’s always a shame to lose track of good ideas.)
You're right, I think there are some Phab tickets where you can put the ideas you described here.
=Results from other wikis=
I agree with the general consensus that n-grams aren’t great for language detection on short strings. A quick skim of literature related to Oliver’s cite (Kolkus and Rehurek 2009) points to Naive Bayes as a good method on short strings.
I did notice that the slides attached to the old Cybozu lang-detect project home page mention that short strings are a problem—but the slides are from 2010. David also mentioned that in his comments on T104505. Is Cybozu lang-detect still a contender? Has anyone had a chance to run either the latest version or the ES plugin on anything?
I never used cybozu inside the elasticsearch plugin (but I can confirm that it works poorly on small texts like tweets) and I don't know if it's still a contender. But consider this citation, extracted from http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf:

"This is in accordance with Rehurek and Kolkus (2009), who tried to prove that dictionary-based methods are more reliable than character-based systems for language identification with noisy short texts among similar languages."

I understand that the method used by Kolkus and Rehurek is dictionary-based (word unigrams)? It will outperform cybozu (char ngram based) on small texts. I think it's true if the text is like tweets with short phrases but may not work properly for names? This certainly deserves some test on real data.
(A) I like the idea of running a cross-wiki test, though I can think of a couple more ways to analyze the results than listed in T104505. I assume there are plenty of repeats in the top-N “no-results” queries, and probably a Zipf/power law distribution. (I’m very curious to see what the distribution actually looks like. What’s the max frequency / percentage over a day for a given zero-results query?)
So, it would make sense to me to track not only raw numbers, but also weighted numbers if the distribution in the top-N is very unequal.[1] And of course, the “zero result” decrease should be weighted. It might also make sense to look at the distribution of “zero result decrease” by number of additional wikis searched. For example, what if all 234 results from the French wiki for English queries (in David’s example table in T104505) are subsumed by the 324 German wiki results? Is it still worth searching in French?
Yes, you're right, I didn't think about that and it's hard to tell... I guess it will depend on the idea you described below related to interwiki links. This raises another question as we add more fall-back methods to decrease the zero result rate. How will we prioritize the fall-back methods? I mean if I can re-run a "Did you mean" query and if I know that running the original query against another wiki has good chances to give results which one should I try first?
[1] Caveat: it wouldn’t hurt to review the very top queries in any sample by hand to look for trending topics that could skew the results over a small time period. During the Women’s World Cup, I bet there were more searches for names of various players, for example, than there normally would be.
I think it's worth running this test regularly and see how results change.
On the other hand—I read French much better than I read German—so I’d prefer French results even if all the French results are duplicates of the German results. Are results in a language I can’t read really any better than no results?
This leads to a few new (to me) ideas:
(B) Make multilingual results configurable—If we know, say, that the top four wikis likely to give good results for queries from the English wiki are the Spanish, French, German, and Japanese ones, we could have an expanding section (excuse any UI ugliness—someone with UI smarts can help us figure out how to make it pretty, right?) to enable multi-lingual searching, so on English Wikipedia I could ask for “back up results” in Spanish and French, but not German and Japanese. Store those settings in a cookie for later, too, possibly with some UI indicator that multilingual backup results are enabled. (Also, if the cookie is available at query time, we could save unnecessary cross-wiki searches the user couldn’t possibly use.)
Maybe there are sensible defaults per language?
(C) And/or, multilingual results could be an extra click—“we didn’t find English wiki results, but we found results that match your query in Spanish and German, would you like to see them?” with links on “Spanish” and “German”. I’d click the Spanish link, not the German link.
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from. I do this manually when I find something Google translate can’t handle in a confidence-inspiring way: I search on Russian or Arabic Wikipedia, then look on the nav bar for the “English” link. There are lots of options here—showing just the English results with a link back to the language it went through, or showing summaries for both, etc.
A silly example: search for “Виллальверния” in en wiki gives no results. But there is a ru wiki page with that exact title. It has a link to the English wiki page for “Villalvernia”. (Don’t ask why someone is searching for the Russian name of a tiny Italian commune on the English Wikipedia. The answer is “because multilingualism”.)

Search: Виллальверния
Results: Villalvernia (crosswiki link from *Виллальверния*)
I don't know if it's technically plausible but AFAIK we have the wikibase id in the index so it should be pretty simple to extract it. Interwiki links are stored in wikidata, could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (high number of queries/sec on rather simple queries)?
(E) Another simpler idea than language detection would be basic character set detection. A query in Cyrillic might get better results from the Russian, Ukrainian, and Bulgarian wikis than the French and German ones, even if French and German do better overall. Similarly Arabic script and perhaps the Arabic, Persian, and Urdu wikis.
This might also be a reason why decent language detection is okay if it is computationally much cheaper than excellent detection—we don’t have to commit to “the one true answer”; maybe we could search the top two or three other wikis.
Yes, I think cybozu can help here to do what you describe and will be relatively "cheap".
=Misspellings=
(F) I had a good chat with Erik earlier this afternoon, and I mentioned his “saerch” example that’s in T104468. Having recently looked at the ES suggester docs at David’s suggestion, I asked Erik about the prefix length… he was able to quickly find that it’s set to 2, so only words that start with the two letters “sa” could ever be suggested. As Erik suggested in T104468, this would be a great less-performant option to try if we get no results (or crappy results)—we could loosen the params, for example going back to prefix=1. For zero results, this may make sense—but the old suggestion Erik noted, /saeqeh,/ and the current one, /samech,/ both seem kinda unlikely—we could probably quantify that, esp. with some user feedback.
And we should definitely look at the various params and decide what are reasonable settings for “cheap and good” and what’s “more expensive but better”.
Reducing the prefix length to 1 char can hurt performance, and it's certainly a good idea to do this in 2 passes as Erik suggested.
While working on prefixes I tried to analyze data from a simple wiki dump and extracted the distribution of term frequency by prefix length. I haven't managed to make good use of the data yet but I'm sure you will :) I described a way to analyze the content we have in the index here: https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset but if you find it useful maybe we could try on a larger one?
David’s idea of a spelling dictionary makes sense, in that it limits the scope of possibilities to compare against. But it probably won’t handle names, or, probably, technical terms (e.g., “phonestheme”—or, in hard mode, its plural).
It would be interesting to see the results of dropping the long tail from what ES considers a match—min_doc_freq ( https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-suggester... ) would help with that.
(How concerned are we with finding spelling errors in the wiki based on a properly spelled search term? I used to hunt for and correct commonly misspelled words in en wiki as a hobby.)
My point here is (in the long term): maybe it's difficult to build good suggestions from data directly, so why not build a custom dictionary/index to handle "Did you mean" suggestions? According to https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from search queries to build these suggestions. Is this something worth trying?
[removed the old message because it was too big]
To keep the message size down, I'm going to trim heavily..
=Results from other wikis=
I understand that the method used by Kolkus and Rehurek is dictionary-based (word unigrams)? It will outperform cybozu (char n-gram based) on small texts. I think that's true if the text is like tweets with short phrases, but it may not work properly for names? This certainly deserves some testing on real data.
Yeah, names are a pain for lots of reasons. n-grams may help categorize them ethnolinguistically, similarly to language identification, but that doesn't tell you where to search. For example, Célia Šašić is a German footballer with a French first name and Croatian last name (by marriage)—and she's not in the Croatian wiki, though she is in English, German, French, and others. Did I mention that names are a pain?
A couple more interesting ideas from a couple of papers (though there's always a danger of falling down the literature rabbit hole):
Looking at tweets: http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf - good results on tweets with Naive Bayes classifier built on words, and decent results with a simple ranked list of the top N words - in both cases they added simple suffix scoring to get what I think of as the best bit of n-grams
Looking at "query-style" texts: http://www.uni-weimar.de/medien/webis/publications/papers/lipka_2010.pdf - claim good results with a Naive Bayes classifier built on n-grams—though they use 4-grams and 5-grams
But, yeah, everything comes down to whether it is fast and easy to implement, and how it performs on real data.
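To make the n-gram Naive Bayes idea concrete, here is a toy sketch. It is not the method from either paper: it uses character trigrams, add-one smoothing and a uniform prior, and the two training sentences are made up, so it only illustrates the mechanics.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded with spaces at the edges."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.vocab = set()

    def train(self, lang, text):
        grams = char_ngrams(text, self.n)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)
        self.vocab.update(grams)

    def classify(self, text):
        """Return the language with the highest log-likelihood (uniform prior)."""
        vocab_size = len(self.vocab)
        scores = {}
        for lang, counts in self.counts.items():
            denom = self.totals[lang] + vocab_size
            scores[lang] = sum(
                math.log((counts[g] + 1) / denom)
                for g in char_ngrams(text, self.n)
            )
        return max(scores, key=scores.get)


clf = NgramNaiveBayes()
# Made-up training text; real use would need far more data per language.
clf.train("en", "the quick brown fox jumps over the lazy dog and the cat")
clf.train("fr", "le renard brun rapide saute par dessus le chien paresseux et le chat")
print(clf.classify("the dog"), clf.classify("le chien"))
```

Whether something this simple holds up on short, name-heavy queries is exactly what would need testing on real data.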
This raises another question as we add more fall-back methods to decrease the zero-result rate: how will we prioritize them? I mean, if I can re-run a "Did you mean" query, and I also know that running the original query against another wiki has a good chance of giving results, which one should I try first?
Yep, that was my point (G):
(G) Another interesting question: if we end up implementing several options for improving search results, we will have to figure out how to stage them and in what order to try/test them.
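One way to picture the staging question is an ordered fallback chain where the order is just data, so different orders can be tested against each other. A sketch with hypothetical stand-in strategies (the dictionaries below are toy data, not real indexes):

```python
def search_with_fallbacks(query, primary, fallbacks):
    """Run `primary`, then each fallback in order until one returns results.

    Returns (stage_name, results); stage_name is None if every stage was empty.
    """
    results = primary(query)
    if results:
        return "primary", results
    for name, strategy in fallbacks:
        results = strategy(query)
        if results:
            return name, results
    return None, []


# Toy stand-ins for the real index, a spelling corrector and another wiki.
INDEX = {"search": ["Search engine"]}
SPELLING = {"saerch": "search"}
OTHER_WIKI = {"recherche": ["Recherche d'information"]}


def primary(query):
    return INDEX.get(query, [])


def did_you_mean(query):
    corrected = SPELLING.get(query)
    return INDEX.get(corrected, []) if corrected else []


def cross_wiki(query):
    return OTHER_WIKI.get(query, [])


ORDER = [("did_you_mean", did_you_mean), ("cross_wiki", cross_wiki)]
print(search_with_fallbacks("saerch", primary, ORDER))
```

Logging which stage produced the results would also tell us how much each fallback contributes to reducing the zero-result rate.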
I think it's worth running [the cross-wiki] test regularly and see how results change.
I agree.
(B) Make multilingual results configurable: if we know, say, that the top four wikis likely to give good results for queries from the English wiki are Spanish, French, German, and Japanese, we could have an expanding section (excuse any UI ugliness; someone with UI smarts can help us figure out how to make it pretty, right?) to enable multilingual searching, so on English Wikipedia I could ask for “backup results” in Spanish and French, but not German and Japanese. Store those settings in a cookie for later, too, possibly with some UI indicator that multilingual backup results are enabled. (Also, if the cookie is available at query time, we could save unnecessary cross-wiki searches the user couldn’t possibly use.)
Maybe there are sensible defaults per language?
I think we can look for defaults per language in terms of where it makes sense to look based on the fact that we're likely to find something. No point looking in language X—even if the user can read it—if we never find anything in X.
But which languages to search really makes the most sense per user, doesn't it? At least for ranking. I'd much rather have a mediocre result in a language I can read than a perfect result in a language I can't read. We could limit it by where we think we'll find something based on our tests, but the user should be able to further limit results based on whether they can use them.
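The combination of per-wiki defaults and a per-user limit could be as simple as the sketch below. The defaults table is hypothetical (it just reuses the example languages from point (B)), and where the user's preference comes from (cookie or otherwise) is left out.

```python
# Hypothetical defaults: for each source wiki, the other wikis where our
# tests suggest we are likely to find something, in priority order.
DEFAULT_BACKUP_WIKIS = {"en": ["es", "fr", "de", "ja"]}


def wikis_to_search(source_wiki, user_languages=None, defaults=DEFAULT_BACKUP_WIKIS):
    """Return backup wikis for `source_wiki`, limited to languages the user reads.

    `user_languages` is None when the user has expressed no preference.
    """
    candidates = defaults.get(source_wiki, [])
    if user_languages is None:
        return list(candidates)
    # Keep the default priority order, drop what the user cannot use.
    return [lang for lang in candidates if lang in user_languages]


print(wikis_to_search("en", user_languages={"es", "fr"}))
```

Filtering before the queries are issued is what saves the cross-wiki searches the user couldn't use anyway.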
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the Wikibase ID in the index, so it should be pretty simple to extract it. Interwiki links are stored in Wikidata; could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (a high number of queries/sec on rather simple queries)?
I also thought of WDQS for this. We should ask Stas.
=Misspellings=
Reducing the prefix length to 1 char can hurt performance, and it's certainly a good idea to do this in 2 passes as Erik suggested.
Yeah, I worry about performance with prefix=1, but we can test it on a small scale and see what it costs and how much it helps.
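A sketch of the two-pass idea: try the cheaper, stricter prefix length first and only relax it when nothing comes back. The suggester here is a toy built on difflib over a made-up term list, standing in for the real one.

```python
import difflib

TERMS = ["search", "sea", "research"]  # made-up index terms


def toy_suggester(query, prefix_length):
    """Suggest terms that share the first `prefix_length` chars and look similar."""
    candidates = [
        term for term in TERMS
        if term != query and term[:prefix_length] == query[:prefix_length]
    ]
    return difflib.get_close_matches(query, candidates, n=3, cutoff=0.7)


def suggest_two_pass(query, run_suggester, strict_prefix=2, loose_prefix=1):
    """Return (suggestions, prefix_length_used), relaxing only when needed."""
    suggestions = run_suggester(query, prefix_length=strict_prefix)
    if suggestions:
        return suggestions, strict_prefix
    return run_suggester(query, prefix_length=loose_prefix), loose_prefix


print(suggest_two_pass("searhc", toy_suggester))
print(suggest_two_pass("saerch", toy_suggester))
```

Recording how often the second pass runs, and how often it finds something, would give the cost and benefit numbers for prefix=1.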
While working on prefixes I tried to analyze data from a simple wiki dump and extracted the distribution of term frequency by prefix length. I haven't managed to make good use of the data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here: https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset, but if you find it useful maybe we could try a larger one?
I will take a look! (Two caveats: I don't really have superpowers, so maybe there's not much there. I'll add it to my stack, which is getting bigger every day.)
My point here is (in the long term): maybe it's difficult to build good suggestions from data directly, so why not build a custom dictionary/index to handle "Did you mean" suggestions? According to https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from search queries to build these suggestions. Is this something worth trying?
Definitely worth trying. Erik's got a patch in for adding a session id ( https://gerrit.wikimedia.org/r/#/c/226466/ ). In addition to identifying prefix searches that come right before the user finishes typing their full text query, this would be good for looking for zero-results (or even low-results) queries followed by a similarly-spelled successful query from the same search session. "saerch" followed by "search" gives us hints that the latter is a good suggestion for the former—especially if it happens a lot.
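A rough sketch of mining those pairs, assuming we had log rows of (session id, query, hit count) in time order. The log format, the similarity measure (difflib's ratio) and the threshold are all assumptions for illustration.

```python
import difflib
from collections import Counter


def mine_corrections(log, min_similarity=0.75):
    """Count (failed_query, later_successful_query) pairs within a session.

    `log` is an iterable of (session_id, query, hit_count), time-ordered.
    """
    pairs = Counter()
    last_failed = {}  # session_id -> most recent zero-result query
    for session_id, query, hits in log:
        failed = last_failed.get(session_id)
        if hits > 0 and failed is not None:
            ratio = difflib.SequenceMatcher(None, failed, query).ratio()
            if ratio >= min_similarity:
                pairs[(failed, query)] += 1
            # A successful query ends the "failed" streak either way.
            last_failed.pop(session_id, None)
        elif hits == 0:
            last_failed[session_id] = query
    return pairs


LOG = [  # made-up rows
    ("s1", "saerch", 0), ("s1", "search", 120),
    ("s2", "saerch", 0), ("s2", "search", 118),
    ("s3", "saerch", 0), ("s3", "bananas", 40),
]
print(mine_corrections(LOG).most_common())
```

Pairs that recur across many sessions are the ones worth promoting to suggestions; a one-off pair is more likely a change of topic.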
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Would it be possible, and if so would it be desirable, to provide links to Wiktionary for single-word searches? That might be a way to provide content in the user's current language when it isn't available on the current Wikipedia.
(And thanks very much for bringing this discussion to the public list!)
Kevin Smith Agile Coach Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment. Help us make it a reality.*
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
On Thu, Jul 23, 2015 at 2:03 PM, Kevin Smith ksmith@wikimedia.org wrote:
Would it be possible, and if so would it be desirable, to provide links to Wiktionary for single-word searches? That might be a way to provide content in the user's current language when it isn't available on the current Wikipedia.
Desirable is a philosophical question, but it seems reasonable to me. Possible certainly seems possible, if it helps. Once again, what we really need to do is look through the data and see how often something that looks like this comes up.
A few more ideas to toss on the pile, some of which have potential philosophical implications. (Thanks to Moiz for inspiring these during a recent chat.)
- "trending typos"—here's the philosophical bit—do we want/need to solve all zero searches with improved search engine results, or are redirects acceptable? If they are, we could publish a list of the top zero-results searches and allow human editors to fix the ones that are obvious typos with redirects. Célia Šašić comes to mind again. One announcer repeatedly said her name like it was "Celia Sausage". I don't know if any generic search engine technique is going to take care of that. If it was the top zero-results query, though, a redirect from Celia Sausage to Célia Šašić would be helpful.
Even if we don't like redirects, we could also try to map (possibly via more computationally expensive techniques, permanently or temporarily) the top-N most common zero-results queries to the top-P most common queries (across search sessions)—similar to mapping typos to corrected typos (within a search session). This would allow us to catch trending topics that are hard to spell.
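A small sketch of the cross-session mapping, using accent folding plus difflib similarity. Both are stand-ins for whatever more expensive technique we would actually pick, and the query lists and cutoff are invented.

```python
import difflib
import unicodedata


def fold(text):
    """Lowercase and strip combining accents, e.g. for comparing typed queries."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def map_failed_to_popular(failed_queries, popular_queries, cutoff=0.8):
    """Map each zero-results query to its closest popular query, if close enough."""
    folded = {fold(p): p for p in popular_queries}
    mapping = {}
    for query in failed_queries:
        match = difflib.get_close_matches(fold(query), list(folded), n=1, cutoff=cutoff)
        if match:
            mapping[query] = folded[match[0]]
    return mapping


FAILED = ["celia sasic", "Celia Sausage", "qwzx"]   # made-up top-N
POPULAR = ["Célia Šašić", "Search engine"]          # made-up top-P
print(map_failed_to_popular(FAILED, POPULAR))
```

With this cutoff the accent-free spelling maps to the article title, while the spelled-by-ear "Celia Sausage" does not clear the threshold, which is the kind of case where a human-made redirect may still be the practical fix.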
—Trey
On Thu, Jul 23, 2015 at 12:45 PM, Trey Jones tjones@wikimedia.org wrote:
Desirable is a philosophical question, but it seems reasonable to me. Possible certainly seems possible, if it helps. Once again, what we really need to do is look through the data and see how often something that looks like this comes up.
So at this point we need to pick 1 or 2 ideas that we think have the highest chance of meeting our zero results goal and run experiments. I don't anticipate hitting the goal with our first efforts but I also don't want us to get stuck on finding the perfect solution. Half of being able to hit this goal is to identify changes, test them, and quickly iterate within the quarter.
What have we narrowed it down to, and what is their relative impact?
--tomasz
Right now I'm brainstorming as I chat with various people and documenting and discussing ideas here. My next zero-results goal is to get a hold of actual data, histogram it, and see which approach seems most promising based on how the data looks. I'm tying up loose ends on some WDQS tasks first, then, assuming my SSH access works as advertised, this will be my primary task.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Thu, Jul 23, 2015 at 3:57 PM, Tomasz Finc tfinc@wikimedia.org wrote:
On Thu, Jul 23, 2015 at 12:45 PM, Trey Jones tjones@wikimedia.org wrote:
Desirable is a philosophical question, but it seems reasonable to me. Possible certainly seems possible, if it helps. Once again, what we really need to do is look through the data and see how often something that looks like this comes up.
So at this point we need to pick 1 or 2 ideas that we think have the highest chance of meeting our zero results goal and run experiments. I don't anticipate hitting the goal with our first efforts but I also don't want us to get stuck on finding the perfect solution. Half of being able to hit this goal is to identify changes, test them, and quickly iterate within the quarter.
What have we narrowed it down to, and what is their relative impact?
--tomasz
On Thu, Jul 23, 2015 at 1:09 PM, Trey Jones tjones@wikimedia.org wrote:
Right now I'm brainstorming as I chat with various people and documenting and discussing ideas here. My next zero-results goal is to get a hold of actual data, histogram it, and see which approach seems most promising based on how the data looks. I'm tying up loose ends on some WDQS tasks first, then, assuming my SSH access works as advertised, this will be my primary task.
Great,
Let's get these into Phab to track progress.
--tomasz
On 23/07/2015 20:03, Kevin Smith wrote:
=Misspellings=
Reducing the prefix length to 1 char can hurt performance, and it's certainly a good idea to do this in 2 passes as Erik suggested.
Yeah, I worry about performance with prefix=1, but we can test it on a small scale and see what it costs and how much it helps.
I had a look at the current mapping, and it looks like (I have to check carefully first) there are two unused suggest fields (title.suggest & redirect.title.suggest) in the index. I think it's related to https://gerrit.wikimedia.org/r/#/c/118650/ & https://gerrit.wikimedia.org/r/#/c/118651/. I guess the old fields were kept so we could switch back to the old config quickly? If we confirm that these fields are unused and that prefix length is an issue, we could reclaim this unused space to add another suggest field with a reverse filter.
This requires a reindex, so it's worth checking whether prefix length is really an issue.
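One reading of the reverse-filter idea, sketched under assumptions: if terms are indexed reversed, the required exact prefix falls on the end of the word, so a typo in the first character can still be matched. The analyzer name below is hypothetical; "standard", "lowercase" and "reverse" are the stock Elasticsearch tokenizer and token filters.

```python
def reverse_suggest_settings(analyzer_name="suggest_reverse"):
    """Index settings fragment for an analyzer that reverses each token.

    `analyzer_name` is a placeholder, not an existing CirrusSearch analyzer.
    """
    return {
        "analysis": {
            "analyzer": {
                analyzer_name: {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "reverse"],
                }
            }
        }
    }


def shares_prefix(a, b, prefix_length):
    """True if both terms agree on their first `prefix_length` characters."""
    return a[:prefix_length] == b[:prefix_length]


typo, target = "zearch", "search"
# Forward terms disagree on the very first character...
print(shares_prefix(typo, target, 1))
# ...but reversed ("hcraez" vs "hcraes") they share a prefix of length 2.
print(shares_prefix(typo[::-1], target[::-1], 2))
```

Querying the forward field first and the reversed field only as a second pass would keep a prefix of 2 on both while still covering first-character typos.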
On Thu, Jul 23, 2015 at 1:28 PM, Trey Jones tjones@wikimedia.org wrote:
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the Wikibase ID in the index, so it should be pretty simple to extract it. Interwiki links are stored in Wikidata; could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (a high number of queries/sec on rather simple queries)?
I also thought of WDQS for this. We should ask Stas.
What do you mean by "interwiki link", and from where would we be requesting it? From Cirrus? Interwiki links are in the parser output of all client wikis (e.g. Wikipedias).
Otherwise, the wbgetentities API module in Wikidata also has them, and I can't see the volume of requests being a problem there.
Cheers, Katie
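For anyone following along, a sketch of the wbgetentities route: build the request for one entity's sitelink on one wiki and pull the title out of the response. No network call is made here; the sample response is hand-written in the shape the API returns, and Q42 is only a convenient example ID.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def sitelink_request_url(entity_id, site):
    """Build a wbgetentities URL asking only for the sitelink on `site`."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "sitelinks",
        "sitefilter": site,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def title_on_site(response, entity_id, site):
    """Return the page title for `site`, or None if the entity has no such link."""
    entity = response.get("entities", {}).get(entity_id, {})
    link = entity.get("sitelinks", {}).get(site)
    return link["title"] if link else None


# Hand-written sample in the shape of a wbgetentities response.
SAMPLE = {
    "entities": {
        "Q42": {
            "type": "item",
            "id": "Q42",
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Douglas Adams", "badges": []}
            },
        }
    },
    "success": 1,
}

print(sitelink_request_url("Q42", "enwiki"))
print(title_on_site(SAMPLE, "Q42", "enwiki"))
```

A None result is the "good result elsewhere, but no article back in the source wiki" case from idea (D).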
On 23/07/2015 20:25, aude wrote:
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the Wikibase ID in the index, so it should be pretty simple to extract it. Interwiki links are stored in Wikidata; could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (a high number of queries/sec on rather simple queries)?
I also thought of WDQS for this. We should ask Stas.
What do you mean by "interwiki link", and from where would we be requesting it? From Cirrus? Interwiki links are in the parser output of all client wikis (e.g. Wikipedias).
Otherwise, the wbgetentities API module in Wikidata also has them, and I can't see the volume of requests being a problem there.
Thanks. So if the test shows that it's worth performing cross-wiki searches, I conclude there's nothing preventing us from implementing this feature.
My point here is (in the long term): maybe it's difficult to build good suggestions from data directly, so why not build a custom dictionary/index to handle "Did you mean" suggestions? According to https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from search queries to build these suggestions. Is this something worth trying?
Definitely worth trying. Erik's got a patch in for adding a session id ( https://gerrit.wikimedia.org/r/#/c/226466/ ). In addition to identifying prefix searches that come right before the user finishes typing their full text query, this would be good for looking for zero-results (or even low-results) queries followed by a similarly-spelled successful query from the same search session. "saerch" followed by "search" gives us hints that the latter is a good suggestion for the former—especially if it happens a lot.
I figured out that the session ID as written in that patch isn't going to work. I'm brainstorming in https://phabricator.wikimedia.org/T106552 about ways we can get this information without generating sessions for the user.
Erik B