Pleased to announce that the dashboard now has a KPI module that should be
the first thing y'all see when you go to
http://searchdata.wmflabs.org/metrics/
- The currently-functional widgets (load time, zero results rate, api
usage) adjust their visual style to reflect good or bad changes since
yesterday.
- The bar showing the breakdown of API usage is staying here for now until
we find a better place for it.
One more thing! The individual dashboard tabs can now be linked to. So if
you need to show somebody the zero results summary page, you can navigate
to it and find a link at the bottom that you can copy and paste like this:
http://searchdata.wmflabs.org/metrics/#failure_rate
Cheers~
Mikhail
--
*Mikhail Popov* // Data Scientist, Discovery
<https://www.mediawiki.org/wiki/Wikimedia_Discovery>
https://wikimediafoundation.org/
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment.* Donate
<https://donate.wikimedia.org/>.
Moving to mobile-l, and cc'ing Search & Discovery.
---------- Forwarded message ----------
From: Dmitry Brant <dbrant(a)wikimedia.org>
Date: Wed, Jul 29, 2015 at 3:38 PM
Subject: "Morelike" suggestions - the results are in!
To: Internal communication for WMF Reading team <
reading-wmf(a)lists.wikimedia.org>
Hi all,
For the last few weeks, we've had an A/B test in the Android app where we
measure user engagement with the "read more" suggestions that we show at
the bottom of each article. We display three suggestions for further
reading, generated either by (A) a plain full-text search query based on the
title of the current article, or (B) a query using the "morelike" feature
in CirrusSearch.
And the winner is... (perhaps not entirely surprisingly) "morelike"! Users
who saw suggestions based on "morelike" were over 20% more likely to click
on one of the suggestions.
Here's a quick analysis and chart of the data from the last 10 days:
https://docs.google.com/spreadsheets/d/1BFsrAcPgexQyNVemmJ3k3IX5rtPvJ_5vdYOyGgS5R6Y/edit?usp=sharing
-Dmitry
Heyo, Discovery team!
(Analytics CCd)
This is just a quick writeup of the Scalable Event Systems meeting
that Erik, Dan, Stas and I went to (though just from my
perspective).
For people not in the initial thread, this is a proposal to replace
the internal architecture of EventLogging and similar services with
Apache Kafka brokers
(http://www.confluent.io/blog/stream-data-platform-1/). What that
means in practice is that the current 1-2k events/second limit on
EventLogging will disappear and we can stop worrying about sampling
and accidentally bringing down the system. We can be a lot less
cautious about our schemas and a lot less cautious about our sampling
rate!
It also offers up a lot of opportunities around streaming data and
making it available in a layered fashion. While I don't think we want to
explore that right now, it's nice to have as an option once we better
understand our search data and how we can safely distribute it.
I'd like to thank the Analytics team, particularly Andrew, for putting
this together; it was a super-helpful discussion to be in and this
sort of product is precisely what I, at least, have been hoping for
out of the AnEng brain trust. Full speed ahead!
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hey all,
This Friday, Trey Jones (our awesome Relevance Engineer) and I spent
some time playing detective with the sampled request logs and a list
of the most common queries resulting in zero results. We found a lot
of interesting things. In particular:
1. A common pattern in which queries, for no particular reason, had a
UNIX timestamp preceding them (example: "1436336857594:2019 FIFA
Women's World Cup"). This is responsible, on its own, for 3% of zero
results queries - and it appears to be caused by the Wikimedia Apps.
2. A search for strings in quotes followed by 'film' (example:
"\"Seventh Son\" film"). This is caused by a media player and is
responsible for around 0.5% of zero results queries.
3. A search for "quot" strings (example: " quot James Tree quot").
This is from the National Library of Australia and is again around
0.5% of zero results queries.
4. A search for a page title and the name of a page that appears as a
link within that page (example: "\"2C-T-19\" AND \"JWH-081\""). This
is about 6% of queries and appears to come from a German IP address.
We're unaware of who this person is or what they're trying, so if
anyone knows what on earth this is, we'd appreciate the hint ;).
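A rough sketch of how these four patterns could be tallied over a query log (the regexes are my approximations of the shapes described above, not the exact filters we used):

```python
import re

# Approximate patterns for the four zero-result query shapes above.
PATTERNS = {
    "timestamp_prefix": re.compile(r"^14\d{11}:"),    # e.g. 1436336857594:2019 FIFA ...
    "quoted_title_film": re.compile(r'^".+" film$'),  # e.g. "Seventh Son" film
    "quot_strings": re.compile(r"\bquot\b"),          # e.g.  quot James Tree quot
    "and_boolean": re.compile(r'^".+" AND ".+"$'),    # e.g. "2C-T-19" AND "JWH-081"
}

def classify(query):
    """Return the names of the patterns a zero-result query matches."""
    return [name for name, rx in PATTERNS.items() if rx.search(query)]
```

Running something like this over a day's sampled logs would give per-pattern counts to compare against the percentages above.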
https://phabricator.wikimedia.org/T107724 is a card representing the
need to reach out to these people, where possible (obviously this will
be easier for the app team than anyone else ;p). If we can get all of
these solved, we could drop the zero results rate for full text by
about 10%. Obviously cutting /all/ of it out is improbable, but we're
hopeful that we can drop this number and get a better understanding of
what third-party users are trying to achieve, to boot.
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hi everyone,
I've broadened my analysis from enwiki to the other larger wikis, looking
at the same phenomena I found in enwiki.
While the DOI searches are definitely an issue across 25 wikis, of the
other earlier-identified issues some are cross-wiki and some are not.
*TL;DR: After DOI searches, "unix timestamp" searches are the biggest
cross-Wikipedia issue. Weird AND queries and quot queries are big
contributors on enwiki, which makes them important overall. We could easily
fix the unix timestamp queries (either auto-correct them or make
suggestions), and we could fix lots of the quot queries. All of these could
be included in the category of "automata" that could potentially be
separated from regular queries, and it wouldn't hurt to track down their
sources and help people search better.*
The <unix-timestamp-looking number>:<wiki title> format (a small number
of which have a space after the colon) is spread across 45 wikis, with 28,089
instances out of 500K (~5.6%). More than half of the results are enwiki
(15,961), but there are 3133 on ru, 2986 on it, 1889 on ja, and hundreds on
tr, fa, nl, ar, he, hi, id, and cs. At a cursory glance, all seem to be
largely named entities or queries in the appropriate language. Removing the
"14###########:", tracking down the source, or putting this on the automata
list would help a lot.
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9%
in enwiki), and they are a mixed bag, but many (626) appear with quot, and
most (16657) are of the form
"article_title_with_underscore" AND "article title without underscores"
where the first half is repeated over and over and the second half is
something linked to in the first article. Find the source and add to the
automata list.
In plwiki (263), the AND queries are all of the form
*<musical thing>* AND (muzyk* OR Dyskografia)
where <musical thing> seems to be an artist, band, album, or something
similar. This looks like an automaton, but may not be worth pursuing.
Similarly for the ones from nl.
Globally, OR queries are much more common. 46,035 (~9.2%), spread much more
evenly over all the wikis. These are almost all the DOI queries.
quot is totally an enwiki thing. It's ~1.2% overall and ~2.8% in enwiki in
this sample, which is a lot for one small thing. We should either create a
secondary search with filtered quot or track down the source and help them
figure out how to do better.
TV episodes and films ("<title> S#E#" film) are mostly on enwiki (~1.1%
overall, ~2.4% of enwiki queries), with some on ja, fr, and de, and single
digits on it and ru. I'd count this as automata, though finding a source
would be nice.
Strings of numbers do happen everywhere, but are only common on enwiki,
with fewer on jawiki, and far fewer on de, fr, ru, vi, and nl.
My last bit of analysis will come later this week, and I'll try to look at
non-English and/or cross-wiki stuff, write it all up in Phabricator, and
move on.
On Tue, Jul 28, 2015 at 9:51 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Okay, I have a slightly better sample this morning. (I accidentally left
> out Wikipedias with abbreviations longer than 2 letters).
>
> My new sample:
> 500K zero-result full_text queries (web and API) across the Wikipedias
> with 100K+ articles
> 383,433 unique search strings (that's a long, long tail)
> The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23
> 08:55:42
> The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of
> queries
>
> Top 10 counts, for reference:
> 221618 enwiki
> 51936 dewiki
> 25500 ptwiki
> 24206 jawiki
> 21891 ruwiki
> 19913 eswiki
> 18303 itwiki
> 14443 frwiki
> 11730 zhwiki
> 7685 nlwiki
> -----
> 417225
>
> The DOI searches that appear to come from Lagotto installations hit 25
> wikis (as the Lagotto docs said they would), with en getting a lot more,
> and ru getting fewer in this sample, and the rest *very* evenly
> distributed. (I missed ceb and war before—apologies). The total is just
> over 50K queries, or >10% of the full text queries against larger wikis
> that result in zero results.
>
> ===DOI
> 6050 enwiki
> 1904 nlwiki
> 1902 cebwiki
> 1901 warwiki
> 1900 viwiki
> 1900 svwiki
> 1900 jawiki
> 1899 frwiki
> 1899 eswiki
> 1899 dewiki
> 1898 zhwiki
> 1898 ukwiki
> 1898 plwiki
> 1898 itwiki
> 1897 ptwiki
> 1897 nowiki
> 1897 fiwiki
> 1896 huwiki
> 1896 fawiki
> 1896 cswiki
> 1896 cawiki
> 1895 kowiki
> 1895 idwiki
> 1895 arwiki
> 475 ruwiki
> -----
> 50181
>
> On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>>
>> I've started looking at a 500K sample from 7/24 across all wikis. I'll
>> have more results tomorrow, but right now it's already clear that someone
>> is spamming useless DOI searches across wikis—and it's 9% of the wiki
>> zero-results queries.
>>
>>
We have a new feature for web requests that rewrites zero result queries
into a new search that might have results. I've started porting this same
feature over to API clients so it has a larger effect on our zero results
rate, but code review has turned up some indecision about whether this
should be enabled or disabled by default in the API. Either way, the
feature will be toggleable.
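As a sketch of what the toggle might look like from an API client's point of view (the parameter name here is an assumption for illustration, not necessarily the final name):

```python
import urllib.parse

# Hypothetical full-text search API call with the rewrite feature toggled
# explicitly; "srenablerewrites" is an assumed parameter name.
def build_search_url(query, enable_rewrites=False):
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    if enable_rewrites:
        params["srenablerewrites"] = 1
    return "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
```

Whichever default we pick, clients could opt in or out per request this way.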
I thought we should open this up to a larger audience, are there any
opinions?
Erik B.
On 29/07/2015 19:26, Trey Jones wrote:
> (Thoughts are cloudy with a chance of brainstorming)
>
> Hey guys I saw part of your discussion on IRC about testing whether
> reverse indexes help. I couldn’t reply there at the time, so I started
> thinking about it. This unfortunately long email is the result. (Sorry.)
No problem, I like reading your mails :)
>
> While it would be good to know how the reverse index helps on a wiki
> of more manageable size like frwiki, I wouldn’t necessarily expect the
> patterns of typos to be the same between enwiki and frwiki (or any
> other language wiki)—language phonotactics & orthography, keyboard
> layout, mobile use, and user demographics could all have an effect on
> the type and frequency of typos. So a reverse index could generally be
> useful in one language and not in another—in theory it wouldn’t hurt
> to test specifically on any large wiki where the cost of adding the
> reverse index is non-trivial.
We have some technical restrictions here: if we activate this setting
on one wiki, we'll need to reindex most of the wikis, because we have
cross-wiki searches.
wikiA can query wikiB's index; if wikiB's index is not updated with the
correct settings, the query will fail.
The cross-wiki queries I know of so far are:
- all wikis can query commons.wikimedia.org index
- itwiki will query all its sister projects (itwiktionary, itwikivoyage,
itwikibooks ...)
- maybe more
So it's hard to work with mixed settings with the current architecture :(
>
> I’m trying to think of ways to extrapolate from a sample of some sort.
> I’m spit-balling and thinking through as I type—I don’t know if any of
> these are good ideas, but maybe one will lead to a better idea.
>
> Do we know what percentage of searches (in enwiki or in general) match
> article titles? We could extract article titles and search against
> those with and without a reverse index as a test.
>
> Or, is it possible to get a reasonably sized random subset of enwiki,
> say 10-20%? If so, you could run a sample of non-zero queries against
> it and determine that, say, 47% of queries that get results on the
> full wiki also get results on this partial wiki… and then run the zero
> queries with a reverse index and extrapolate.
We can dump a subset of enwiki; the dump tool we use has a --limit
param. Unfortunately I have absolutely no idea whether the subset will be
representative. There is likely a phenomenon similar to db dumps: old
docs will be dumped first, and for Lucene old docs generally means docs
that have never been updated; in other words, it will be pages that are
not very interesting.
>
> Hmm… if none of the relevant search elements rely on anything other
> than the presence of terms in a document, then you could make a
> “compact” version of enwiki, where each document keeps only one
> instance of each word in it. A quick hacky test on a handful of medium
> to longish documents gives compression of 30-50% per document, if
> that’s enough to matter. Of course, term frequency, proximity, and
> other things would be wildly skewed—but “is it in the index?” would work.
It's a good idea, but I don't know how to dump this info; there's no easy
way to dump the index lexicon in production.
Another (similar) idea would be to dump only the fields needed for the
suggester to work.
The suggester works with title and redirect only; in theory we could
dump only these fields, which would result in something like 200MB gzip
files for enwiki. Unfortunately I don't have this option in the dump
script :(
I think it's the best way to go, but:
- we need to change the dump tool to filter a selected set of fields
- I've never tested this tool in production, so I don't know if it'll
hurt perf. I guess it's OK because it's roughly the same process as an
in-place reindex.
>
> Actually, if all you need is “is it in the index?” you could just dump
> a list of words in the index and run searches against that.
That's a bit trickier: we need to run the phrase suggester query, and it'd
be hard to simulate its behaviour. Hopefully we can run this "phrase
suggester" by hand with an Elasticsearch request.
>
> Okay… here’s an idea: tokenize the zero-result queries and search
> individual tokens against a list of terms indexed in enwiki, with and
> without a reverse index.
The suggester works with shingles (word grams of size 1, 2 and 3). Maybe
it makes sense to run the queries against the word unigrams... but this
will definitely be harder than running the elasticsearch suggest query.
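For illustration, word shingles of size 1 to 3 can be generated like this (a minimal sketch of the idea, not the suggester's actual analyzer):

```python
def shingles(text, max_size=3):
    """Generate word n-grams (shingles) of size 1..max_size, in the
    spirit of what the phrase suggester builds from title/redirect text."""
    words = text.split()
    out = []
    for n in range(1, max_size + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out
```

Running zero-result queries against just the unigrams would approximate a term-presence check, but as noted, it would miss the suggester's multi-word context.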
>
> None of these will give exact results, but various incarnations would
> give upper and lower bounds on the usefulness of the reverse index.
> For example, if only 0.05% of query tokens, in 0.07% of queries, are
> found only by the reverse index, it probably isn’t going to help. If
> 75% of them are, then it probably is.
Agreed,
To sum up, here is a reasonable process to check if the reverse field is
worth a try:
- Add an option to filter a subset of fields to dumpIndex
- Extract a subset of full text searches that returned zero result and
no suggestions (en, fr, de, it and es would be a good start?)
- Dump title and redirect fields from these wikis
- Import this data into an elasticsearch instance with the reverse field
activated (on labs?)
- Write a small script that runs phrase suggester queries
- Run the phrase suggester queries and count
Note that we will not be able to measure things like:
"search" being a better suggestion than "samech" for the query "saerch".
This seems impossible to check without human review. We could do another
run with queries where a suggestion was found and generate a diff that
will be reviewed by hand:
user_query: saerch
prod_suggestion: samech
with_reverse: search
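The "run the phrase suggester by hand" step could look roughly like the following request body (the field name "title" and the suggester settings are assumptions for a labs test index, not our production config):

```python
import json

def phrase_suggest_body(query_text, field="title"):
    """Build an Elasticsearch phrase-suggester request body (a sketch;
    field name and generator settings are assumed for a test index)."""
    return {
        "suggest": {
            "did_you_mean": {
                "text": query_text,
                "phrase": {
                    "field": field,
                    "size": 1,
                    "direct_generator": [
                        {"field": field, "suggest_mode": "missing"}
                    ],
                },
            }
        }
    }

# The body would be POSTed to <es-host>/<index>/_search (e.g. with curl
# or python-requests); here we just show the payload.
print(json.dumps(phrase_suggest_body("saerch"), indent=2))
```

The counting script would then just tally how often the with-reverse index returns a suggestion where production returned none.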
Hey everyone,
I got access to some logs and I've been slogging through the data. In
particular, I've partially analyzed a sample of 100K zero-result full_text
searches against enwiki, over the course of about an hour (2015-07-23
07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.
*TL;DR Summary: If these patterns hold for another sample (and across
languages), we should be able to get some decent mileage out of these
simple approaches:*
* - find sources of weird patterns and either ignore them, or contact the
source and redirect them to a more appropriate destination*
* - use language or character set detection to redirect queries to other
wikis*
* - filter the term "quot" from queries*
* - filter 14###########: from the front of queries*
* - replace _ with space in queries*
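The last three fixes are mechanical enough to sketch in a few lines (a rough cleanup pass over the patterns named above, not a production normalizer; the exact timestamp shape is assumed from the observed examples):

```python
import re

def clean_query(q):
    """Apply the three mechanical fixes above (a sketch): strip a leading
    14-prefixed epoch-ms timestamp + colon, drop stray "quot" tokens,
    and replace underscores with spaces."""
    q = re.sub(r"^14\d{11}:\s*", "", q)      # 1436755654740:Sherlock Holmes -> Sherlock Holmes
    q = re.sub(r"\s*\bquot\b\s*", " ", q)    #  quot James Tree quot -> James Tree
    q = q.replace("_", " ")                  # Sherlock_Holmes -> Sherlock Holmes
    return q.strip()
```

A cleaned query could then be retried before we report a zero result.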
All of this is somewhat rough, and exact numbers aren't guaranteed. Also,
the categories may overlap. I intend to look for these same patterns in
another sample from a different day to make sure they are more general
and not just temporary idiosyncrasies. I also plan to look through
other language wikis (e.g., Spanish and French to start) to see if there
are cross-linguistic patterns like these.
I think we have to somehow come to terms with the fact that some queries
don't deserve results, and maybe figure out the source of such
"illegitimate" queries and filter them. (I'd really like to be able to
track down the referrer, if there is one, for a lot of the weirder queries.)
Top query:
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.
Some other categories of queries are below. The numbers are "<total
queries> / <unique queries>". Since this is a 100K sample of zero-result
queries, and zero-results are about 25% of all results, each 1,000 of total
queries here represents about 0.25% of all search queries.
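Spelled out, the arithmetic above is:

```python
# 100K zero-result queries; zero-results are ~25% of all queries,
# so the sampled window saw roughly 400K queries in total.
sample_size = 100_000
zero_result_rate = 0.25
total_queries = sample_size / zero_result_rate   # ~400,000

# Hence 1,000 queries in this sample ~= 0.25% of all search queries.
share_of_all = 1_000 / total_queries
print(f"{share_of_all:.2%}")   # 0.25%
```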
253 / 171 string of numbers
3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, doi #s (see below), Arabic,
Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some
emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw
queries in other Latin-alphabet languages go by, too.
2630 / 2627 DOIs, all in quotes
3015 / 1017 have quot in them (which gets auto-corrected to "quote",
obviously)
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we strip
"quot" we would get many of these. This must be coming from some source
that is adding quotes, escaping them as &quot;, and then stripping the &
and the ;. Weird.
7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles!
- Two each of 0:... and 6000:...
114 / 85 actual http(s):// URLs
488 / 244 URL-like things starting with www... and ending with .com, .ru,
etc.
211 / 132 other searches starting with “www.”
1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')
2457 / 2060 TV episodes (based on the presence of "S#E#"—that's season #,
episode #)
8419 / 7523 AND boolean searches
703 / 701 OR boolean searches
- Many of these look auto-generated, esp in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries
6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries
Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in
multiple languages)
- lots of "building up" searches (and these are all marked full_text), for
example:
achevm
achevme
achevmen
achevment
achevments
achevments o
achevments of
achevments of
achevments of h
achevments of he
achevments of hell
achevments of helle
achevments of hellen
achevments of hellen k
achevments of hellen k
achevments of hellen kell
achevments of hellen kelle
achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results
Finally, I did see a bunch of typos, but I didn't try to quantify them
because I was digging into all of these other interesting patterns.
Have a good weekend.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
ebernhardson(a)wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 22 July 2015 at 10:55, David Causse <dcausse(a)wikimedia.org> wrote:
>> On 22/07/2015 15:21, Oliver Keyes wrote:
>>>
>>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>>> good to be CCd :).
>>>
>>> FWIW, this kind of detail on the specific things we're doing is
>>> missing from the main search mailing list and could be used very much
>>> there to inform people.
>>
>>
>> I agree, my intent right now is still to learn from each other and
>> build/use a friendly environment where engineers with an NLP background
>> like Trey can work efficiently. When things are clearer, it'd be great to
>> share our plan.
>>
>>>
>>> Oliver is already handling the executor IDs and distinguishing full
>>> and prefix search, so nyah ;p.
>>
>> Great!
>>
>> Just to be sure: does this mean that a search count will be reduced to
>> its executorID:
>> - all requests with the same executorID return zero results -> add 1 to
>> the zero result counter
>> - if one of the requests returns a result -> do not increment the zero
>> result counter
>> If yes, I think this will be the killer patch for Q1 :)
>>
>
> Executor IDs are stored and if a match is found in executor IDs <=120
> seconds after that one, the later outcome is considered "the outcome".
> If not, we assume no second round-trip was made and so go with
> whatever happened first.
>
> So if you make a request and it round-trips once and fails, failure.
> Round-trip once and succeeds, success. Round-trip twice and fail both
> times, failure. Round-trip twice and fail the first time and succeed
> the second - one success, zero failures :). Erik wrote it, and I grok
> the logic.
>
>>> On the language detection - actually
>>> Kolkus and Rehurek published a work in 2009 that handles small amounts
>>> of text really really well (n-gram based approaches /suck at this/)
>>> and there's a Java implementation I've been playing with. Want me to
>>> run it across some search strings and we can look at the results? Or
>>> just send the code across.
>>
>> If you ask I'd say both! ;)
>>
>> We evaluated this kind of dictionary-based language detection (but not
>> this one specifically); the problem for us was mostly due to performance:
>> it takes time to tokenize the input string correctly, and the dictionary
>> we used was rather big. But we worked mainly on large content (web news,
>> press articles).
>> In our case input strings should be very small, so it makes more sense. We
>> should be able to train the dictionary against the "all titles in ns0"
>> dumps though.
>>
>> This is also a great example to explain why I feel stuck sometimes:
>> How will you be able to test it?
>> - I'm not allowed to download search logs locally.
>> - I think I won't be able to install Java and play with these kinds of
>> tools on fluorine.
>>
>
> Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
> right? If yes to all three, I don't see a problem with me squirting
> you a sample of logs (and the Java). I figure if we find the
> methodology works we can look at speedups to the code, which is a lot
> easier a task than looking at fast code and trying to improve the
> methodology.
>
>> Another point:
>> concerning the following tasks described below, I think it overlaps with
>> analytics tasks (because it's mainly related to learning from search logs).
>> I don't know how you work today, and maybe this is something you've already
>> done or is obviously wrong.
>> I think you're one of the best people today to help us sort this out, so
>> your feedback concerning the following lines will be greatly appreciated :)
>>
>> Thanks!
>
> Yes! Okay, thoughts on the below:
>
> 1. Build a search log parser - we sort of have that through the
> streaming python script. It depends whether you mean a literal parser
> or something to pick out all the "important" bits. See point 4.
> 2. Big machine: I'd love this. But see point 4.
> 3. Improve search logs for us: when we say improve for us do we mean
> for analytics/improvements purposes? Because if so we've been talking
> about having the logs in HDFS which would make things pretty easy for
> all and sundry and avoid the need for a parser.
>
> One way of neatly handling all of this would be:
>
> 1. Get the logs in a format that has the fields we want and stream it
> into Hadoop. No parser necessary.
> 2. Stick the big-ass machine in the analytics cluster, where it has
> default access to Hadoop and can grab data trivially, but doesn't have
> to break anyone else's stuff.
> 3. Fin.
>
> What am I missing? Other than "setting up a MediaWiki kafka client is
> going to be kind of a bit of work".
>
>>>>
>>>> On 22/07/2015 14:38, David Causse wrote:
>>>>>
>>>>> It's still not very clear in my mind, but things could look like:
>>>>>
>>>>> * Epic: Build a toolbox to learn from search logs
>>>>>     - Create a script to run search queries against the production
>>>>> index
>>>>>     - Build a search log parser that provides all the needed details:
>>>>> time, search type, wiki origin, target search index, search query,
>>>>> search query ID, number of results, offset of the results (search page)
>>>>>         (side note: Erik, will it be possible to pass the queryID from
>>>>> page to page when the user clicks "next page"?)
>>>>>     - Have a decent machine (64GB RAM would be great) in the production
>>>>> cluster where we can
>>>>> - download production search logs
>>>>> - install the tools we want
>>>>> - stress it not being afraid to kill it
>>>>> - do all the stuff we want to learn from data and search logs
>>>>>
>>>>> * Epic: Improve search logs for us
>>>>>     - Add an "incognito parameter" to cirrus that could be used by the
>>>>> toolbox script so as not to pollute our search logs when running our
>>>>> "search script".
>>>>>     - Add a log when the user clicks on a search result, to have a
>>>>> mapping between the queryID, the result chosen, and the offset of the
>>>>> chosen link in the result list.
>>>>>         - This task is certainly complex and highly depends on the
>>>>> client; I don't know if we will be able to track this down on all
>>>>> clients, but it'd be great for us.
>>>>> - More things will be added as we learn
>>>>>
>>>>> * Epic: start to measure and control relevance
>>>>>     - Create a corpus of search queries for each wiki with their
>>>>> expected results
>>>>>     - Run these queries weekly/monthly and compute the F1 score for
>>>>> each wiki
>>>>>     - Continuously enhance the search query corpus
>>>>>     - Provide a weekly/monthly perf score for each wiki
>>>>>
>>>>> As you can see, this is mostly about tools. I propose to start with
>>>>> batch tools and think later about how we could make this more real-time.
>>>>>
>>>>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
--
Oliver Keyes
Research Analyst
Wikimedia Foundation