Hey everyone,
I got access to some logs and I've been slogging through the data. In
particular, I've partially analyzed a sample of 100K zero-result full_text
searches against enwiki, over the course of about an hour (2015-07-23
07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.
TL;DR Summary: If these patterns hold for another sample (and across
languages), we should be able to get some decent mileage out of these
simple approaches:
- find sources of weird patterns and either ignore them, or contact the
source and redirect them to a more appropriate destination
- use language or character set detection to redirect queries to other
wikis
- filter the term "quot" from queries
- filter 14###########: from the front of queries
- replace _ with space in queries
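For the last three bullets, the cleanup pass could be as simple as
something like this (a rough sketch only; I haven't run exactly this):

import re

def clean_query(q):
    # strip the mysterious 14###########: prefix (more on it below)
    q = re.sub(r'^14\d{11}:', '', q)
    # drop stray "quot" tokens (more on where those come from below)
    q = re.sub(r'\bquot\b', ' ', q)
    # treat underscores as spaces, as in wiki page titles
    q = q.replace('_', ' ')
    return ' '.join(q.split())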
All of this is somewhat rough, and exact numbers aren't guaranteed; the
categories may also overlap. I intend to look for these same patterns in
another sample from a different day, to make sure they are general and not
just temporary idiosyncrasies. I also plan to look through other language
wikis (e.g., Spanish and French to start) to see if there are
cross-linguistic patterns like these.
I think we have to somehow come to terms with the fact that some queries
don't deserve results, and maybe figure out the source of such
"illegitimate" queries and filter them. (I'd really like to be able to
track down the referrer, if there is one, for a lot of the weirder queries.)
Top query:
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.
Some other categories of queries are below. The numbers are "<total
queries> / <unique queries>". Since this is a 100K sample of zero-result
queries, and zero-result queries are about 25% of all queries, each 1,000
total queries here is 1% of the sample, or about 0.25% of all search
queries.
253 / 171 strings of numbers
3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, doi #s (see below), Arabic,
Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some
emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw
queries in other Latin-alphabet languages go by, too.
2630 / 2627 DOIs, all in quotes
3015 / 1017 have quot in them (which gets auto-corrected to "quote",
obviously)
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we
stripped "quot", many of these would return results. They must be coming
from some source that is adding quotes, escaping them as "&quot;", and
then stripping the "&" and ";". Weird. (See the sketch after this list.)
7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles! (See the sketch after this list.)
- Two each of 0:... and 6000:...
114 / 85 actual http(s):// URLs
488 / 244 URL-like things starting with www... and ending with .com, .ru,
etc.
211 / 132 other searches starting with “www.”
1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')
2457 / 2060 TV episodes (based on the presence of "S#E#"—that's season #,
episode #)
8419 / 7523 AND boolean searches
703 / 701 OR boolean searches
- Many of these look auto-generated, especially in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries
6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries
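For anyone who wants to poke at this themselves, here is roughly the kind
of pattern matching involved; these are illustrative approximations, not
the exact expressions I used:

import re, html
from datetime import datetime, timezone

# rough stand-ins for the categories above
PATTERNS = {
    'numbers':    re.compile(r'^[\d\s.,-]+$'),
    'doi':        re.compile(r'\b10\.\d{4,9}/\S+'),
    'url':        re.compile(r'^https?://'),
    'www':        re.compile(r'^www\.'),
    'tv_episode': re.compile(r'\bS\d{1,2} ?E\d{1,2}\b', re.IGNORECASE),
    'ts_prefix':  re.compile(r'^(14\d{11}):(.+)'),
}

def categorize(query):
    return [name for name, pat in PATTERNS.items() if pat.search(query)]

# The 14###########: prefixes look like millisecond Unix timestamps:
m = PATTERNS['ts_prefix'].match('1436755654740:Sherlock Holmes')
print(datetime.fromtimestamp(int(m.group(1)) / 1000, tz=timezone.utc))
# -> 2015-07-13 02:47:34.740000+00:00

# And the "quot" queries are consistent with quotes being HTML-escaped
# and then having "&" and ";" stripped:
print(html.escape('"some title"', quote=True).replace('&', ' ').replace(';', ' '))
# -> ' quot some title quot '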
Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in
multiple languages)
- lots of "building up" searches (and these are all marked full_text), for
example:
achevm
achevme
achevmen
achevment
achevments
achevments o
achevments of
achevments of
achevments of h
achevments of he
achevments of hell
achevments of helle
achevments of hellen
achevments of hellen k
achevments of hellen k
achevments of hellen kell
achevments of hellen kelle
achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results
Finally, I did see a bunch of typos, but I didn't try to quantify them
because I was digging into all of these other interesting patterns.
Have a good weekend.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
ebernhardson(a)wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 22 July 2015 at 10:55, David Causse <dcausse(a)wikimedia.org> wrote:
>> On 22/07/2015 15:21, Oliver Keyes wrote:
>>>
>>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>>> good to be CCd :).
>>>
>>> FWIW, this kind of detail on the specific things we're doing is
>>> missing from the main search mailing list, and sharing it there would
>>> do a lot to inform people.
>>
>>
>> I agree. My intent right now is still to learn from each other and
>> build/use a friendly environment where engineers with an NLP background
>> like Trey can work efficiently. When things are clearer it'd be great
>> to share our plan.
>>
>>>
>>> Oliver is already handling the executor IDs and distinguishing full
>>> and prefix search, so nyah ;p.
>>
>> Great!
>>
>> Just to be sure: does this mean that a search count will be reduced to
>> its executorID:
>> - if all requests with the same executorID return zero results -> add
>> 1 to the zero-result counter
>> - if one of the requests returns a result -> do not increment the
>> zero-result counter
>> If yes, I think this will be the killer patch for Q1 :)
>>
>
> Executor IDs are stored and if a match is found in executor IDs <=120
> seconds after that one, the later outcome is considered "the outcome".
> If not, we assume no second round-trip was made and so go with
> whatever happened first.
>
> So if you make a request and it round-trips once and fails, failure.
> Round-trip once and succeeds, success. Round-trip twice and fail both
> times, failure. Round-trip twice and fail the first time and succeed
> the second - one success, zero failures :). Erik wrote it, and I grok
> the logic.
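>
> In code terms, roughly (hypothetical field names, not the actual log
> schema):
>
> def final_outcomes(requests):
>     # requests: dicts with 'executor_id', 'ts' (epoch seconds) and
>     # 'hits', sorted by timestamp
>     outcome = {}  # executor_id -> (ts, hits)
>     for r in requests:
>         prev = outcome.get(r['executor_id'])
>         if prev is None or r['ts'] - prev[0] <= 120:
>             # first sighting, or a second round-trip within 120s
>             # (in which case the later outcome wins)
>             outcome[r['executor_id']] = (r['ts'], r['hits'])
>     return outcome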
>
>>> On the language detection - actually
>>> Kolkus and Rehurek published a work in 2009 that handles small amounts
>>> of text really really well (n-gram based approaches /suck at this/)
>>> and there's a Java implementation I've been playing with. Want me to
>>> run it across some search strings and we can look at the results? Or
>>> just send the code across.
>>
>> If you ask I'd say both! ;)
>>
>> We evaluated this kind of dictionary-based language detection (though
>> not this one specifically); the problem for us was mostly performance:
>> it takes time to tokenize the input string correctly, and the
>> dictionary we used was rather big. But we worked mainly on large
>> content (web news, press articles).
>> In our case the input strings will be very small, so it makes more
>> sense. We should be able to train the dictionary against the "all
>> titles in ns0" dumps, though.
>>
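>> For illustration, this kind of word-relevance scoring might look
>> roughly like this (weights trained from the title dumps; a sketch of
>> the general dictionary method, not the exact published algorithm):
>>
>> def detect_language(query, weights):
>>     # weights: {language: {word: relevance}}, built from the
>>     # "all titles in ns0" dumps
>>     scores = {lang: sum(vocab.get(w, 0.0) for w in query.lower().split())
>>               for lang, vocab in weights.items()}
>>     best = max(scores, key=scores.get)
>>     return best if scores[best] > 0 else None  # None = no confident guess
>>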
>> This is also a great example to explain why I feel stuck sometimes:
>> how will you be able to test it?
>> - I'm not allowed to download search logs locally.
>> - I think I won't be able to install Java and play with this kind of
>> tool on fluorine.
>>
>
> Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
> right? If yes to all three, I don't see a problem with me squirting
> you a sample of logs (and the Java). I figure if we find the
> methodology works we can look at speedups to the code, which is a lot
> easier a task than looking at fast code and trying to improve the
> methodology.
>
>> Another point:
>> concerning the tasks described below, I think this overlaps with
>> analytics work (because it's mainly about learning from search logs).
>> I don't know how you work today; maybe this is something you've
>> already done, or it's obviously wrong.
>> I think you're one of the best people to help us sort this out, so
>> your feedback on the following lines will be greatly appreciated :)
>>
>> Thanks!
>
> Yes! Okay, thoughts on the below:
>
> 1. Build a search log parser - we sort of have that through the
> streaming python script. It depends whether you mean a literal parser
> or something to pick out all the "important" bits. See point 4.
> 2. Big machine: I'd love this. But see point 4.
> 3. Improve search logs for us: when we say improve for us do we mean
> for analytics/improvements purposes? Because if so we've been talking
> about having the logs in HDFS which would make things pretty easy for
> all and sundry and avoid the need for a parser.
>
> One way of neatly handling all of this would be:
>
> 1. Get the logs in a format that has the fields we want and stream it
> into Hadoop. No parser necessary.
> 2. Stick the big-ass machine in the analytics cluster, where it has
> default access to Hadoop and can grab data trivially, but doesn't have
> to break anyone else's stuff.
> 3. Fin.
>
> What am I missing? Other than "setting up a MediaWiki kafka client is
> going to be kind of a bit of work".
>
>>>>
>>>> On 22/07/2015 14:38, David Causse wrote:
>>>>>
>>>>> It's still not very clear in my mind, but things could look like this:
>>>>>
>>>>> * Epic: Build a toolbox to learn from search logs
>>>>> - Create a script to run search queries against the production
>>>>> index
>>>>> - Build a search log parser that provides all the needed details:
>>>>> time, search type, wiki origin, target search index, search query,
>>>>> search query ID, number of results, offset of the results (search
>>>>> page)
>>>>> (side note: Erik, will it be possible to pass the queryID from
>>>>> page to page when the user clicks "next page"?)
>>>>> - Have a decent machine (64GB RAM would be great) in the
>>>>> production cluster where we can
>>>>> - download production search logs
>>>>> - install the tools we want
>>>>> - stress it not being afraid to kill it
>>>>> - do all the stuff we want to learn from data and search logs
>>>>>
>>>>> * Epic: Improve search logs for us
>>>>> - Add an "incognito parameter" to cirrus that could be used by
the
>>>>> toolbox script not to pollute our search logs when running our "search
>>>>> script".
>>>>> - Add a log when the user click on a search result to have a
>>>>> mapping
>>>>> between the queryID, the result choosen and the offset of the chosen
>>>>> link in
>>>>> the result list.
>>>>> - This task is certainly complex and highly depends on the
>>>>> client,
>>>>> I don't know if we will be able to track this down on all clients but
>>>>> it'd
>>>>> be great for us.
>>>>> - More things will be added as we learn
>>>>>
>>>>> * Epic: start to measure and control relevance
>>>>> - Create a corpus of search queries for each wiki with their
>>>>> expected results
>>>>> - Run these queries weekly/monthly and compute the F1-score for
>>>>> each wiki (a rough sketch of the scoring follows this list)
>>>>> - Continuously enhance the search query corpus
>>>>> - Provide a weekly/monthly perf score for each wiki
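>>>>>
>>>>> Something like this, maybe (set-based sketch; assumes we store
>>>>> expected page titles per query):
>>>>>
>>>>> def f1(expected, returned):
>>>>>     hits = len(set(expected) & set(returned))
>>>>>     if hits == 0:
>>>>>         return 0.0
>>>>>     p = hits / len(returned)   # precision
>>>>>     r = hits / len(expected)   # recall
>>>>>     return 2 * p * r / (p + r)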
>>>>>
>>>>> As you can see this is mostly about tools; I propose to start
>>>>> with batch tools and think later about how we could make this
>>>>> more real-time.
>>>>>
>>>>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hey all,
So, the data for the Search dashboards
(http://searchdata.wmflabs.org/metrics/) comes from a variety of
sources, one of which is the daily logs of all Cirrus search requests
- about 46GB of data a day. We set up a pipeline over this to report
the "zero rate" - how many queries return zero results. It was a
pretty shaky pipeline, because it was an ultra-urgent,
need-it-for-a-presentation thing.
Good news: my prediction that it needed work was accurate. Bad news:
my prediction that it needed work was accurate ;).
When Erik and I went through all of the scripts and rewrote
them on the 15th we discovered a lot of maintenance tasks that were
being identified as searches. These are now being excluded, but we
have to backfill 1.5 months of data. I've chosen to eliminate the old
data and then backfill, because it means we avoid having data from
multiple, dissonant software versions, and because it just makes the
backfilling task a bit easier.
As a result, the dashboards may look a bit odd over the next couple of
days; they have data from the 15th onwards that we're comfortable with,
and we are gradually backfilling 1 June to 14 July, starting from 1
June. So at the moment we have 1 June and 15-21 July; next it will be
1-2 June and 15-22 July, and so on.
Expect the graphs to get steadily less weird until they're back to
normal (and more consistent and sane-looking than before). Until then:
yeah, they're going to look a bit weird.
Thanks,
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
In our recent (July) team retrospective, we didn't have a chance to review
the action items that came out of our June retrospective. However, I have
posted those previous items, with status updates (as best I know them)[1].
Of the 18 items, 5 are "done", and several others are improved or in
progress.
[1]
https://www.mediawiki.org/wiki/Wikimedia_Search_Team/Retrospective_2015-07-…
That page will also contain our July retrospective notes, after they have
been processed.
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
I'm having trouble enabling the analytics role on vagrant. Does this mean
anything to anyone?
==> default: Error: Puppet::Parser::AST::Resource failed with error
ArgumentError: Could not find declared class ::cdh::hadoop at
/vagrant/puppet/modules/role/manifests/hadoop.pp:45
on node mediawiki-vagrant.dev
I even tried vagrant destroying, and starting from scratch. It seems like
maybe I need to apt-get install something Hadoop related, but my Google-fu
isn't helping.
We had a meeting today with Giuseppe and Andrew from Ops, and clarified our
path toward getting WDQS deployed in production (as a test service). Here
are the takeaways/action items I'm aware of:
1. We need to specify our hardware needs ASAP
---> I think this means we should unstall
https://phabricator.wikimedia.org/T86561 and assign it to Stas.
2. Most likely the service will run on existing hardware (and ops will want
to deploy it in both data centers)
3. Debian packaging is not required--we'll use maven+archiva+git deploy (?)
4. Andrew can help Stas with archiva (which Stas and Nik have already used)
5. Giuseppe can help Stas with puppet, which should be pretty easy
6. The puppet work should include basic health and performance monitoring
7. Stas will consider using jmx for additional logging
Full notes of the meeting are here:
http://etherpad.wikimedia.org/p/DiscoveryOpsWDQS
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
If the query returned 0 results and didn't have any syntax in it (no
intitle:foo), should we try _harder_ to get suggestions? I don't know
exactly what changes that would mean, but we can totally implement the
retry if we think it'll help.
The idea is that it might not be performant enough to run super-duper
strong suggester settings all the time, but when there are no results
it's important to have suggestions.
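Something like this, maybe (hypothetical names; run_search stands in
for the real Cirrus call):

import re

SYNTAX = re.compile(r'\w+:|["*~]')  # crude check for intitle:, quotes, etc.

def run_search(query, suggest_profile):
    # placeholder for the real search call; returns (results, suggestion)
    raise NotImplementedError

def search(query):
    results, suggestion = run_search(query, suggest_profile='cheap')
    if not results and not SYNTAX.search(query):
        # no hits and no special syntax: pay for the strong suggester once
        _, suggestion = run_search(query, suggest_profile='strong')
    return results, suggestion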
For reference, only 20% of the zero-result queries I counted this morning
returned a suggestion. I don't know how many asked for one, though.
Hi all,
As a reminder, all[1] of your Discovery-related research and coding work
should be tracked in phabricator. During our Tuesday/Thursday standups,
most of what you talk about should be tasks on one of the "sprint"
workboards. If you are working on a task that isn't in the sprint board,
please a) re-check to be sure that is the highest priority thing you should
be working on, and b) if it is, add it to phabricator and/or to the sprint
board as needed.
When you pick what to work on, try to grab something from near the top of
the sprint's Backlog column, and move it to In Progress. Please use the
Needs Review column as needed, and when the task is really done, move it to
Done.
Each sub-team should be focused on its quarterly goal. Please be sure that
Dan is aware of any work you do outside that. If you have any questions,
check with him, me, or a team lead.
[1] If you do a 15-minute task here or there, it doesn't need to be
tracked in phab. But any substantive work should be. Personally I would
set the threshold at about an hour, but your mileage may vary.
Thanks much!
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
Let's say, hypothetically, that I wanted to measure information about HTTP
requests coming into the Wikipedia Portal (www.wikipedia.org).
* Do we record this information?
* If so, is it accessible via analytical tools?
* If so, how do I get my mitts on it?
* If not, is it accessible from a database or similar?
Context: https://phabricator.wikimedia.org/T100673
In our neverending march towards progress I've created a phabricator task
<https://phabricator.wikimedia.org/T103598> to upgrade beta to
Elasticsearch 1.6.0. That requires a few things:
* Release our plugins to archiva
* Propose a patch to upgrade to those new versions
* Manually land the patch in beta and sync those versions of the plugins
* On every Elasticsearch node (deployment-elastic0[5678]) download the
elasticsearch 1.6 package, install it, and restart elasticsearch.
It's not a ton of work, but in our effort to get non-Nik people used to
doing Elasticsearch maintenance I'd love for someone else to grab it. In
our effort to upgrade to 1.6 soon, it'd be cool if someone could grab it
in the next few days. We need at least a week of beta testing 1.6.0
before we upgrade production, just to be sure.
So, anyone want to do it? I don't expect you'll need special permissions
that are hard to get, because it's beta. We can grant you whatever
permissions you lack in just a few minutes.
Nik