In the team leads sync today it was suggested that we consider using Google
Calendar to track releases.
In addition to this, we have several Phabricator boards, plus pages on
Meta, Office, and Tech, plus some other stuff I'm probably forgetting.
It's hard to keep track of everything that's going on, and it's hard to
keep individual work in context of both short-term and long-term plans.
Ideally (in the spirit of lessons learned from the book Getting Things
Done), there would be a single resource that we trust to encapsulate
everything we need to know about Discovery from an engineering standpoint.
Could we feasibly create such a dashboard?
Alternatively (much more realistically), where might we create a root-level
landing page to organize links to all the various tools that we use?
Moving to mobile-l, and cc'ing Search & Discovery.
---------- Forwarded message ----------
From: Dmitry Brant <dbrant(a)wikimedia.org>
Date: Wed, Jul 29, 2015 at 3:38 PM
Subject: "Morelike" suggestions - the results are in!
To: Internal communication for WMF Reading team <
For the last few weeks, we've had an A/B test in the Android app where we
measure user engagement with the "read more" suggestions that we show at
the bottom of each article. We display three suggestions for further
reading, based on either (A) a plain full-text search for the title of the
current article, or (B) a query using the "morelike" feature.
And the winner is... (perhaps not entirely surprisingly) "morelike"! Users
who saw suggestions based on "morelike" were over 20% more likely to click
on one of the suggestions.
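For anyone who wants to poke at the two variants themselves, here's a rough
sketch against the public search API ("morelike:" is the CirrusSearch keyword;
this is not the app's actual code, and the app's real requests may differ):

# Sketch only: fetch "read more" candidates for an article using
# (A) a plain full-text search for the title vs. (B) a "morelike:" query.
import requests

API = "https://en.wikipedia.org/w/api.php"

def read_more_candidates(title, variant="morelike", limit=3):
    srsearch = "morelike:" + title if variant == "morelike" else title
    params = {
        "action": "query",
        "list": "search",
        "srsearch": srsearch,
        "srlimit": limit + 1,  # one extra in case the current article comes back
        "format": "json",
    }
    hits = requests.get(API, params=params).json()["query"]["search"]
    return [h["title"] for h in hits if h["title"] != title][:limit]

print(read_more_candidates("Sherlock Holmes", variant="fulltext"))  # variant A
print(read_more_candidates("Sherlock Holmes", variant="morelike"))  # variant B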
Here's a quick analysis and chart of the data from the last 10 days:
I've broadened my analysis from enwiki to the other larger wikis, looking
at the same phenomena I found in enwiki.
While the DOI searches are definitely an issue across 25 wikis, some of the
other earlier-identified issues are cross-wiki and some are not.
*TL;DR: After DOI searches, "unix timestamp" searches are the biggest
cross-Wikipedia issue. Weird AND queries and quot queries are big
contributors on enwiki, which makes them important overall. We could easily
fix the unix timestamp queries (either auto-correct them or make suggestions),
and we could fix lots of the quot queries. All of these could be included
in the category of "automata" that could potentially be separated from
regular queries, and it wouldn't hurt to track down their sources and help
people search better.*
The <unix-timestamp-looking number>:<wiki title> format (a small number of
which have a space after the colon) is spread across 45 wikis, with 28,089
instances out of 500K (~5.6%). More than half of the results are enwiki
(15,961), but there are 3133 on ru, 2986 on it, 1889 on ja, and hundreds on
tr, fa, nl, ar, he, hi, id, and cs. At a cursory glance, all seem to be
largely named entities or queries in the appropriate language. Removing the
"14###########:", tracking down the source, or putting this on the automata
list would help a lot.
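If we go the strip-the-prefix route, the fix could be as small as this sketch
(the prefix shape, a 13-digit millisecond timestamp starting with "14" followed
by a colon and optional space, is assumed from the examples above):

# Assumed prefix shape: 13-digit millisecond timestamp, colon, optional space.
import re

TIMESTAMP_PREFIX = re.compile(r"^\s*14\d{11}:\s*")

def strip_timestamp_prefix(query):
    """Remove a leading <unix-timestamp-in-ms>: prefix, if present."""
    return TIMESTAMP_PREFIX.sub("", query)

assert strip_timestamp_prefix("1436755654740:Sherlock Holmes") == "Sherlock Holmes"
assert strip_timestamp_prefix("Sherlock Holmes") == "Sherlock Holmes"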
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9%
in enwiki), and they are a mixed bag, but many (626) appear with quot, and
most (16657) are of the form
"article_title_with_underscore" AND "article title without underscores"
where the first half is repeated over and over and the second half is
something linked to in the first article. Find the source and add it to the automata list.
In plwiki (263), the AND queries are all of the form
*<musical thing>* AND (muzyk* OR Dyskografia)
where <musical thing> seems to be an artist, band, album, or something
similar. This looks like an automaton, but may not be worth pursuing.
Similarly for the ones from nl.
Globally, OR queries are much more common: 46,035 (~9.2%), spread much more
evenly over all the wikis. These are almost all the DOI queries.
quot is totally an enwiki thing. It's ~1.2% overall and ~2.8% in enwiki in
this sample, which is a lot for one small thing. We should either create a
secondary search with filtered quot or track down the source and help them
figure out how to do better.
TV episodes and films ("<title> S#E#" film) are mostly on enwiki (~1.1%
overall, ~2.4% of enwiki queries), with some on ja, fr, and de, and single
digits on it and ru. I'd count this as automata, though finding a source
would be nice.
Strings of numbers do happen everywhere, but are only common on enwiki,
with fewer on jawiki, and far fewer on de, fr, ru, vi, and nl.
My last bit of analysis will come later this week; I'll try to look at
non-English and/or cross-wiki stuff and write it all up in Phabricator.
On Tue, Jul 28, 2015 at 9:51 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Okay, I have a slightly better sample this morning. (I accidentally left
> out Wikipedias with abbreviations longer than 2 letters).
> My new sample:
> 500K zero-result full_text queries (web and API) across the Wikipedias
> with 100K+ articles
> 383,433 unique search strings (that's a long, long tail)
> The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23
> The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of the queries.
> Top 10 counts, for reference:
> 221618 enwiki
> 51936 dewiki
> 25500 ptwiki
> 24206 jawiki
> 21891 ruwiki
> 19913 eswiki
> 18303 itwiki
> 14443 frwiki
> 11730 zhwiki
> 7685 nlwiki
> The DOI searches that appear to come from Lagotto installations hit 25
> wikis (as the Lagotto docs said they would), with en getting a lot more,
> and ru getting fewer in this sample, and the rest *very* evenly
> distributed. (I missed ceb and war before—apologies). The total is just
> over 50K queries, or >10% of the full text queries against larger wikis
> that result in zero results.
> 6050 enwiki
> 1904 nlwiki
> 1902 cebwiki
> 1901 warwiki
> 1900 viwiki
> 1900 svwiki
> 1900 jawiki
> 1899 frwiki
> 1899 eswiki
> 1899 dewiki
> 1898 zhwiki
> 1898 ukwiki
> 1898 plwiki
> 1898 itwiki
> 1897 ptwiki
> 1897 nowiki
> 1897 fiwiki
> 1896 huwiki
> 1896 fawiki
> 1896 cswiki
> 1896 cawiki
> 1895 kowiki
> 1895 idwiki
> 1895 arwiki
> 475 ruwiki
> On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
>> I've started looking at a 500K sample from 7/24 across all wikis. I'll
>> have more results tomorrow, but right now it's already clear that someone
>> is spamming useless DOI searches across wikis—and it's 9% of the wiki
>> zero-results queries.
We have a new feature for web requests that rewrites zero result queries
into a new search that might have results. I've started porting this same
feature over to API clients so it has a larger effect on our zero results
rate, but code review has turned up some indecision on whether this should be
enabled or disabled by default in the API. Either way the feature will be
configurable. I thought we should open this up to a larger audience: are there
any thoughts?
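For context, here is the idea sketched from the client side against the public
API. The real implementation is server-side in our search code, and the rewrite
shown here (dropping quotes and turning underscores into spaces) is only a
stand-in for whatever rewrite the feature actually applies:

# Client-side illustration only, not the real server-side feature.
import requests

API = "https://en.wikipedia.org/w/api.php"

def search(query, limit=10):
    params = {"action": "query", "list": "search", "srsearch": query,
              "srlimit": limit, "format": "json"}
    return requests.get(API, params=params).json()["query"]["search"]

def search_with_fallback(query):
    """Return (query actually used, results), retrying once with a rewrite."""
    results = search(query)
    if results:
        return query, results
    rewritten = query.replace('"', " ").replace("_", " ").strip()
    if rewritten and rewritten != query:
        return rewritten, search(rewritten)
    return query, results

print(search_with_fallback("House_of_Gurieli")[0])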
On 29/07/2015 19:26, Trey Jones wrote:
> (Thoughts are cloudy with a chance of brainstorming)
> Hey guys I saw part of your discussion on IRC about testing whether
> reverse indexes help. I couldn’t reply there at the time, so I started
> thinking about it. This unfortunately long email is the result. (Sorry.)
No problem, I like reading your mails :)
> While it would be good to know how the reverse index helps on a wiki
> of more manageable size like frwiki, I wouldn’t necessarily expect the
> patterns of typos to be the same between enwiki and frwiki (or any
> other language wiki)—language phonotactics & orthography, keyboard
> layout, mobile use, and user demographics could all have an effect on
> the type and frequency of typos. So a reverse index could generally be
> useful in one language and not in another—in theory it wouldn’t hurt
> to test specifically on any large wiki where the cost of adding the
> reverse index is non-trivial.
We have some technical restrictions here: if we activate this setting
on one wiki we'll need to reindex most of the wikis, because wikiA can
query wikiB's index, and if wikiB's index is not updated with the
correct settings the query will fail.
The cross-wiki queries I know of so far are:
- all wikis can query commons.wikimedia.org index
- itwiki will query all its sister projects (itwiktionary, itwikivoyage, ...)
- maybe more
So it's hard to work with mixed settings with the current architecture :(
> I’m trying to think of ways to extrapolate from a sample of some sort.
> I’m spit-balling and thinking through as I type—I don’t know if any of
> these are good ideas, but maybe one will lead to a better idea.
> Do we know what percentage of searches (in enwiki or in general) match
> article titles? We could extract article titles and search against
> those with and without a reverse index as a test.
> Or, is it possible to get a reasonably sized random subset of enwiki,
> say 10-20%? If so, you could run a sample of non-zero queries against
> it and determine that, say, 47% of queries that get results on the
> full wiki also get results on this partial wiki… and then run the zero
> queries with a reverse index and extrapolate.
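To make the extrapolation arithmetic concrete, a back-of-the-envelope helper
(the numbers in the example are made up):

def extrapolate_rescue_rate(partial_rescue_rate, partial_coverage):
    """
    partial_rescue_rate: fraction of zero-result queries that get results when
                         run with the reverse index against the partial index.
    partial_coverage: fraction of queries that get results on the full wiki
                      and also get results on the partial wiki (e.g. 0.47).
    Returns a rough estimate of the rescue rate on the full index.
    """
    return min(1.0, partial_rescue_rate / partial_coverage)

# e.g. 3% rescued on the partial index with 47% coverage -> roughly 6.4% on full
print(extrapolate_rescue_rate(0.03, 0.47))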
We can dump a subset of enwiki; the dump tool we use has a --limit
param. Unfortunately I have absolutely no idea whether the subset will be
representative. There is likely a phenomenon similar to db dumps: old
docs will be dumped first, and for Lucene old docs generally means docs
that have never been updated, in other words it will be pages that are not
> Hmm… if none of the relevant search elements rely on anything other
> than the presence of terms in a document, then you could make a
> “compact” version of enwiki, where each document keeps only one
> instance of each word in it. A quick hacky test on a handful of medium
> to longish documents gives compression of 30-50% per document, if
> that’s enough to matter. Of course, term frequency, proximity, and
> other things would be wildly skewed—but “is it in the index?” would work.
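The compacting step itself is trivial; a minimal sketch with naive lowercased
whitespace tokenization (the real analysis chain is more involved):

def compact(text):
    """Keep only the first occurrence of each token, so "is this term in the
    document?" still works while the text (and index) shrinks."""
    seen = set()
    kept = []
    for token in text.split():
        key = token.lower()
        if key not in seen:
            seen.add(key)
            kept.append(token)
    return " ".join(kept)

doc = "the quick brown fox jumps over the lazy dog the fox"
print(compact(doc))                      # "the quick brown fox jumps over lazy dog"
print(1 - len(compact(doc)) / len(doc))  # rough per-document compression ratio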
It's a good idea, but I don't know how to dump this info; there's no easy
way to dump the index lexicon in production.
Another (similar idea) would be to dump only the fields needed for the
suggester to work.
The suggester works with title and redirect only; in theory we could
dump only these fields, which would result in something like 200Mb gzip
files for enwiki. Unfortunately I don't have this option in the dump
tool yet. I think it's the best way to go, but:
- we need to change the dump tool to filter a selected set of fields
- I never tested this tool in production and I don't know if it'll hurt
perf. I guess it's OK because it's somewhat the same process as an
in-place reindex.
> Actually, if all you need is “is it in the index?” you could just dump
> a list of words in the index and run searches against that.
That's a bit trickier: we need to run the phrase suggester query, and it'd
be hard to simulate its behaviour. Hopefully we can run this "phrase
suggester" by hand with an elasticsearch request.
> Okay… here’s an idea: tokenize the zero-result queries and search
> individual tokens against a list of terms indexed in enwiki, with and
> without a reverse index.
The suggester works with shingles (word grams of size 1, 2 and 3). Maybe
it makes sense to run the queries against the word unigrams... but this
will definitely be harder than running the elasticsearch suggest query.
> None of these will give exact results, but various incarnations would
> give upper and lower bounds on the usefulness of the reverse index.
> For example, if only 0.05% of query tokens, in 0.07% of queries, are
> found only by the reverse index, it probably isn’t going to help. If
> 75% of them are, then it probably is.
To sum up, here is a reasonable process to check if the reverse field is
worth a try:
- Add an option to filter a subset of fields to dumpIndex
- Extract a subset of full text searches that returned zero result and
no suggestions (en, fr, de, it and es would be a good start?)
- Dump title and redirect fields from these wikis
- Import this data into an elasticsearch instance with the reverse field
activated (on labs?)
- Write a small script that runs phrase suggester queries
- Run the phrase suggester queries and count
Note that we will not be able to measure things like:
"search" being a better suggestion than "samech" for the query "saerch".
This seems impossible to check without human review. We could do another
run with queries where a suggestion was found and generate a diff that
will be reviewed by hand:
I got access to some logs and I've been slogging through the data. In
particular, I've partially analyzed a sample of 100K zero-result full_text
searches against enwiki, over the course of about an hour (2015-07-23
07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.
*TL;DR Summary: If these patterns hold for another sample (and across
languages), we should be able to get some decent mileage out of these (a
rough sketch of the last three filters is below):*
* - find sources of weird patterns and either ignore them, or contact the
source and redirect them to a more appropriate destination*
* - use language or character set detection to redirect queries to
other-language wikis*
* - filter the term "quot" from queries*
* - filter 14###########: from the front of queries*
* - replace _ with space in queries*
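A rough sketch of the last three filters (not production code; the exact
patterns are guesses based on the samples described in this thread):

import re

TIMESTAMP_PREFIX = re.compile(r"^\s*14\d{11}:\s*")  # e.g. "1436755654740:"
QUOT = re.compile(r"\bquot\b", re.IGNORECASE)       # leftover &quot; fragments

def clean_query(query):
    query = TIMESTAMP_PREFIX.sub("", query)  # strip 14###########: prefix
    query = QUOT.sub(" ", query)             # drop stray "quot" tokens
    query = query.replace("_", " ")          # underscores -> spaces
    return " ".join(query.split())           # normalize whitespace

assert clean_query("1436755654740:Sherlock_Holmes") == "Sherlock Holmes"
assert clean_query("quot Sherlock Holmes quot") == "Sherlock Holmes"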
All of this is somewhat rough, and exact numbers aren't guaranteed. Also
the categories may overlap. I also intend to look for these same patterns
from another sample from a different day and make sure they are more
general and not just temporary idiosyncrasies. I also plan to look through
other language wikis (e.g., Spanish and French to start) to see if there
are cross-linguistic patterns like these.
I think we have to somehow come to terms with the fact that some queries
don't deserve results, and maybe figure out the source of such
"illegitimate" queries and filter them. (I'd really like to be able to
track down the referrer, if there is one, for a lot of the weirder queries.)
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.
Some other categories of queries are below. The numbers are "<total
queries> / <unique queries>". Since this is a 100K sample of zero-result
queries, and zero-result queries are about 25% of all queries, each 1,000
of total queries here represents about 0.25% of all search queries (1,000
is 1% of the sample, and 1% of 25% is 0.25%).
253 / 171 string of numbers
3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, doi #s (see below), Arabic,
Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some
emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw
queries in other Latin-alphabet languages go by, too.
2630 / 2627 DOIs, all in quotes
3015 / 1017 have quot in them (which gets auto-corrected to "quote")
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we strip
"quot" we would get results for many of these. This must be coming from some
source that is adding quotes, then escaping them as &quot; and then stripping
the & and the ;. Weird.
7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles!
- Two each of 0:... and 6000:...
114 / 85 actual http(s):// URLs
488 / 244 URL-like things starting with www... and ending with .com, .ru,
211 / 132 other searches starting with “www.”
1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')
2457 / 2060 TV episodes (based on the presence of "S#E#"—that's season #, episode #)
8419 / 7523 AND boolean searches
703 / 701 OR boolean searches
- Many of these look auto-generated, esp in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries
6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries
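Most of these buckets are easy to re-count mechanically on another sample; a
rough sketch that reports "<total> / <unique>" per category (the regexes only
approximate the hand analysis above and won't reproduce the exact counts):

import re
from collections import defaultdict

PATTERNS = [
    ("doi",              re.compile(r"10\.\d{4,9}/\S+")),
    ("timestamp_prefix", re.compile(r"^\s*14\d{11}:")),
    ("url",              re.compile(r"^(https?://|www\.)", re.I)),
    ("tv_episode",       re.compile(r"\bS\d{1,2}E\d{1,2}\b", re.I)),
    ("boolean_and",      re.compile(r"\bAND\b")),
    ("boolean_or",       re.compile(r"\bOR\b")),
    ("quot",             re.compile(r"\bquot\b", re.I)),
    ("underscores",      re.compile(r"_")),
    ("numbers_only",     re.compile(r"^[\d\s.,:-]+$")),
]

def bucket_counts(queries):
    totals, uniques = defaultdict(int), defaultdict(set)
    for q in queries:
        for name, pattern in PATTERNS:
            if pattern.search(q):
                totals[name] += 1
                uniques[name].add(q)
                break  # first matching bucket wins
    return {name: (totals[name], len(uniques[name])) for name in totals}

sample = ["1436755654740:Sherlock Holmes", '"10.1371/journal.pone.0012345"',
          "Dounload feer game", "Breaking Bad S5E14"]
for name, (total, unique) in bucket_counts(sample).items():
    print("%d / %d  %s" % (total, unique, name))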
Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in
- lots of "building up" searches (and these are all marked full_text), for
achevments of h
achevments of he
achevments of hell
achevments of helle
achevments of hellen
achevments of hellen k
achevments of hellen k
achevments of hellen kell
achevments of hellen kelle
achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results
Finally, I did see a bunch of typos, but I didn't try to quantify them
because I was digging into all of these other interesting patterns.
Have a good weekend.
Software Engineer, Discovery
This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 22 July 2015 at 10:55, David Causse <dcausse(a)wikimedia.org> wrote:
>> On 22/07/2015 15:21, Oliver Keyes wrote:
>>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>>> good to be CCd :).
>>> FWIW, this kind of detail on the specific things we're doing is
>>> missing from the main search mailing list and would be very useful
>>> there to inform people.
>> I agree. My intent right now is still to learn from each other and
>> build/use a friendly environment where an engineer with an NLP background
>> like Trey can work efficiently. When things are clearer it'd be great to
>> share our plan.
>>> Oliver is already handling the executor IDs and distinguishing full
>>> and prefix search, so nyah ;p.
>> Just to be sure: does this mean that search counts will be reduced to one
>> per executorID:
>> - all requests with the same executorID return zero results -> add 1 to the
>> zero result counter
>> - if one of the requests returns a result -> do not increment the zero
>> result counter
>> If yes, I think this will be the killer patch for Q1 :)
> Executor IDs are stored and if a match is found in executor IDs <=120
> seconds after that one, the later outcome is considered "the outcome".
> If not, we assume no second round-trip was made and so go with
> whatever happened first.
> So if you make a request and it round-trips once and fails, failure.
> Round-trip once and succeeds, success. Round-trip twice and fail both
> times, failure. Round-trip twice and fail the first time and succeed
> the second - one success, zero failures :). Erik wrote it, and I grok
> the logic.
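Spelling that logic out as a sketch (the record format and names here are made
up; Erik's real implementation differs):

# For each executor ID: if a second round-trip arrives <=120 seconds after the
# first, its outcome wins; otherwise we keep the outcome of the only request.
from collections import defaultdict

WINDOW_SECONDS = 120

def final_outcomes(log_records):
    """log_records: iterable of (executor_id, unix_timestamp, result_count)."""
    by_executor = defaultdict(list)
    for executor_id, ts, result_count in log_records:
        by_executor[executor_id].append((ts, result_count))

    outcomes = {}
    for executor_id, events in by_executor.items():
        events.sort()
        first_ts, outcome = events[0]
        for ts, result_count in events[1:]:
            if ts - first_ts <= WINDOW_SECONDS:
                outcome = result_count  # the later outcome is "the outcome"
        outcomes[executor_id] = outcome
    return outcomes

# First attempt fails, retry 30s later succeeds -> counted as a success.
print(final_outcomes([("abc", 1000, 0), ("abc", 1030, 12)]))  # {'abc': 12}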
>>> On the language detection - actually
>>> Kolkus and Rehurek published a work in 2009 that handles small amounts
>>> of text really really well (n-gram based approaches /suck at this/)
>>> and there's a Java implementation I've been playing with. Want me to
>>> run it across some search strings and we can look at the results? Or
>>> just send the code across.
>> If you ask I'd say both! ;)
>> We evaluated this kind of dictionary-based language detection (but not
>> this one specifically); the problem for us was mostly due to performance: it
>> takes time to tokenize the input string correctly and the dictionary we used
>> was rather big. But we worked mainly on large content (web news, press
>> articles). In our case input strings should be very small, so it makes more
>> sense. We should be able to train the dictionary against the "all titles in
>> ns0" dumps.
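A naive illustration of the dictionary-based idea (this is not the Kolkus &
Rehurek method or the Java implementation mentioned above, just the general
notion of scoring a short query against per-language vocabularies, e.g. built
from each wiki's "all titles in ns0" dump):

def detect_language(query, vocabularies):
    """vocabularies: dict of language code -> set of lowercased words."""
    tokens = [t for t in query.lower().split() if t]
    if not tokens or not vocabularies:
        return None
    scores = {
        lang: sum(1 for t in tokens if t in vocab) / len(tokens)
        for lang, vocab in vocabularies.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

vocabularies = {
    "en": {"the", "george", "washington", "president"},
    "fr": {"le", "la", "georges", "président"},
}
print(detect_language("george washington president", vocabularies))  # "en"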
>> This is also a great example to explain why I feel stuck sometimes:
>> How will you be able to test it?
>> - I'm not allowed to download search logs locally.
>> - I think I won't be able to install Java and play with this kind of tool
>> on fluorine.
> Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
> right? If yes to all three, I don't see a problem with me squirting
> you a sample of logs (and the Java). I figure if we find the
> methodology works we can look at speedups to the code, which is a lot
> easier a task than looking at fast code and trying to improve the methodology.
>> Another point:
>> concerning the tasks described below, I think this overlaps with
>> analytics tasks (because it's mainly related to learning from search
>> logs). I don't know how you work today, and maybe this is something you've
>> already done or is obviously wrong.
>> I think you're one of the best people today to help us sort this out;
>> your feedback concerning the following lines will be greatly appreciated.
> Yes! Okay, thoughts on the below:
> 1. Build a search log parser - we sort of have that through the
> streaming python script. It depends whether you mean a literal parser
> or something to pick out all the "important" bits. See point 4.
> 2. Big machine: I'd love this. But see point 4.
> 3. Improve search logs for us: when we say improve for us do we mean
> for analytics/improvements purposes? Because if so we've been talking
> about having the logs in HDFS which would make things pretty easy for
> all and sundry and avoid the need for a parser.
> One way of neatly handling all of this would be:
> 1. Get the logs in a format that has the fields we want and stream it
> into Hadoop. No parser necessary.
> 2. Stick the big-ass machine in the analytics cluster, where it has
> default access to Hadoop and can grab data trivially, but doesn't have
> to break anyone else's stuff.
> 3. Fin.
> What am I missing? Other than "setting up a MediaWiki kafka client is
> going to be kind of a bit of work".
>>>> On 22/07/2015 14:38, David Causse wrote:
>>>>> It's still not very clear in my mind, but things could look like:
>>>>> * Epic: Build a toolbox to learn from search logs
>>>>> - Create a script to run search queries against the production
>>>>> - Build a search log parser that provides all the needed details:
>>>>> search type, wiki origin, target search index, search query, search
>>>>> ID, number of results, offset of the results (search page)
>>>>> (side note: Erik, will it be possible to pass the queryID from
>>>>> page to page when the user clicks "next page"?)
>>>>> - Have a decent machine (64GB RAM would be great) in the
>>>>> cluster where we can
>>>>> - download production search logs
>>>>> - install the tools we want
>>>>> - stress it without being afraid of killing it
>>>>> - do all the stuff we want to learn from data and search logs
>>>>> * Epic: Improve search logs for us
>>>>> - Add an "incognito parameter" to cirrus that could be used by
>>>>> toolbox script not to pollute our search logs when running our "search
>>>>> - Add a log when the user click on a search result to have a
>>>>> between the queryID, the result choosen and the offset of the chosen
>>>>> link in
>>>>> the result list.
>>>>> - This task is certainly complex and highly depends on the
>>>>> I don't know if we will be able to track this down on all clients but
>>>>> be great for us.
>>>>> - More things will be added as we learn
>>>>> * Epic: start to measure and control relevance
>>>>> - Create a corpus of search queries for each wiki with their
>>>>> expected results
>>>>> - Run these queries weekly/monthly and compute the F1-Score for
>>>>> each wiki
>>>>> - Continuously enhance the search queries corpus
>>>>> - Provide a weekly/monthly perf score for each wiki
>>>>> As you can see this is mostly about tools; I propose to start with the
>>>>> tools and think later about how we could make this more real-time.
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
So, the data for the Search dashboards
(http://searchdata.wmflabs.org/metrics/) comes from a variety of
sources, one of which is the daily logs of all Cirrus search requests
- about 46GB of data a day. We set up a pipeline over this to report the
"zero" rate - how many queries happen with zero results. This was a
pretty shaky pipeline because it was an ultra-urgent, thrown-together job.
Good news: my prediction that it needed work was accurate. Bad news:
my prediction that it needed work was accurate ;).
When Erik and I went through all of the scripts and rewrote
them on the 15th we discovered a lot of maintenance tasks that were
being identified as searches. These are now being excluded, but we
have to backfill 1.5 months of data. I've chosen to eliminate the old
data and then backfill, because it means we avoid having data from
multiple, dissonant software versions, and because it just makes the
backfilling task a bit easier.
As a result, the dashboards may look a bit odd over the next couple of
days; they have data from the 15th onwards that we're comfortable
with, but we are gradually backfilling from 1 June to 14 July - starting
on 1 June. So at the moment we have 1 June and 15-21 July. Weird. And
then 1-2 June plus the 15th onwards, and so on.
So expect to see increasingly less weird graphs, until the point where
they're back to normal (but more consistent and sane-looking). Until then: yeah,
they're gonna look a bit weird.