We've had a lot of ideas floating around over the past week or two about
what to do in the final weeks of the quarter towards tackling the zero
results rate problem. This morning the engineering team had a 25-minute
meeting to coalesce these ideas into a plan and sync up. We took notes in
this etherpad: https://etherpad.wikimedia.org/p/nextupforsearch
The short summary of the meeting is that we'll run a test which relaxes the
AND operator for common terms in queries. This should improve natural
language queries by reducing how important words like "the", "a", etc. are
to the query, thus focusing on the essence of the query. It also means that
pages which don't contain these common terms, but do contain the core
terms, can now be returned in results.
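For a rough flavour of what "relaxing AND for common terms" can look like, here's a hedged sketch of an Elasticsearch common terms query body. The field name and cutoff frequency are illustrative assumptions, not the values used in the actual test:

```python
import json

def common_terms_query(text, field="text", cutoff=0.001):
    """Build a query body that requires rare terms to match (AND) but
    treats very common terms ("the", "a", ...) as optional boosts.

    The field name and cutoff_frequency here are hypothetical."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    # Terms above this document frequency count as "common".
                    "cutoff_frequency": cutoff,
                    # Rare terms must all match...
                    "low_freq_operator": "and",
                    # ...common terms only contribute to scoring.
                    "high_freq_operator": "or",
                }
            }
        }
    }

print(json.dumps(common_terms_query("who is the president of france"), indent=2))
```

With a plain AND query, a page missing the literal word "the" would be excluded; with this shape, only "president" and "france" (the low-frequency terms) are required.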
This work is tracked in the following series of tasks, the structure of
which should now be very familiar to you all:
- T112178 <https://phabricator.wikimedia.org/T112178>: Relax 'AND'
operator with the common term query
- T112581 <https://phabricator.wikimedia.org/T112581>: Run A/B test on
relaxing AND operator for search (test starting on 2015-09-22)
- T112582 <https://phabricator.wikimedia.org/T112582>: Validate data for
AND operator A/B test (on or after 2015-09-23)
- T112583 <https://phabricator.wikimedia.org/T112583>: Analyse results
of AND operator A/B test (on or after 2015-09-29)
What this does mean is that we've probably got a bunch of tests lined up to
start at the same time. In principle this isn't a problem, but if the tests
overlap it can cause difficulties. This will be discussed in tomorrow's
meeting.
As always, if there are any questions, let me know!
Lead Product Manager, Discovery
I've done further analysis on the ~1400 zero-results non-DOI query corpus,
looking at the effects of perfect (or at least human-level) language
detection, and the effects of running all queries against many wikis.
> More than 85% of failed queries to enwiki are in English, or are not in a
> particular language. Only about 35% of non-English queries in some language
> (<4.5% of zero-results queries), if funneled to the right-language wiki,
> get any results.
> The types of queries most likely to get results from the non-enwikis are
> names and queries in English. There are lots of English words in
> non-English wikis (enough that they can do decent spelling correction!),
> and the idiosyncrasies of language processing on other wikis allow certain
> classes of typos in names and English words to match, or the typos happen
> to exist uncorrected in the non-enwiki.
> Perhaps a better approach to handling non-English queries is user-specified
> alternate languages.
Software Engineer, Discovery
We currently have a data outage on our dashboards - they display, but
they're missing the last few days of data.
The good news is that we know exactly what happened here; as part of
our work to (amusingly enough) make the data pipeline here more robust
and standardised, we switched all of our data retrieval scripts over
to a new project and repository (previously they'd lived in the repo
for the dashboard they referred to, which doesn't scale). A bug in the
shell script that tied them all together meant none of them ran - and
of course we switched everything over immediately before a long
weekend. Doh ;p.
The original bug has a patchset awaiting review, and as soon as
it's +2d we're going to begin backfilling the datasets. You can follow
our progress on that at https://phabricator.wikimedia.org/T111749
I've written up my analysis of the Elasticsearch language detection plugin
that Erik recently enabled:
The short version is that it really likes Romanian (and Italian, and has a
bit of a thing for French), and precision on English is great, but recall
is poor (probably because of all the typos and other crap that goes to
enwiki but is still technically "English"). Chinese and Arabic are good.
I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero results
rate (i.e., simulate sending queries to the right place and see how much of
a difference it makes).
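The simulation in (b) can be quite simple: replay the zero-results corpus through a hypothetical language router and count how many queries would still miss. Everything below (the queries, the detector output, and the "would this wiki return results?" oracle) is made up for illustration:

```python
def zero_results_rate(queries, route, has_results):
    """Fraction of queries that still get zero results after routing
    each query to the wiki chosen by `route`."""
    misses = sum(1 for q in queries if not has_results(route(q), q))
    return misses / len(queries)

# Hypothetical oracle: which (wiki, query) pairs would return results.
RESULTS = {
    ("frwiki", "le chat noir"),
    ("enwiki", "who is the president"),
    ("dewiki", "der zauberberg"),
}
has_results = lambda wiki, q: (wiki, q) in RESULTS

# Pretend language-detector output; unknowns fall back to enwiki.
DETECTED = {"le chat noir": "frwiki", "der zauberberg": "dewiki"}
route_all_enwiki = lambda q: "enwiki"
route_detected = lambda q: DETECTED.get(q, "enwiki")

queries = ["le chat noir", "who is the president", "der zauberberg", "asdfgh"]
print(zero_results_rate(queries, route_all_enwiki, has_results))  # 0.75
print(zero_results_rate(queries, route_detected, has_results))    # 0.25
```

Swapping in the real detector for `DETECTED` and real search replays for `RESULTS` would give an upper bound on how much of the zero-results rate language routing can recover.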
Moderately pretty pictures included.
Software Engineer, Discovery
Cross-posting from wikidata-l.
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 7 September 2015 at 15:29
Subject: Announcing the release of the Wikidata Query Service
The Discovery Department at the Wikimedia Foundation is pleased to announce
the release of the Wikidata Query Service
<https://www.mediawiki.org/wiki/Wikidata_query_service>! You can find the
interface for the service at https://query.wikidata.org.
The Wikidata Query Service is designed to let users run queries on the data
contained in Wikidata. The service uses SPARQL
<https://en.wikipedia.org/wiki/SPARQL> as the query language. You can see
some example queries in the user manual.
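For a taste of what talking to the service looks like, here's a hedged sketch. The SPARQL uses real Wikidata identifiers (P31 = "instance of", Q146 = "house cat"), but the query itself is illustrative rather than taken from the manual, and the endpoint path and parameters are assumptions about the service's HTTP interface:

```python
import urllib.parse

# Illustrative SPARQL: items that are instances of (P31) house cat (Q146).
SPARQL = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
LIMIT 10
"""

# Assumed HTTP interface: GET with a query= parameter; format=json
# requests SPARQL JSON results.
url = ("https://query.wikidata.org/sparql?"
       + urllib.parse.urlencode({"query": SPARQL, "format": "json"}))
print(url)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) would return the result bindings as JSON; the web interface at https://query.wikidata.org wraps the same kind of request in an editor.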
Right now, the service is still in beta. This means that our goal is to
monitor usage of the service and collect feedback about what people
think should be next. To do that, we've created the Wikidata Query Service
dashboard <https://searchdata.wmflabs.org/wdqs/> to track usage of the
service, and we're in the process
<https://phabricator.wikimedia.org/T111403> of setting up a feedback
mechanism for users of the service. Once we've monitored usage of the
service for a while and gathered user feedback, we'll decide on what's next
for development of the service.
If you have any feedback, suggestions, or comments, please do send an email
to the Discovery Department's public mailing list,
Lead Product Manager, Discovery
A few of us met this morning, to ensure that we have a plan for everyone in
the department to be productive on Gerrit Cleanup Day (Wednesday
2015-09-23). We think most folks are accounted for, and came up with ideas.
I added Gerrit Cleanup Day as an upcoming event on our wiki page, and
created a page with the proposed plan that came out of this morning's
meeting. Action items prior to the day (mostly listing them here for my own
reference):
- Erik will coordinate with the developers to help them be productive
- Kevin will ask Quim to try to get David paired up with someone in his
timezone (maybe Trey also)
- Kevin will talk to Oliver, who can guide Mikhail
- Kevin will get a gerrit account, to be able to +1/-1
- Kevin will organize some kind of kickoff meeting the morning of the day
- Kevin will check with Moiz
- Kevin will check with Wes to see what he is planning
Agile Coach, Wikimedia Foundation