We've had a lot of ideas floating around over the past week or two about
what to do in the final weeks of the quarter towards tackling the zero
results rate problem. This morning the engineering team had a 25 minute
meeting to coalesce these ideas into a plan and sync up. We took notes in
this etherpad: https://etherpad.wikimedia.org/p/nextupforsearch
The short summary of the meeting is that we will run a test that relaxes the
AND operator for common terms in queries. This should improve
natural language queries by reducing how important words like "the", "a",
etc. are to the query, thus focussing in on the essence of the query. This
also means that pages that don't contain these common terms, but only
contain the core terms, could now be returned in results.
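To make the idea concrete, here is a toy sketch in Python of what relaxing the AND operator for common terms means for matching. It is only an illustration: the list of common terms, the sample query, and the sample page text are all made up, and this is not how CirrusSearch or Elasticsearch actually implement it.

```python
# Toy illustration only. COMMON_TERMS is a hypothetical list; a real search
# engine would decide which terms are "common" from term frequencies.
COMMON_TERMS = {"the", "a", "an", "of", "is", "in", "to"}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def matches_strict_and(query, doc):
    """Strict AND: every query term must appear in the document."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))


def matches_relaxed_and(query, doc):
    """Relaxed AND: only the non-common query terms must appear."""
    doc_terms = set(tokenize(doc))
    core = [t for t in tokenize(query) if t not in COMMON_TERMS]
    # If the query is nothing but common terms, fall back to strict AND.
    if not core:
        return matches_strict_and(query, doc)
    return all(term in doc_terms for term in core)


query = "the history of a printing press"
doc = "printing press history and early typography"

strict = matches_strict_and(query, doc)    # fails: "the", "of", "a" are missing
relaxed = matches_relaxed_and(query, doc)  # passes: all core terms are present
print(strict, relaxed)
```

The page in this example lacks "the", "of" and "a", so strict AND rejects it, while the relaxed version accepts it because the core terms are all there.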
This work is tracked in the following series of tasks, the structure of
which should now be very familiar to you all:
- T112178 <https://phabricator.wikimedia.org/T112178>: Relax 'AND'
operator with the common term query
- T112581 <https://phabricator.wikimedia.org/T112581>: Run A/B test on
relaxing AND operator for search (test starting on 2015-09-22)
- T112582 <https://phabricator.wikimedia.org/T112582>: Validate data for
AND operator A/B test (on or after 2015-09-23)
- T112583 <https://phabricator.wikimedia.org/T112583>: Analyse results
of AND operator A/B test (on or after 2015-09-29)
What this does mean is that we've probably got a bunch of tests lined up to
start at the same time. In principle this isn't a problem, but if the tests
overlap it can cause difficulties. This will be discussed in tomorrow's
analysis meeting.
As always, if there are any questions, let me know!
Thanks,
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Hi Everyone,
I've done further analysis on the ~1400 zero-results non-DOI query corpus,
looking at the effects of perfect (or at least human-level) language
detection, and the effects of running all queries against many wikis.
In summary:
> More than 85% of failed queries to enwiki are in English, or are not in a
> particular language. Only about 35% of non-English queries in some language
> (<4.5% of zero-results queries), if funneled to the right language wiki,
> get any results.
>
> The types of queries most likely to get results from the non-enwikis are
> names and queries in English. There are lots of English words in
> non-English wikis (enough that they can do decent spelling correction!),
> and the idiosyncrasies of language processing on other wikis allow certain
> classes of typos in names and English words to match, or the typos happen
> to exist uncorrected in the non-enwiki.
>
> Perhaps a better approach to handling non-English queries is user-specified
> alternate languages.
More details:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_…
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hey all,
We currently have a data outage on our dashboards - they display, but
we're missing the last few days.
The good news is that we know exactly what happened here; as part of
our work to (amusingly enough) make the data pipeline here more robust
and standardised, we switched all of our data retrieval scripts over
to a new project and repository (previously they'd lived in the repo
for the dashboard they referred to, which doesn't scale). A bug in the
shell script that tied them all together meant none of them ran - and
of course we switched everything over immediately before a long
weekend. Doh ;p.
The original bug has a patchset awaiting review, and as soon as
it's +2'd we're going to begin backfilling the datasets. You can follow
our progress on that at https://phabricator.wikimedia.org/T111749
Thanks,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
I've written up my analysis of the ElasticSearch language detection plugin
that Erik recently enabled:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…
The short version is that it really likes Romanian (and Italian, and has a
bit of a thing for French), and precision on English is great, but recall
is poor (probably because of all the typos and other crap that go to enwiki
that is still technically "English"). Chinese and Arabic are good.
I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero results
rate (i.e., simulate sending queries to the right place and see how much of
a difference it makes).
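For anyone who wants the precision and recall terms spelled out, here is a minimal Python sketch of how they could be computed per language from hand-labelled queries. The function name and the sample data are made up for illustration; they are not the numbers or the code from the write-up.

```python
from collections import Counter


def per_language_precision_recall(pairs):
    """Compute (precision, recall) per language.

    pairs: iterable of (true_language, detected_language) tuples.
    A value is None when it is undefined (nothing detected / nothing true).
    """
    true_counts = Counter()
    detected_counts = Counter()
    correct_counts = Counter()
    for true_lang, detected_lang in pairs:
        true_counts[true_lang] += 1
        detected_counts[detected_lang] += 1
        if true_lang == detected_lang:
            correct_counts[true_lang] += 1

    scores = {}
    for lang in set(true_counts) | set(detected_counts):
        detected = detected_counts[lang]
        actual = true_counts[lang]
        precision = correct_counts[lang] / detected if detected else None
        recall = correct_counts[lang] / actual if actual else None
        scores[lang] = (precision, recall)
    return scores


# Made-up sample: (language a human assigned, language the detector guessed)
sample = [
    ("en", "en"), ("en", "ro"), ("en", "it"), ("en", "en"),
    ("fr", "fr"), ("ro", "ro"),
]
scores = per_language_precision_recall(sample)
for lang, (precision, recall) in sorted(scores.items()):
    print(lang, precision, recall)
```

In this invented sample, everything labelled English really is English (high precision), but half of the English queries get labelled as something else (low recall), which is the same shape of problem described above.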
Moderately pretty pictures included.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Cross-posting from wikidata-l.
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 7 September 2015 at 15:29
Subject: Announcing the release of the Wikidata Query Service
To: wikidata-l(a)lists.wikimedia.org
The Discovery Department at the Wikimedia Foundation is pleased to announce
the release of the Wikidata Query Service
<https://www.mediawiki.org/wiki/Wikidata_query_service>! You can find the
interface for the service at https://query.wikidata.org.
The Wikidata Query Service is designed to let users run queries on the data
contained in Wikidata. The service uses SPARQL
<https://en.wikipedia.org/wiki/SPARQL> as the query language. You can see
some example queries in the user manual
<https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual>.
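As a rough illustration of what sending a query to the service might look like from a script, here is a small Python sketch that only builds the request URL and does not contact the service. The endpoint path, the `format` parameter, and the prefixes in the sample query are assumptions for the sake of the example; the user manual is the authoritative source for all three.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the SPARQL endpoint path. Check the user manual for the real one.
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: five items that are an instance of (P31) house cat (Q146).
# The wdt:/wd: prefixes are assumed to be predefined by the service.
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"


def build_query_url(sparql, endpoint=ENDPOINT):
    """Return a GET URL carrying the SPARQL text and asking for JSON results."""
    # Assumption: the service accepts "query" and "format" as URL parameters.
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


url = build_query_url(QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client; the sketch stops short of that so it can be run offline.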
Right now, the service is still in beta. This means that our goal
<https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q2_Goals#Wikid…>
is to monitor usage of the service and collect feedback about what people
think should be next. To do that, we've created the Wikidata Query Service
dashboard <https://searchdata.wmflabs.org/wdqs/> to track usage of the
service, and we're in the process
<https://phabricator.wikimedia.org/T111403> of setting up a feedback
mechanism for users of the service. Once we've monitored the usage of
the service for a while and gathered user feedback, we'll decide on what's next
for development of the service.
If you have any feedback, suggestions, or comments, please do send an email
to the Discovery Department's public mailing list,
wikimedia-search(a)lists.wikimedia.org.
Thanks,
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Hi all,
If you've been to http://searchdata.wmflabs.org/ recently, you would have
noticed that we have a new dashboard (and a work-in-progress facelift).
Introducing… The Wikidata Query Service dashboard:
http://searchdata.wmflabs.org/wdqs/ ! Yay! Hopefully this will help the
WDQS team as they continue their work on that awesome project.
As with the Search Metrics dashboard
<http://searchdata.wmflabs.org/metrics/>, we welcome constructive criticism
and feature suggestions with an open mind.
One suggestion that I'm going to look into is finding out how many people
who visited the homepage ended up submitting a query. We also have failure
stats, so those will be showing up in the near future.
Thank you,
Mikhail // Junior Swifty
--
*Mikhail Popov* // Data Analyst, The Swifties, Discovery
<https://www.mediawiki.org/wiki/Wikimedia_Discovery>
https://wikimediafoundation.org/
*Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment.* Donate
<https://donate.wikimedia.org/>.
A few of us met this morning, to ensure that we have a plan for everyone in
the department to be productive on Gerrit Cleanup Day (Wednesday
2015-09-23). We think most folks are accounted for, and came up with ideas
for others.
I added Gerrit Cleanup Day as an upcoming event on our wiki page[1], and
created a page with the proposed plan[2] that came out of this morning's
meeting.
Action items prior to the day (mostly listing them here for my own
convenience):
- Erik will coordinate with the developers to help them be productive
- Kevin will ask Quim to try to get David paired up with someone in his
timezone (maybe Trey also)
- Kevin will talk to Oliver, who can guide Mikhail
- Kevin will get a gerrit account, to be able to +1/-1
- Kevin will organize some kind of kickoff meeting the morning of the
big day
- Kevin will check with Moiz
- Kevin will check with Wes to see what he is planning
[1] https://www.mediawiki.org/wiki/Wikimedia_Discovery#Upcoming_events
[2] https://www.mediawiki.org/wiki/Discovery_plans_for_gerrit_cleanup_day_2015
Kevin Smith
Agile Coach, Wikimedia Foundation