Why do people use Google instead of Wikipedia search? Two obvious answers
come to mind: Google gives better results, and users are just used to using
Google 'cause it's useful.
So I set out to see how search on Wikipedia compares to Google for queries
we can recover from referrals from Google.
Disclaimers: we don't know what personalized results people got, whether
they liked the result, or what they intended to search for; all we have is
the wiki page they landed on. Also, results vary depending on which Google
you start from—which I didn't consider until after the experiments and
analysis were underway.
Summary: for about 60% of queries, Wikipedia search does fine. (And about a
quarter of all searches are exact matches for Wikipedia article titles.)
Trouble areas identified include: typos in the first two characters,
question marks, abbreviations and other ambiguous terms, quotes, questions,
formulaic queries, and non-Latin diacritics.
I have a list of about 20 suggestions for projects from small to enormous
that we could tackle to improve results (plus another plug for a Relevance
Best factoid: someone searched for *what is hummus* and ended up on the
wiki page for Hillary Clinton.
Full details here:
Software Engineer, Discovery
Hi Discovery team,
the Gerrit Cleanup Day on Wed 23rd is approaching fast - only one week
left. More info: https://phabricator.wikimedia.org/T88531
Do you feel prepared for the day, and do all team members know what to do?
If not, what are you missing and how can we help?
Some Gerrit queries for each team are listed under "Gerrit queries per
team/area" in https://phabricator.wikimedia.org/T88531
Are they helpful and a good start? Or do they miss some areas (or do
you have existing Gerrit team queries to use instead or to "integrate",
e.g. for parts of MediaWiki core you might work on)?
Also, which person will be the main team contact for the day (and
available in #wikimedia-dev on IRC) and help organize review work in
your areas, so other teams could easily reach out?
Some team plates are emptier than others so they're wondering where and
how to lend a helping hand (to find out in advance, due to timezones).
Thanks for your help to make the Gerrit Cleanup day a success!
Andre Klapper | Wikimedia Bugwrangler
The PHP engine used in prod by the WMF, HHVM, has built-in support for
cooperative (non-preemptive) concurrency via async/await keywords. Over
the weekend I spent some time converting the Elastica client library we use
to work asynchronously, which would essentially let us continue
performing other calculations in the web request while network requests are
processing. I've only ported over the client library, not the
CirrusSearch code. Also, this is not a complete port: a couple of
code paths work, but most of the test suite still fails.
The most obvious place we could see a benefit from this is when multiple
queries are issued to elasticsearch from a single web request. If the
second query doesn't depend on the results of the first it can be issued in
parallel. This is actually a somewhat common use case, for example doing a
full text and a title search in the same request. I'm wary of making much
of a guess in terms of actual latency reduction we could expect, but maybe
on the order of 50 to 100 ms in cases where we currently perform requests
serially and we have enough work to process. Really it's hard to say at this
In addition to making some existing code faster, having the ability to do
multiple network operations in an async manner opens up other possibilities
when we are implementing things in the future. In closing, this currently
isn't going anywhere; it was just something interesting to toy with. I
think it could be quite interesting to investigate further.
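The parallel-query idea described above can be sketched in Python with asyncio. This is only an illustration of the general technique, not the HHVM/Elastica port: `fake_search` is a hypothetical stand-in for a network call, and the 50 ms latencies are made-up numbers chosen for the demo.

```python
import asyncio
import time


async def fake_search(name, latency_s):
    """Hypothetical stand-in for a network request to a search backend."""
    await asyncio.sleep(latency_s)  # simulates waiting on the network
    return f"{name} results"


async def search_serial():
    # Each query waits for the previous one to finish before starting.
    full_text = await fake_search("full_text", 0.05)
    title = await fake_search("title", 0.05)
    return [full_text, title]


async def search_concurrent():
    # The two queries do not depend on each other, so both can be
    # in flight at the same time; gather preserves argument order.
    results = await asyncio.gather(
        fake_search("full_text", 0.05),
        fake_search("title", 0.05),
    )
    return list(results)


def timed(coro_fn):
    """Run a coroutine function to completion and report elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Calling `timed(search_serial)` and `timed(search_concurrent)` should return the same results, with the concurrent version taking roughly the time of the slowest single query rather than the sum of both.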
Late last week while looking over our existing scoring methods I was
thinking that while counting incoming links is nice, a couple guys
dominated search with (among other things) a better way to judge the
quality of incoming links, aka PageRank.
PageRank takes a very simple input: it just needs a list of all links
between pages. We happen to already store all of these in elasticsearch. I
wrote a few scripts to suck out the full enwiki graph (~400M edges), ship
it over to stat1002, throw it into hadoop, and crunch it with a few hundred
cores. The end result is a score for every NS_MAIN page in enwiki based on
the quality of incoming links.
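For readers unfamiliar with the computation, here is a minimal single-machine sketch of PageRank by power iteration over a list of (source, target) links. It is not the Hadoop job described above; the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are conventional choices assumed for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) link pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            # Each page splits its current rank equally among its links.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

On a toy graph such as `[("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]`, page C ends up with the highest score, and A outranks B and D because its one incoming link comes from the highly ranked C.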
I've taken these calculated PageRank scores and used them as the scoring
method for search-as-you-type for http://en-suggesty.wmflabs.org.
Overall this seems promising as another scoring metric to integrate into our
search results. I'm not sure yet how to figure out things like how much weight
PageRank should have in the score. This might be yet another thing where
building out our relevance lab would enable us to make more informed
Overall I think some sort of pipeline from hadoop into our scoring system
could be quite useful. The initial idea seems to be to crunch data in
hadoop, stuff it into a read-only api, and then query it back out at
indexing time in elasticsearch to be held within the ES docs. I'm not sure
what the best way will be, but having a simple and repeatable way to
calculate scoring info in hadoop and ship that into ES will probably become
more and more important.
Cross posting to discovery
---------- Forwarded message ----------
From: Tomasz Finc <tfinc(a)wikimedia.org>
Date: Thu, Sep 17, 2015 at 12:26 PM
Subject: Announcing the launch of Maps
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Cc: Yuri Astrakhan <yastrakhan(a)wikimedia.org>, Max Semenik <
The Discovery Department has launched an experimental tile and static maps
service available at https://maps.wikimedia.org.
Using this service you can browse and embed map tiles into your own tools
using OpenStreetMap data. Currently, we handle traffic from *.wmflabs.org
and *.wikivoyage.org (referrer header must be either missing or set to
these values) but we would like to open it up to Wikipedia traffic if we
see enough use. Our hope is that this service fits the needs of the
numerous maps developers and tool authors who have asked for a WMF hosted
tile service with an initial focus on WikiVoyage.
We'd love for you to try our new service, experiment with writing tools using
our tiles, and give us feedback <https://www.mediawiki.org/wiki/Talk:Maps>.
If you've built a tool using OpenStreetMap-based imagery then using our
service is a simple drop-in replacement.
Getting started is as easy as
How can you help?
* Adapt your labs tool to use this service - for example, use Leaflet js
library and point it to https://maps.wikimedia.org
* File bugs in Phabricator
* Provide us feedback to help guide future features
* Improve our map style <https://github.com/kartotherian/osm-bright.tm2>
* Improve our data extraction
Based on usage and your feedback, the Discovery team
<https://www.mediawiki.org/wiki/Discovery> will decide how to proceed.
We could add more data sources (both vector and raster), work on additional
services such as static maps or geosearch, work on supporting all
languages, switch to client-side WebGL rendering, etc. Please help us
decide what is most important.
https://www.mediawiki.org/wiki/Maps has more about the project and related
== In Depth ==
Tiles are served from https://maps.wikimedia.org, but can only be accessed
from subdomains of *.wmflabs.org and *.wikivoyage.org. Kartotherian
can produce tiles as images (png), and as raw vector data (PBF Mapbox
format or json):
Additionally, Kartotherian can produce snapshot (static) images of any
location, scaling, and zoom level with
For example, to get an image centered at 42,-3.14, at zoom level 4, size
800x600, use https://maps.wikimedia.org/img/osm-intl,4,42,-3.14,800x600.png
(copy/paste the link, or else it might not work due to referrer
Do note that the static feature is highly experimental right now.
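As a small sketch, the static-image example above can be turned into a helper. The parameter order (style, zoom, latitude, longitude, size) is inferred from that single example URL, so treat it as an assumption; `static_map_url` is a hypothetical helper name, not part of the service.

```python
BASE_URL = "https://maps.wikimedia.org"


def static_map_url(lat, lon, zoom, width, height, style="osm-intl"):
    """Build a static-image URL following the pattern of the example URL.

    The field order is an assumption inferred from that one example.
    """
    return f"{BASE_URL}/img/{style},{zoom},{lat},{lon},{width}x{height}.png"
```

Calling `static_map_url(42, -3.14, 4, 800, 600)` reproduces the example URL given above.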
We would like to thank WMF Ops (especially Alex Kosiaris, Brandon Black,
and Jaime Crespo), services team, OSM community and engineers, and the
Mapnik and Mapbox teams. The project would not have completed so fast
Recently, the Team Practices Group agreed to a set of norms around how that
team will use IRC.
Would it be helpful for Discovery to agree on its own IRC norms? They could
end up being quite different from what TPG decided on. But whatever we
decide on, it seems like it would be helpful to know that we're all on the
same page. Especially as we bring on new team members.
Agile Coach, Wikimedia Foundation
We've had a lot of ideas floating around over the past week or two about
what to do in the final weeks of the quarter towards tackling the zero
results rate problem. This morning the engineering team had a 25 minute
meeting to coalesce these ideas into a plan and sync up. We took notes in
this etherpad: https://etherpad.wikimedia.org/p/nextupforsearch
The short summary of the meeting is that we will run a test that relaxes the
AND operator for common terms in queries. This should improve
natural language queries by reducing how important words like "the", "a",
etc. are to the query, thus focussing in on the essence of the query. This
also means that pages that don't contain these common terms, but only
contain the core terms, could now be returned in results.
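A toy illustration of the matching change described above, outside of any search engine. The fixed set of common terms is an assumption made for the example; a real engine would more likely decide which terms are common from index statistics, and this sketch ignores scoring entirely.

```python
def matches_strict(query_terms, doc_terms):
    """Plain AND: every query term must appear in the document."""
    return all(term in doc_terms for term in query_terms)


def matches_relaxed(query_terms, doc_terms, common_terms):
    """Relaxed AND: only the non-common query terms are required."""
    required = [term for term in query_terms if term not in common_terms]
    if not required:
        # A query made only of common terms falls back to strict matching.
        return matches_strict(query_terms, doc_terms)
    return all(term in doc_terms for term in required)


# Hypothetical list of common terms, for illustration only.
COMMON_TERMS = {"the", "a", "of", "is", "what"}
```

With the query `what is the capital of france` and a document containing only `capital`, `france`, and `paris`, strict matching rejects the document while relaxed matching accepts it.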
This work is tracked in the following series of tasks, the structure of
which should now be very familiar to you all:
- T112178 <https://phabricator.wikimedia.org/T112178>: Relax 'AND'
operator with the common term query
- T112581 <https://phabricator.wikimedia.org/T112581>: Run A/B test on
relaxing AND operator for search (test starting on 2015-09-22)
- T112582 <https://phabricator.wikimedia.org/T112582>: Validate data for
AND operator A/B test (on or after 2015-09-23)
- T112583 <https://phabricator.wikimedia.org/T112583>: Analyse results
of AND operator A/B test (on or after 2015-09-29)
What this does mean is that we've probably got a bunch of tests lined up to
start at the same time. In principle this isn't a problem, but if the tests
overlap it can cause difficulties. This will be discussed in tomorrow's
As always, if there are any questions, let me know!
Lead Product Manager, Discovery
I've done further analysis on the ~1400 zero-results non-DOI query corpus,
looking at the effects of perfect (or at least human-level) language
detection, and the effects of running all queries against many wikis.
> More than 85% of failed queries to enwiki are in English, or are not in a
> particular language. Only about 35% of non-English queries in some language
> (<4.5% of zero-results queries), if funneled to the right language wiki,
> get any results.
>
> The types of queries most likely to get results from the non-enwikis are
> names and queries in English. There are lots of English words in
> non-English wikis (enough that they can do decent spelling correction!),
> and the idiosyncrasies of language processing on other wikis allow certain
> classes of typos in names and English words to match, or the typos happen
> to exist uncorrected in the non-enwiki.
>
> Perhaps a better approach to handling non-English queries is user-specified
> alternate languages.
Software Engineer, Discovery