Wikimedia-search September 2015

wikimedia-search@lists.wikimedia.org

13 participants
22 discussions

Why People Use Search Engines Instead of Wikimedia Search
by Trey Jones 22 Sep '15

Hi All,

Why do people use Google instead of Wikipedia search? Two obvious answers come to mind: Google gives better results, and users are just used to using Google 'cause it's useful. So I set out to see how search on Wikipedia compares to Google for queries we can recover from referrals from Google.

Disclaimers: we don't know what personalized results people got, whether they liked the result, or what they intended to search for; all we have is the wiki page they landed on. Also, results vary depending on which Google you start from, which I didn't consider until after the experiments and analysis were underway.

Summary: for about 60% of queries, Wikipedia search does fine. (And about a quarter of all searches are exact matches for Wikipedia article titles.) Trouble areas identified include: typos in the first two characters, question marks, abbreviations and other ambiguous terms, quotes, questions, formulaic queries, and non-Latin diacritics. I have a list of about 20 suggestions for projects from small to enormous that we could tackle to improve results (plus another plug for a Relevance Lab!).

Best factoid: someone searched for *what is hummus* and ended up on the wiki page for Hillary Clinton.

Full details here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Why_People_Use_Searc…

Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

1 0

Gerrit Cleanup Day on Wed 23rd: Are you ready?
by Andre Klapper 22 Sep '15

Hi Discovery team,

the Gerrit Cleanup Day on Wed 23rd is approaching fast - only one week left. More info: https://phabricator.wikimedia.org/T88531

Do you feel prepared for the day, and do all team members know what to do? If not, what are you missing and how can we help?

Some Gerrit queries for each team are listed under "Gerrit queries per team/area" in https://phabricator.wikimedia.org/T88531
Are they helpful and a good start? Or do they miss some areas (or do you have existing Gerrit team queries to use instead or to "integrate", e.g. for parts of MediaWiki core you might work on)?

Also, which person will be the main team contact for the day (available in #wikimedia-dev on IRC) to help organize review work in your areas, so other teams can easily reach out? Some team plates are emptier than others, so they're wondering where and how to lend a helping hand (to find out in advance, due to timezones).

Thanks for your help to make the Gerrit Cleanup Day a success!

andre

--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/

2 2

Asynchronously calling elasticsearch
by Erik Bernhardson 21 Sep '15

The PHP engine used in prod by the WMF, HHVM, has built-in support for shared (non-preemptive) concurrency via the async/await keywords[1][2]. Over the weekend I spent some time converting the Elastica client library we use to work asynchronously, which would essentially let us continue performing other calculations in the web request while network requests are processing. I've only ported over the client library[3], not the CirrusSearch code. Also, this is not a complete port: there are a couple of code paths that work, but most of the test suite still fails.

The most obvious place we could see a benefit from this is when multiple queries are issued to Elasticsearch from a single web request. If the second query doesn't depend on the results of the first, it can be issued in parallel. This is actually a somewhat common use case, for example doing a full text and a title search in the same request. I'm wary of making much of a guess in terms of the actual latency reduction we could expect, but maybe on the order of 50 to 100 ms in cases where we currently perform requests serially and we have enough work to process. Really, it's hard to say at this point.

In addition to making some existing code faster, having the ability to do multiple network operations in an async manner opens up other possibilities when we are implementing things in the future.

In closing, this currently isn't going anywhere; it was just something interesting to toy with. I think it could be quite interesting to investigate further.

[1] http://docs.hhvm.com/manual/en/hack.async.php
[2] https://phabricator.wikimedia.org/T99755
[3] https://github.com/ebernhardson/Elastica/tree/async
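To make the serial-versus-parallel point concrete, here is a minimal sketch in Python's asyncio rather than HHVM's async/await. It is not the Elastica port itself: `fake_search` is a hypothetical stand-in that only sleeps to simulate a network round trip, and the 50 ms latency is an arbitrary assumption.

```python
import asyncio
import time


async def fake_search(kind: str, latency_s: float) -> dict:
    """Hypothetical stand-in for one network round trip to a search backend."""
    await asyncio.sleep(latency_s)  # yields control while "waiting on the network"
    return {"kind": kind, "hits": [f"{kind}-result"]}


async def serial(latency_s: float) -> list:
    # The second query starts only after the first one has finished.
    full_text = await fake_search("full_text", latency_s)
    title = await fake_search("title", latency_s)
    return [full_text, title]


async def concurrent(latency_s: float) -> list:
    # Independent queries are issued together; the total wait is roughly
    # one latency instead of two.
    return list(await asyncio.gather(
        fake_search("full_text", latency_s),
        fake_search("title", latency_s),
    ))


def timed(coro_fn, latency_s: float = 0.05):
    """Run one of the coroutines above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(latency_s))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, t_serial = timed(serial)
    _, t_concurrent = timed(concurrent)
    print(f"serial: {t_serial:.3f}s, concurrent: {t_concurrent:.3f}s")
```

Both variants return the same results in the same order; only the waiting overlaps. The saving only appears when the second query really is independent of the first, which matches the full text plus title search case described above.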

7 9

Page rank
by Erik Bernhardson 21 Sep '15

Late last week, while looking over our existing scoring methods, I was thinking that while counting incoming links is nice, a couple guys dominated search with (among other things) a better way to judge the quality of incoming links, aka PageRank. PageRank takes a very simple input: it just needs a list of all links between pages. We happen to already store all of these in Elasticsearch. I wrote a few scripts to suck out the full enwiki graph (~400M edges), ship it over to stat1002, throw it into Hadoop, and crunch it with a few hundred cores. The end result is a score for every NS_MAIN page in enwiki based on the quality of incoming links.

I've taken these calculated PageRank scores and used them as the scoring method for search-as-you-type for http://en-suggesty.wmflabs.org.

Overall this seems promising as another scoring metric to integrate into our search results. I'm not sure yet how to figure out things like how much weight PageRank should have in the score. This might be yet another thing where building out our relevance lab would enable us to make more informed decisions.

Overall I think some sort of pipeline from Hadoop into our scoring system could be quite useful. The initial idea seems to be to crunch data in Hadoop, stuff it into a read-only API, and then query it back out at indexing time in Elasticsearch to be held within the ES docs. I'm not sure what the best way will be, but having a simple and repeatable way to calculate scoring info in Hadoop and ship that into ES will probably become more and more important.
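For readers unfamiliar with how a list of links turns into a score, the following is a small in-memory sketch of PageRank by power iteration. It is not the Hadoop job described above: the damping factor of 0.85, the 50 iterations, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) link pairs."""
    pages = sorted({page for edge in edges for page in edge})
    out_links = {page: [] for page in pages}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            targets = out_links[src]
            if not targets:
                continue
            # Each page passes an equal share of its rank to every page it links to.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A has one incoming link, from the well-linked C;
# E has two incoming links, from the otherwise unlinked B and D.
EDGES = [
    ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C"),
    ("B", "E"), ("D", "E"), ("E", "C"),
]

if __name__ == "__main__":
    for page, score in sorted(pagerank(EDGES).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.4f}")
```

In this toy graph A ends up scoring higher than E even though E has more incoming links, because A's single link comes from a page that is itself highly ranked. That is the difference between counting links and weighing their quality.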

1 0

Fwd: Announcing the launch of Maps
by Tomasz Finc 17 Sep '15

Cross posting to discovery

---------- Forwarded message ----------
From: Tomasz Finc <tfinc(a)wikimedia.org>
Date: Thu, Sep 17, 2015 at 12:26 PM
Subject: Announcing the launch of Maps
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Cc: Yuri Astrakhan <yastrakhan(a)wikimedia.org>, Max Semenik <msemenik(a)wikimedia.org>

The Discovery Department has launched an experimental tile and static maps service available at https://maps.wikimedia.org.

Using this service you can browse and embed map tiles into your own tools using OpenStreetMap data. Currently, we handle traffic from *.wmflabs.org and *.wikivoyage.org (referrer header must be either missing or set to these values) but we would like to open it up to Wikipedia traffic if we see enough use. Our hope is that this service fits the needs of the numerous maps developers and tool authors who have asked for a WMF hosted tile service with an initial focus on WikiVoyage. We'd love for you to try our new service, experiment with writing tools using our tiles, and give us feedback <https://www.mediawiki.org/wiki/Talk:Maps>. If you've built a tool using OpenStreetMap-based imagery then using our service is a simple drop-in replacement.

Getting started is as easy as https://www.mediawiki.org/wiki/Maps#Getting_Started

How can you help?

* Adapt your labs tool to use this service - for example, use the Leaflet JS library and point it to https://maps.wikimedia.org
* File bugs in Phabricator <https://phabricator.wikimedia.org/tag/discovery-maps-sprint/>
* Provide us feedback to help guide future features <https://www.mediawiki.org/wiki/Talk:Maps>
* Improve our map style <https://github.com/kartotherian/osm-bright.tm2>
* Improve our data extraction <https://github.com/kartotherian/osm-bright.tm2source>

Based on usage and your feedback, the Discovery team <https://www.mediawiki.org/wiki/Discovery> will decide how to proceed. We could add more data sources (both vector and raster), work on additional services such as static maps or geosearch, work on supporting all languages, switch to client-side WebGL rendering, etc. Please help us decide what is most important.

https://www.mediawiki.org/wiki/Maps has more about the project and related Maps work.

== In Depth ==

Tiles are served from https://maps.wikimedia.org, but can only be accessed from subdomains of *.wmflabs.org and *.wikivoyage.org. Kartotherian can produce tiles as images (png), and as raw vector data (PBF Mapbox format or json):

.../{source}/{zoom}/{x}/{y}[@{scale}x].{format}

Additionally, Kartotherian can produce snapshot (static) images of any location, scaling, and zoom level with

.../{source},{zoom},{lat},{lon},{width}x{height}[@{scale}x].{format}

For example, to get an image centered at 42,-3.14, at zoom level 4, size 800x600, use https://maps.wikimedia.org/img/osm-intl,4,42,-3.14,800x600.png (copy/paste the link, or else it might not work due to referrer restriction).

Do note that the static feature is highly experimental right now.

We would like to thank WMF Ops (especially Alex Kosiaris, Brandon Black, and Jaime Crespo), the services team, the OSM community and engineers, and the Mapnik and Mapbox teams. The project would not have been completed so fast without you.

Thank You

--tomasz
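The two URL templates in the "In Depth" section can be filled in mechanically. The sketch below does so under a few assumptions: the optional scale marker is written `@{scale}x` (the archive renders the at-sign as "(a)"), the static-image path sits under `/img/` as in the announcement's one worked example, and the tile path is returned relative because the message elides its prefix. The function names are hypothetical, and `osm-intl` is the only source name the message mentions.

```python
BASE_URL = "https://maps.wikimedia.org"


def scale_suffix(scale=None):
    """Optional '[@{scale}x]' part of the templates, e.g. '@2x' (assumed syntax)."""
    return f"@{scale}x" if scale else ""


def tile_path(source, zoom, x, y, fmt="png", scale=None):
    """Relative tile path: {source}/{zoom}/{x}/{y}[@{scale}x].{format}.

    The prefix is elided ('...') in the announcement, so none is added here.
    """
    return f"{source}/{zoom}/{x}/{y}{scale_suffix(scale)}.{fmt}"


def static_image_url(source, zoom, lat, lon, width, height, fmt="png", scale=None):
    """Static snapshot URL, following the /img/ example in the announcement."""
    return (
        f"{BASE_URL}/img/{source},{zoom},{lat},{lon},"
        f"{width}x{height}{scale_suffix(scale)}.{fmt}"
    )


if __name__ == "__main__":
    # Reproduces the example from the announcement:
    # https://maps.wikimedia.org/img/osm-intl,4,42,-3.14,800x600.png
    print(static_image_url("osm-intl", 4, 42, -3.14, 800, 600))
    print(tile_path("osm-intl", 4, 8, 5, scale=2))
```

As the announcement notes, requests are restricted by referrer, so a URL built this way may only load from an allowed origin or when pasted directly into the address bar.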

1 0

IRC norms
by Kevin Smith 15 Sep '15

Recently, the Team Practices Group agreed to a set of norms around how that team will use IRC[1]. Would it be helpful for Discovery to agree on its own IRC norms? They could end up being quite different from what TPG decided on. But whatever we decide on, it seems like it would be helpful to know that we're all on the same page, especially as we bring on new team members.

Thoughts?

[1] https://www.mediawiki.org/wiki/Team_Practices_Group/Team_Norms/IRC_Norms

Kevin Smith
Agile Coach, Wikimedia Foundation

5 4

Update on what's next for tackling the zero results rate goal
by Dan Garry 15 Sep '15

We've had a lot of ideas floating around over the past week or two about what to do in the final weeks of the quarter towards tackling the zero results rate problem. This morning the engineering team had a 25 minute meeting to coalesce these ideas into a plan and sync up. We took notes in this etherpad: https://etherpad.wikimedia.org/p/nextupforsearch

The short summary of the meeting is that we will run a test which relaxes the AND operator for common terms in queries. This should improve natural language queries by reducing how important words like "the", "a", etc. are to the query, thus focussing in on the essence of the query. This also means that pages that don't contain these common terms, but only contain the core terms, could now be returned in results.

This work is tracked in the following series of tasks, the structure of which should now be very familiar to you all:

- T112178 <https://phabricator.wikimedia.org/T112178>: Relax 'AND' operator with the common term query
- T112581 <https://phabricator.wikimedia.org/T112581>: Run A/B test on relaxing AND operator for search (test starting on 2015-09-22)
- T112582 <https://phabricator.wikimedia.org/T112582>: Validate data for AND operator A/B test (on or after 2015-09-23)
- T112583 <https://phabricator.wikimedia.org/T112583>: Analyse results of AND operator A/B test (on or after 2015-09-29)

What this does mean is that we've probably got a bunch of tests lined up to start at the same time. In principle this isn't a problem, but if the tests overlap it can cause difficulties. This will be discussed in tomorrow's analysis meeting.

As always, if there are any questions, let me know!

Thanks,
Dan

--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
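The effect of relaxing AND for common terms can be illustrated with a toy in-memory matcher. This is only a sketch of the idea, not Elasticsearch's common terms query or the CirrusSearch change: the documents, the function names, and the 0.5 document-frequency cutoff are all hypothetical.

```python
DOCS = {
    "doc1": "the history of the hummus dish",
    "doc2": "hummus recipe",
    "doc3": "the capital of france",
    "doc4": "the history of france",
}


def split_terms(query, doc_freq, total_docs, cutoff=0.5):
    """Split query terms into (rare, common) by the share of docs containing them."""
    terms = query.lower().split()
    common = [t for t in terms if doc_freq.get(t, 0) / total_docs > cutoff]
    rare = [t for t in terms if t not in common]
    return rare, common


def search(query, docs, relax_common=False, cutoff=0.5):
    """Return ids of docs matching the query under strict or relaxed AND."""
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    doc_freq = {}
    for words in tokenized.values():
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    rare, common = split_terms(query, doc_freq, len(docs), cutoff)
    # Strict AND requires every term. Relaxed AND requires only the rare terms,
    # unless the query consists solely of common terms.
    required = rare if (relax_common and rare) else rare + common
    return sorted(
        doc_id for doc_id, words in tokenized.items()
        if all(term in words for term in required)
    )


if __name__ == "__main__":
    print(search("the hummus recipe", DOCS))                     # []
    print(search("the hummus recipe", DOCS, relax_common=True))  # ['doc2']
```

With strict AND the query returns nothing, because the only page about a hummus recipe does not contain "the". Once the common term is no longer required, that page is returned, which is the kind of zero-results query the test is aimed at.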

1 0

Maps and KPI's
by Kevin Smith 15 Sep '15

Notes from this afternoon's Maps and KPI's meeting have been posted: https://www.mediawiki.org/wiki/Discovery/Maps_and_KPIs_2015-09-14 Those who attended can feel free to correct anything I got wrong. Kevin Smith Agile Coach, Wikimedia Foundation

1 0

Some Results of Cross-Language Wiki Searching
by Trey Jones 11 Sep '15

Hi Everyone,

I've done further analysis on the ~1400 zero-results non-DOI query corpus, looking at the effects of perfect (or at least human-level) language detection, and the effects of running all queries against many wikis.

In summary:

> More than 85% of failed queries to enwiki are in English, or are not in a
> particular language. Only about 35% of non-English queries in some language
> (<4.5% of zero-results queries), if funneled to the right language wiki,
> get any results.

> The types of queries most likely to get results from the non-enwikis are
> names and queries in English. There are lots of English words in
> non-English wikis (enough that they can do decent spelling correction!),
> and the idiosyncrasies of language processing on other wikis allow certain
> classes of typos in names and English words to match, or the typos happen
> to exist uncorrected in the non-enwiki.

> Perhaps a better approach to handling non-English queries is user-specified
> alternate languages.

More details: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_…

Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

1 0

Congratulations WDQS team
by Mikhail Popov 10 Sep '15

Yinz are popular now! Cheers~ -- *Mikhail Popov* // Data Analyst, Discovery <https://www.mediawiki.org/wiki/Wikimedia_Discovery> https://wikimediafoundation.org/ *Imagine a world in which every single human being can freely share in the **sum of all knowledge. That's our commitment.* Donate <https://donate.wikimedia.org/>.

4 3
