Late last week, while looking over our existing scoring methods, I was
thinking that while counting incoming links is nice, a couple of guys
dominated search with (among other things) a better way to judge the
quality of incoming links: PageRank.
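For anyone who hasn't seen it, the core idea fits in one formula: a page's
score is a damped sum of its linkers' scores, with each linker's score split
evenly across its outgoing links. With damping factor d (usually 0.85) over
N pages:

    PR(p) = (1 - d)/N + d * sum_{q in In(p)} PR(q) / |Out(q)|

where In(p) is the set of pages linking to p and Out(q) is the set of pages
q links out to.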
PageRank takes a very simple input: a list of all the links between pages.
We happen to already store all of these in Elasticsearch. I wrote a few
scripts to extract the full enwiki graph (~400M edges), ship it over to
stat1002, load it into Hadoop, and crunch it with a few hundred cores. The
end result is a score for every NS_MAIN page in enwiki based on the quality
of its incoming links.
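To make that concrete, here's a toy single-machine sketch of the same
power-iteration computation in TypeScript. The real job ran on Hadoop over
the ~400M-edge graph; everything below is illustrative rather than the
actual scripts:

    // Toy PageRank by power iteration over an edge list.
    // edges: [from, to] pairs; returns a score per page id.
    function pageRank(
      edges: [string, string][],
      d = 0.85,
      iters = 20,
    ): Map<string, number> {
      const outDegree = new Map<string, number>();
      const pages = new Set<string>();
      for (const [from, to] of edges) {
        pages.add(from);
        pages.add(to);
        outDegree.set(from, (outDegree.get(from) ?? 0) + 1);
      }
      const n = pages.size;
      let rank = new Map<string, number>();
      for (const p of pages) rank.set(p, 1 / n); // uniform start

      for (let i = 0; i < iters; i++) {
        const next = new Map<string, number>();
        for (const p of pages) next.set(p, (1 - d) / n); // teleport term
        for (const [from, to] of edges) {
          // each page passes its rank to its outlinks, split evenly
          const share = (d * rank.get(from)!) / outDegree.get(from)!;
          next.set(to, next.get(to)! + share);
        }
        rank = next;
      }
      return rank;
    }

(This toy version lets rank leak out of dangling pages with no outlinks;
real implementations redistribute it.)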
I've taken these calculated PageRank scores and used them as the scoring
method for search-as-you-type on http://en-suggesty.wmflabs.org.
Overall this seems promising as another scoring metric to integrate into
our search results. I'm not sure yet how to answer questions like how much
weight PageRank should have in the overall score; this might be yet another
place where building out our relevance lab would enable us to make more
informed decisions.
Overall I think some sort of pipeline from Hadoop into our scoring system
could be quite useful. The initial idea is to crunch data in Hadoop, stuff
it into a read-only API, and then query it back out at indexing time so the
scores are held within the Elasticsearch docs. I'm not sure what the best
approach will be, but having a simple and repeatable way to calculate
scoring info in Hadoop and ship it into ES will probably become more and
more important.
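To show the shape of that idea, here's a hypothetical sketch of the
indexing-time lookup. Nothing like this exists yet; the endpoint and the
pagerank field are both made up:

    // Hypothetical indexing-time enrichment: look up a precomputed score
    // from a read-only API and attach it to the doc before it goes to ES.
    async function enrichDoc(doc: { pageId: number }): Promise<object> {
      // scores.svc.example stands in for wherever the Hadoop output lands
      const resp = await fetch(`https://scores.svc.example/pagerank/${doc.pageId}`);
      const { score } = await resp.json();
      return { ...doc, pagerank: score ?? 0 }; // held within the ES doc
    }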
Hello subscribers of wikimedia-search,
This list has been renamed to "discovery", as requested in
https://phabricator.wikimedia.org/T110256
This is to let you know and, at the same time, to test that everything worked.
You will see that the listinfo page
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search is also
forwarded to the new name.
All config options and subscribers (with their passwords) have been
imported from the old list to the new list. Archives have been regenerated
from the .mbox file.
Best regards,
Daniel
--
Daniel Zahn <dzahn(a)wikimedia.org>
Operations Engineer
Hello subscribers of wikimedia-search,
This list has been renamed to "discovery", as requested in
https://phabricator.wikimedia.org/T110256
This is to let you know and, at the same time, to test that everything worked.
I am mailing the _old_ list address on purpose, to test that mail sent to
it is also forwarded as intended. Please start using discovery@lists though.
You will see that the listinfo page
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search is also
forwarded to the new name.
All config options and subscribers (with their passwords) have been
imported from the old list to the new list. Archives have been regenerated
from the .mbox file.
As said above, the old email address of the list also still works; it has
been added as an "acceptable alias" in the list config.
Best regards,
Daniel
--
Daniel Zahn <dzahn(a)wikimedia.org>
Operations Engineer
Hi All,
Why do people use Google instead of Wikipedia search? Two obvious answers
come to mind: Google gives better results, and users are just used to using
Google because it's useful.
So I set out to see how search on Wikipedia compares to Google for queries
we can recover from referrals from Google.
Disclaimers: we don't know what personalized results people got, whether
they liked the result, or what they intended to search for; all we have is
the wiki page they landed on. Also, results vary depending on which Google
you start from—which I didn't consider until after the experiments and
analysis were underway.
Summary: for about 60% of queries, Wikipedia search does fine. (And about a
quarter of all searches are exact matches for Wikipedia article titles.)
Trouble areas identified include: typos in the first two characters,
question marks, abbreviations and other ambiguous terms, quotes, questions,
formulaic queries, and non-Latin diacritics.
I have a list of about 20 suggestions for projects from small to enormous
that we could tackle to improve results (plus another plug for a Relevance
Lab!).
Best factoid: someone searched for *what is hummus* and ended up on the
wiki page for Hillary Clinton.
Full details here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Why_People_Use_Searc…
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hi Discovery team,
the Gerrit Cleanup Day on Wed 23rd is approaching fast - only one week
left. More info: https://phabricator.wikimedia.org/T88531
Do you feel prepared for the day, and do all team members know what to do?
If not, what are you missing, and how can we help?
Some Gerrit queries for each team are listed under "Gerrit queries per
team/area" in https://phabricator.wikimedia.org/T88531
Are they helpful and a good start? Or do they miss some areas (or do you
have existing Gerrit team queries to use instead or to integrate, e.g. for
parts of MediaWiki core you might work on)?
Also, which person will be the main team contact for the day (and
available in #wikimedia-dev on IRC) and help organize review work in
your areas, so other teams could easily reach out?
Some teams have emptier plates than others and are wondering where and how
to lend a helping hand (and would like to find out in advance, due to
timezones).
Thanks for your help in making the Gerrit Cleanup Day a success!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
The PHP engine used in production by the WMF, HHVM, has built-in support
for shared (non-preemptive) concurrency via the async/await keywords[1][2].
Over the weekend I spent some time converting the Elastica client library
we use to work asynchronously, which would essentially let us continue
performing other calculations in the web request while network requests are
processing. I've only ported over the client library[3], not the
CirrusSearch code. It's also not a complete port: a couple of code paths
work, but most of the test suite still fails.
The most obvious place we could see a benefit from this is when multiple
queries are issued to Elasticsearch from a single web request. If the
second query doesn't depend on the results of the first, it can be issued
in parallel. This is actually a somewhat common use case, for example doing
a full-text and a title search in the same request. I'm wary of guessing at
the actual latency reduction we could expect, but it's maybe on the order
of 50 to 100 ms in cases where we currently perform requests serially and
have enough other work to process. Really, it's hard to say at this point.
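For illustration, here's roughly what that parallel pattern looks like. I'm
sketching it in TypeScript rather than Hack, since the async/await
semantics are very close and my port isn't in a quotable state; the index
names and query bodies are made up:

    // Sketch: two independent Elasticsearch queries issued concurrently
    // instead of serially. Endpoint and index names are placeholders.
    async function elasticQuery(index: string, body: object): Promise<unknown> {
      const resp = await fetch(`http://localhost:9200/${index}/_search`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(body),
      });
      return resp.json();
    }

    async function searchBoth(term: string) {
      // Both requests start before either is awaited, so the two network
      // round-trips overlap; a serial version pays for them back to back.
      const fullText = elasticQuery('enwiki_content', { query: { match: { text: term } } });
      const titles = elasticQuery('enwiki_content', { query: { match: { title: term } } });
      return Promise.all([fullText, titles]);
    }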
In addition to making some existing code faster, having the ability to do
multiple network operations asynchronously opens up other possibilities for
things we implement in the future. In closing: this currently isn't going
anywhere, it was just something interesting to toy with, but I think it
could be quite interesting to investigate further.
[1] http://docs.hhvm.com/manual/en/hack.async.php
[2] https://phabricator.wikimedia.org/T99755
[3] https://github.com/ebernhardson/Elastica/tree/async
Cross-posting to discovery.
---------- Forwarded message ----------
From: Tomasz Finc <tfinc(a)wikimedia.org>
Date: Thu, Sep 17, 2015 at 12:26 PM
Subject: Announcing the launch of Maps
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Cc: Yuri Astrakhan <yastrakhan(a)wikimedia.org>, Max Semenik <
msemenik(a)wikimedia.org>
The Discovery Department has launched an experimental tile and static maps
service available at https://maps.wikimedia.org.
Using this service you can browse and embed OpenStreetMap-based map tiles
in your own tools. Currently, we handle traffic from *.wmflabs.org and
*.wikivoyage.org (the referrer header must be either missing or set to
these values), but we would like to open it up to Wikipedia traffic if we
see enough use. Our hope is that this service fits the needs of the
numerous maps developers and tool authors who have asked for a WMF-hosted
tile service, with an initial focus on Wikivoyage.
We'd love for you to try our new service, experiment with writing tools
using our tiles, and give us feedback <https://www.mediawiki.org/wiki/Talk:Maps>.
If you've built a tool using OpenStreetMap-based imagery then using our
service is a simple drop-in replacement.
Getting started is as easy as
https://www.mediawiki.org/wiki/Maps#Getting_Started
How can you help?
* Adapt your labs tool to use this service - for example, use the Leaflet
JS library and point it at https://maps.wikimedia.org
* File bugs in Phabricator
<https://phabricator.wikimedia.org/tag/discovery-maps-sprint/>
* Provide us feedback to help guide future features
<https://www.mediawiki.org/wiki/Talk:Maps>
* Improve our map style <https://github.com/kartotherian/osm-bright.tm2>
* Improve our data extraction
<https://github.com/kartotherian/osm-bright.tm2source>
Based on usage and your feedback, the Discovery team
<https://www.mediawiki.org/wiki/Discovery> will decide how to proceed.
We could add more data sources (both vector and raster), work on additional
services such as static maps or geosearch, work on supporting all
languages, switch to client-side WebGL rendering, etc. Please help us
decide what is most important.
https://www.mediawiki.org/wiki/Maps has more about the project and related
Maps work.
== In Depth ==
Tiles are served from https://maps.wikimedia.org, but can only be accessed
from subdomains of *.wmflabs.org and *.wikivoyage.org. Kartotherian can
produce tiles as images (png) and as raw vector data (PBF Mapbox format or
json):
.../{source}/{zoom}/{x}/{y}[@{scale}x].{format}
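As a quick consumer-side sketch (TypeScript with Leaflet; the osm-intl
source name is taken from the static-image example below, the rest is
standard Leaflet usage):

    import * as L from 'leaflet';

    // Point Leaflet at the experimental tile service. Note the referrer
    // restriction: tiles are only served to *.wmflabs.org and
    // *.wikivoyage.org (or to requests with no referrer at all).
    const map = L.map('map').setView([42, -3.14], 4);
    L.tileLayer('https://maps.wikimedia.org/osm-intl/{z}/{x}/{y}.png', {
      attribution: 'Map data (c) OpenStreetMap contributors',
    }).addTo(map);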
Additionally, Kartotherian can produce snapshot (static) images of any
location, scaling, and zoom level with
.../{source},{zoom},{lat},{lon},{width}x{height}[@{scale}x].{format}.
For example, to get an image centered at 42,-3.14, at zoom level 4, size
800x600, use https://maps.wikimedia.org/img/osm-intl,4,42,-3.14,800x600.png
(copy/paste the link, or else it might not work due to the referrer
restriction).
Do note that the static feature is highly experimental right now.
We would like to thank WMF Ops (especially Alex Kosiaris, Brandon Black,
and Jaime Crespo), the Services team, the OSM community and engineers, and
the Mapnik and Mapbox teams. The project would not have been completed so
quickly without you.
Thank You
--tomasz
Recently, the Team Practices Group agreed to a set of norms around how that
team will use IRC[1].
Would it be helpful for Discovery to agree on its own IRC norms? They could
end up being quite different from what TPG decided on. But whatever we
decide on, it seems like it would be helpful to know that we're all on the
same page, especially as we bring on new team members.
Thoughts?
[1] https://www.mediawiki.org/wiki/Team_Practices_Group/Team_Norms/IRC_Norms
Kevin Smith
Agile Coach, Wikimedia Foundation