Hi Paul,
Thanks so much for all your hard work, dedication to the maps project, and
your devotion to documenting as your time with the Foundation is ending. It
was great to work with you to create and refine the new map style; I
sincerely hope we can get it deployed someday.
Cheers and see ya around the map,
Deb
On Tue, Jul 24, 2018 at 2:03 PM Paul Norman <paul(a)paulnorman.ca> wrote:
> The current WMF map rendering software is already vector tile based. The
> work was about disputed borders, changing the vector tile schema, and
> cartography improvements.
>
> MapTiler and openmaptiles aren't really that related. I went over
> openmaptiles in the section on vector schema lessons learned.
>
>
> On Jul 24, 2018 2:58 AM, Naveen Francis <naveenpf(a)wikimedia.in> wrote:
>
> Sad to hear that WMF is not deploying the vector map now.
> For WMF, 'internal politics' is the key for any projects they take up.
>
> Is this an implementation of the vector style?
> https://www.maptiler.com/cloud/#streets/kn/vector/7.18/77.534/19.724
>
> https://openmaptiles.org/schema/
> The vector tile schema has been developed by Klokan Technologies GmbH
> <https://www.klokantech.com/> and was initially modelled after the
> cartography of the Carto Basemap Positron
> <https://carto.com/location-data-services/basemaps/>. The vector tile
> schema has been refined and improved in cooperation with the Wikimedia
> Foundation <https://www.mediawiki.org/wiki/Maps> and is heavily
> influenced by Paul Norman's many years of experience creating maps from
> OpenStreetMap.
>
>
>
> Thanks,
> Naveen Francis
> <http://wikibooks.in>
>
> On Mon, Jul 23, 2018 at 4:56 AM, Paul Norman <paul(a)paulnorman.ca> wrote:
>
> For some time I’ve been working as a contractor developing a new style and
> vector tile schema for the Wikimedia Foundation. It’s been completed but
> not deployed for several months. As my contract finishes this month and
> Wikimedia Foundation leadership has decided not to deploy the new map
> styles, I’m writing up the technical lessons learned from my experience on
> the style. I’m not going to be discussing the organizational factors that
> led to the decision, but looking at how I’d code things differently if
> starting over.
>
> *Overview*
>
> A complete map style consists of three parts: The database loading rules,
> the feature selection rules, and the styling rules. For a style written in
> the languages used by the WMF stack, these are expressed in osm2pgsql
> instructions, a tm2source project with SQL, and CartoCSS. The first tells
> you how to get the data into the database, the second defines what data
> goes into the vector tiles, and the third defines how features in the vector
> tiles are drawn. What goes in the vector tiles is also known as the “schema”
> and can be expressed in terms of what features appear and when, e.g.
> secondary roads first appear on zoom 12. For increased confusion, the
> database also has a “schema”, both of which are distinct from a PostgreSQL
> “SCHEMA.”
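>
> As a purely conceptual illustration of how the three parts fit together (the
> names, commands, SQL, and CartoCSS below are made up for this sketch, not the
> actual WMF configuration):
>
>     style = {
>         # 1. database loading: osm2pgsql rules turn OSM tags into tables
>         "loading": "osm2pgsql --style example.style planet.osm.pbf",
>         # 2. feature selection: SQL in a tm2source project decides what goes
>         #    into the vector tiles and at which zoom (the "schema")
>         "roads_layer_sql": "SELECT way, highway FROM planet_osm_line "
>                            "WHERE highway = 'secondary'",  # from zoom 12 up
>         # 3. styling: CartoCSS decides how those features are drawn
>         "roads_carto": "#roads[highway='secondary'] { line-width: 1.2; }",
>     }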
>
> In the current style, the parts are the osm2pgsql C transforms,
> osm-bright.tm2source, and WMF’s fork of osm-bright.tm2. In the new style,
> the parts are ClearTables + osm2pgsql, meddo, and brighmed.
>
> The goal with the style changes was to improve the representation of
> disputed borders, switch to a vector tile schema without a legal cloud over
> it, and make some styling improvements. In this it succeeded.
>
> *Database schema*
>
> The decision was made early on to go with ClearTables. This is an
> alternative set of rules for osm2pgsql which loads the data into many more
> tables for greater performance, easier style rules, and a bigger layer of
> abstraction between raw OSM tags and the SQL you need to write. I started
> it before my work at WMF, and only a few features were added.
>
> ClearTables does what it is designed to do, yet it was a mistake for this
> project. I still believe it is technically a better solution, but the
> advantages are not worth the costs of doing something different.
>
> The two most common database schemas are the built-in osm2pgsql “C
> transforms” and the one used by OpenStreetMap Carto. They aren’t any better
> code - with ClearTables’ test suite, ClearTables probably has fewer bugs -
> but there are many guides on how to set them up, and they require fewer
> components.
>
> Setting up the database isn’t an issue for WMF production servers, but is
> considered one of the more difficult steps for potential contributors to
> any style. Minimizing differences from other setups here helps greatly. A
> second issue is that many potential users of the style already have a
> database. I have heard from multiple people who would like to run the style
> if it could be used with their existing databases.
>
> *Static data*
>
> Map styles need some forms of “static” data loaded, such as oceans,
> low-zoom data, and borders. Normally this is done on an ad-hoc basis with a
> long, complicated shp2pgsql or ogr2ogr command, but I wrote a Python script
> that downloads the data and loads it with ogr2ogr, as well as handling all
> the SQL needed to update the data without a service interruption.
>
> This script is useful enough that I have reused it for other projects,
> which was made easy because I didn’t hard-code the files used into the
> script, but used another file to define them.
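>
> Roughly, the shape of that approach looks like the sketch below. This is a
> minimal illustration, not the actual script; the config format, file names,
> table names, and ogr2ogr options are assumptions made here for clarity.
>
>     """Sketch of a static-data loader: download each file listed in a config,
>     load it into a temporary table with ogr2ogr, then swap it into place in
>     one transaction so rendering never sees a missing table."""
>     import json
>     import subprocess
>     import urllib.request
>
>     def load(dbname, source):
>         tmp = source["table"] + "_new"
>         path, _ = urllib.request.urlretrieve(source["url"])
>         # Load into a temporary table first.
>         subprocess.check_call([
>             "ogr2ogr", "-f", "PostgreSQL", "-nln", tmp,
>             "-nlt", "PROMOTE_TO_MULTI", "-overwrite",
>             "PG:dbname=" + dbname, path,
>         ])
>         # Swap the new table in without a service interruption.
>         swap = ("BEGIN; DROP TABLE IF EXISTS {t}; "
>                 "ALTER TABLE {tmp} RENAME TO {t}; COMMIT;"
>                 ).format(t=source["table"], tmp=tmp)
>         subprocess.check_call(["psql", "-d", dbname, "-c", swap])
>
>     if __name__ == "__main__":
>         # The files to load are defined in a separate config, not hard-coded.
>         with open("external-data.json") as f:
>             for src in json.load(f)["sources"]:
>                 load("gis", src)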
>
> *Borders*
>
> One of the drivers of the work was to better display disputed borders. To
> do this, a pre-processing step was considered necessary, and I wrote the
> necessary program in C++ with libosmium. This worked, but I should have
> made more of an effort to get it packaged by Debian GIS and run on Jochen
> Topf’s OpenStreetMapData.com servers so others could use the work, which
> would encourage more developers to participate in its maintenance. I should
> also have
> given pyosmium a more detailed look.
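>
> For reference, a pyosmium version of the filtering step could look roughly
> like the sketch below. It's purely illustrative: the tag values checked and
> the idea of just collecting relation IDs are simplifications, not what the
> C++ pre-processor actually does.
>
>     """Collect the IDs of administrative and disputed boundary relations so
>     a later step can build the border geometries from them."""
>     import sys
>     import osmium
>
>     class BoundaryCollector(osmium.SimpleHandler):
>         def __init__(self):
>             super().__init__()
>             self.ids = []
>
>         def relation(self, r):
>             # Tag values here are illustrative, not an exhaustive list.
>             if r.tags.get("boundary") in ("administrative", "disputed", "claim"):
>                 self.ids.append(r.id)
>
>     if __name__ == "__main__":
>         handler = BoundaryCollector()
>         handler.apply_file(sys.argv[1])  # e.g. a planet or extract .osm.pbf
>         for rel_id in handler.ids:
>             print(rel_id)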
>
> *Vector tile schema*
>
> One of the reasons for switching to a new schema was legal threats against
> people using the Mapbox Streets schema. This meant osm2vectortiles also had
> to switch schemas at the same time. There was an effort to work with them
> to use a common schema, but it never happened because we had different
> needs. In retrospect, we should have either gone with a common schema and
> tm2source project, or done nothing in common. Either choice is valid, and
> it’s a balance of coordination work against a common development direction.
>
> It was useful to have someone external to discuss ideas with, but this
> wouldn’t have been necessary if there had been other people on the team to
> discuss them with.
>
> *Style*
>
> The original plan was to largely stick with the cartography of osm-bright.
> This changed once we got into implementation and we realized how insane
> some parts of the osm-bright cartography were, and efforts were made
> towards redoing the style.
>
> The road colours selected were from ColorBrewer2 OrRd6, with casing
> colours done by adjusting the Lch lightness and chroma. It would have been
> better to pick endpoints and generate colours using a script, similar to
> osm-carto. This would have allowed easier changes and sped up development
> by reducing the number of variables that need to be manually set.
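>
> Something along these lines is what I mean. The endpoints and step count
> below are invented for illustration, not the values from the style; the
> colour-space conversion uses the standard D65 formulas.
>
>     """Generate a road colour ramp by interpolating between two endpoints in
>     Lch space and converting each step to an sRGB hex colour."""
>     import math
>
>     def lch_to_srgb_hex(L, c, h):
>         # Lch -> Lab
>         a = c * math.cos(math.radians(h))
>         b = c * math.sin(math.radians(h))
>         # Lab -> XYZ (D65 reference white)
>         def f_inv(t):
>             return t ** 3 if t ** 3 > 0.008856 else (t - 16.0 / 116.0) / 7.787
>         fy = (L + 16.0) / 116.0
>         x = 0.95047 * f_inv(fy + a / 500.0)
>         y = 1.00000 * f_inv(fy)
>         z = 1.08883 * f_inv(fy - b / 200.0)
>         # XYZ -> linear sRGB -> gamma-encoded sRGB
>         lin = (3.2406 * x - 1.5372 * y - 0.4986 * z,
>                -0.9689 * x + 1.8758 * y + 0.0415 * z,
>                0.0557 * x - 0.2040 * y + 1.0570 * z)
>         def encode(u):
>             u = min(max(u, 0.0), 1.0)
>             u = 12.92 * u if u <= 0.0031308 else 1.055 * u ** (1 / 2.4) - 0.055
>             return round(u * 255)
>         return "#{:02x}{:02x}{:02x}".format(*(encode(u) for u in lin))
>
>     # Hypothetical (L, c, h) endpoints: light orange to dark red, six steps.
>     start, end = (90.0, 30.0, 75.0), (40.0, 70.0, 35.0)
>     for i in range(6):
>         t = i / 5.0
>         L, c, h = (s + t * (e - s) for s, e in zip(start, end))
>         print(lch_to_srgb_hex(L, c, h))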
>
> *Overall*
>
> The style was completed successfully in time, and none of the changes
> would have significantly changed that. They would have mainly made it
> easier to attract external contributors if an effort were put into that. As
> attracting external contributors wasn’t a priority, they didn’t matter.
>
>
>
> _______________________________________________
> Maps-l mailing list
> Maps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/maps-l
>
>
>
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
Happy Friday the 13th,* everyone!
I realized that I've gotten more lax about defining terms in my write ups
over time, because I am always talking about types (pre- and post-analysis)
and tokens and monolithic and unpacked analyzers, etc, etc. So, I
reorganized the Search Glossary a bit into topical sections, and added a
big section on Language Analysis, which I will point to in my write ups.
Please review the new section if you have time and interest. If you know
this stuff, please correct any errors. If you don't know this stuff, please
ask questions about anything that's unclear so I can improve it. Thanks!
The new section of the glossary is under "Language Analysis
<https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary#Language…>
".
Cheers,
—Trey
*Word of the day: friggatriskaidekaphobia
<https://en.wiktionary.org/wiki/friggatriskaidekaphobia>.
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hi everyone,
I've got an update on the NLP project selection. We've narrowed things down
to a handful of projects we could work on with a consultant, and a handful
we could work on internally.
David, Erik, and I reviewed a selection of the most promising-seeming
and/or most interesting projects and gave them a very rough cost estimate
based on how big a relative impact they would have, how technologically hard
they would be, and how difficult the UI aspect would be. The scores
are not definitive, but helped guide the discussion. You can see the list
of projects we looked at and more details of the scoring on MediaWiki
<https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potenti…>
.
For the possibility of working with an outside consultant, we also
considered how easily separated each project would be from our overall
system (making it easier for someone new to get up to speed), how projects
feed into each other, how easily we could work on projects ourselves (like,
we know pretty much what to do, we just have to do it), etc.
Our current *recommendation for an outside consultant* would be to start
with (1) *spelling correction/did you mean improvements,* with an option to
extend the project to include either (2) *"more like" suggestion
improvements,* or (3) *query reformulation mining,* specifically for typo
corrections.
For spelling correction (#1), we are envisioning an approach that
integrates generic intra-word and inter-word statistical models, optional
language-specific features, and explicit weighted corrections. We believe
we could mine redirects flagged as typo correction for explicit
corrections, and the query reformulation mining (#3) would also provide
frequency-weighted explicit corrections. Our hope is that a system built
initially for English would be readily applicable to other alphabetic
languages, most probably other Indo-European languages, based on statistics
available from Elastic; and that some elements of the system could be
applied to non-alphabetic languages and languages that are
typologically <https://en.wikipedia.org/wiki/Morphological_typology> dissimilar
to Indo-European languages.
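
As a toy illustration of the layering we have in mind (all of the words,
weights, and data structures below are invented; this is a sketch of the idea,
not a design for the actual system), explicit weighted corrections can sit in
front of a simple statistical model:

    """Toy layered spelling correction: explicit weighted corrections (e.g.
    mined from typo-fix redirects) are consulted first, then a unigram
    frequency model scores edit-distance-1 candidates."""
    from collections import Counter

    # Stand-in for a corpus-derived unigram model.
    word_freq = Counter({"einstein": 12000, "stein": 900, "albert": 8000})

    # Explicit corrections with weights, e.g. mined from redirects flagged as
    # typo fixes or from query reformulation logs.
    explicit = {"einstien": [("einstein", 0.95)]}

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(word):
        """All strings at edit distance 1 from `word`."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
        inserts = [l + c + r for l, r in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)

    def suggest(word):
        if word in word_freq:     # already a known word, leave it alone
            return word
        if word in explicit:      # explicit weighted corrections win
            return max(explicit[word], key=lambda wc: wc[1])[0]
        candidates = [w for w in edits1(word) if w in word_freq]
        return max(candidates, key=word_freq.get, default=word)

    print(suggest("einstien"))  # -> "einstein"
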
Looking at the rest of the list, (a) *wrong keyboard detection* seems like
something we should work on internally, since we already have a few good
ideas on how to approach it. (b) *Acronym support* is a pet peeve for
several members of the team, and seems to be straightforward to improve. (c)
*Automatic stemmer building* and (d) *automatic stop word* generation
aren't so much projects we should work on as things we should research to
see if there are already tools or lists out there we could use to make the
projects much easier.
Comments and questions here or on the talk page are welcome.
Cheers,
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Hi everyone,
>
> I just finished putting together an annotated list of potential
> applications of natural language processing to on-wiki search
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…>.
> There are dozens and dozens of ideas there—including many that are
> interesting but probably not practical. If you have any additional ideas,
> questions, suggestions, recommendations, or preferences, please
> share!—either on the mailing list or on the talk page.
>
> The goal is to narrow it down to one or two things to pursue over the next
> two to four quarters, along with other projects we are working on.
>
> Thanks!
> —Trey
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
>
Hello again,
This is the weekly update from the Search Platform team for the weeks
starting 2018-07-02 and 2018-07-09.
Programming Note: With the Wikimania Hackathon, Wikimania proper, and
resulting travel for folks in coming weeks, the next update will be
for the week starting 2018-07-30.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* David and Stas worked on fine-tuning the search configs in
mediawiki-config for Wikidata [0]
* Stas and Addshore helped to catch and clean up some bad lookups and
report them properly [1]
* A "Wrong document type" error was corrected by Erik by fixing
Sanitizer MetaStore integration [2]
* Erik worked on tracking queries that run on the Elasticsearch
clusters longer than both the server-side and client-side timeouts by
fixing some slow logging functionality [3]
* There was a Meta-wiki error where search suggests a non-existent title
due to a namespace/redirect mixup. Erik's note: "it's a bit awkward, but
typing Help:Glo into autocomplete on metawiki suggests 'Global
Account' from the main namespace, and selecting it takes you to
Help:Unified login" [4]
* In order to dispatch queries to a particular search setup (cirrus
defaults vs wikibase custom query builder), David created a flexible
way to classify queries, meant to replace the 'getSyntaxUsed' approach
currently in SearchContext. [5]
* Trey and a community volunteer, Athena, created a basic Mirandese
analysis chain. It was tested on RelForge and pushed into production
this week [6]. Trey kicked off, completed and tested the re-indexing
of the Mirandese Wikis [7].
* The Re-Re-Index of the Serbian Wikis after refactored plugins were
deployed has been completed [8] and the re-index of the Croatian,
Serbo-Croatian, and Bosnian Wikis was also done [9]
* We currently mix a tiny number of namespace documents into the
regular indices, which seems inefficient; so Erik built a unified
namespace index [10]
* Erik updated the 'OtherIndex' to operate on a cluster other than the
one holding the wiki [11]
* Trey updated a variety of things on the Analysis Tools with lots of
little fixes and improvements, and also fixed a few small errors in the
analysis code that conflated post-analysis types and pre-analysis
types [12]
== Did you know? ==
* The period of this status update includes Friday, July 13, 2018. The
fear of the number thirteen is called "triskaidekaphobia" [13]. There
are two words for fear of Friday the 13th: "paraskavedekatriaphobia"
[14] and "friggatriskaidekaphobia" [15]—the first maintains a
consistent etymology with the Greek word for Friday, "Paraskeví",
while the second invokes "Frigg", the Norse Goddess after whom Friday
is named in English.
[0] https://phabricator.wikimedia.org/T182717
[1] https://phabricator.wikimedia.org/T198091
[2] https://phabricator.wikimedia.org/T197446
[3] https://phabricator.wikimedia.org/T196180
[4] https://phabricator.wikimedia.org/T115756
[5] https://phabricator.wikimedia.org/T197774
[6] https://phabricator.wikimedia.org/T194941
[7] https://phabricator.wikimedia.org/T197890
[8] https://phabricator.wikimedia.org/T196404
[9] https://phabricator.wikimedia.org/T196658
[10] https://phabricator.wikimedia.org/T192699
[11] https://phabricator.wikimedia.org/T194678
[12] https://phabricator.wikimedia.org/T199273
[13] https://en.wiktionary.org/wiki/triskaidekaphobia
[14] https://en.wiktionary.org/wiki/paraskavedekatriaphobia
[15] https://en.wiktionary.org/wiki/friggatriskaidekaphobia
---
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Relations Specialist
Wikimedia Foundation
Cross-posting...!
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
---------- Forwarded message ---------
From: Yaron Koren <yaron(a)wikiworks.com>
Date: Tue, Jul 10, 2018 at 11:01 AM
Subject: [MediaWiki-l] New episode of "Between the Brackets": Stas Malyshev
To: MediaWiki announcements and site admin list <
mediawiki-l(a)lists.wikimedia.org>
Hi,
A new episode of the MediaWiki podcast "Between the Brackets" has been
released, featuring an interview with Wikimedia Foundation developer Stas
Malyshev, who works on search in both MediaWiki and Wikidata. You can
listen to the interview here:
http://betweenthebrackets.libsyn.com/episode-12-stas-malyshev
-Yaron
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l