As part of https://phabricator.wikimedia.org/T153282 a new style for
Wikimedia maps is being developed, and I've loaded the whole planet
onto one of my test servers as a demo.
The demo is available at http://legolas.paulnorman.ca:6789/, and through
"Compare" on the right-hand side of the interface you can compare it
with the current Wikimedia style, OpenStreetMap Carto, and lots of
others. Some other things to be aware of when comparing are:
- The map is displayed with Kosmtik, a design tool with minimal caching,
and it might be restarted while I'm working on it
- Even though the server is faster than production, it may appear slower
because it doesn't have everything cached
- The OSM data on the server is normally within a day of the latest data
Some of the more noticeable style changes are:
- Road colours are different, making the overall layout of the road
network easier to see.
- There are fewer cases of subtly different shades of green.
- Bridges and multi-level road structures are now handled properly,
which should make some areas easier to interpret.
I am particularly interested in feedback on
- the overall colour darkness and intensity,
- which of city, region, and country labels are most important.
Feedback is welcome, either through email, phab tickets, or by IRC in
#wikimedia-interactive on freenode.
Over the past few years, my anecdotal impression is that search results
from Wikipedia have become less and less prominent when I use major web
search engines.
I'm aware that Discovery is working on internal search features including
cross-project search, and that WMF people working on readership are trying
to increase the dwell time and number of pages that Wikipedia visitors
spend on Wikipedia. Has anyone analyzed trends for web search engine
rankings of Wikipedia articles, particularly over the last few years? Also,
is anyone analyzing what would be required to increase the rankings of
Wikipedia articles (and information from sister sites, such as Wikisource
and Commons) when people use web search engines?
Forwarding to the discovery mailing list, as the outcome of this research
might be extremely valuable for search.
---------- Forwarded message ----------
From: Morten Wang <nettrom(a)gmail.com>
Date: Wed, Apr 19, 2017 at 1:17 AM
Subject: [Wiki-research-l] Project exploring automated classification of
article importance
To: Research into Wikimedia content and communities
I am currently working with Aaron Halfaker and Dario Taraborelli at the
Wikimedia Foundation on a project exploring automated classification of
article importance. Our goal is to characterize the importance of an
article within a given context and design a system to predict a relative
importance rank. We have a project page on meta and welcome comments or
thoughts on our talk page. You can of course also respond here on
wiki-research-l, or send me an email.
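For a concrete (if toy) picture of what "predicting a relative
importance rank" could look like, here is a minimal sketch; the
features, labels, and model choice are invented for illustration and
are not the project's actual design.

# Toy sketch: predict an article-importance class from simple signals.
# Features and labels are hypothetical, chosen only for illustration.
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical features per article: [log(inlinks), log(pageviews)].
# Hypothetical labels: 0 = Low, 1 = Mid, 2 = High, 3 = Top importance.
X = [[2.1, 5.3], [4.7, 8.9], [1.0, 3.2], [6.2, 10.1], [3.3, 6.0], [5.1, 9.2]]
y = [1, 2, 0, 3, 1, 2]

model = GradientBoostingClassifier().fit(X, y)

# A relative rank can come from ordering articles by predicted importance.
print(model.predict([[5.5, 9.8]]))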
Before moving on to model-building I did a fairly thorough literature
review, finding a myriad of papers spanning several disciplines. We have a
draft literature review also up on meta, which should give you a
reasonable introduction to the topic. Again, comments or thoughts (e.g.
papers we’ve missed) on the talk page, mailing list, or through email are
welcome.
[[User:Nettrom]] aka [[User:SuggestBot]]
We seem to have some consensus that for the upcoming learning to rank work
we will build out a python library to handle the bulk of the backend data
plumbing work. The library will primarily be code integrating with pyspark
to do various pieces such as:
# Sampling from the click logs to generate the set of queries + pages that
will be labeled with click models
# Distributing the work of running click models against those sampled data
sets
# Pushing queries we use for feature generation into kafka, and reading
back the resulting feature vectors (the other end of this will run those
generated queries against either the hot-spare elasticsearch cluster or the
relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into
test/train/validate sets, and writing out files formatted for whichever
training library we decide on (xgboost, lightgbm and ranklib are in the
running currently; see the sketch after this list)
# Whatever plumbing is necessary to run the actual model training and do
hyperparameter optimization
# Converting the resulting models into a format suitable for use with the
elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline
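As a rough sketch of that merge/split/write step (the record layout and
the 80/10/10 split here are assumptions, not decided details), the
SVMlight-style "qid" text format that ranklib reads, and that the other
candidates can also work with, looks like this:

# Minimal sketch: split labeled feature vectors into train/test/validate
# sets and write SVMlight-style "label qid:N 1:v1 2:v2 ..." files.
# Record layout and split ratios are assumptions for illustration.
import random

def write_qid_file(path, records):
    # records: iterable of (label, qid, [feature, ...]) tuples
    with open(path, "w") as f:
        for label, qid, features in records:
            feats = " ".join("%d:%g" % (i + 1, v)
                             for i, v in enumerate(features))
            f.write("%d qid:%d %s\n" % (label, qid, feats))

def split_by_query(records, seed=0):
    # Keep all rows for a query together so no query spans two splits.
    qids = sorted({qid for _, qid, _ in records})
    random.Random(seed).shuffle(qids)
    n = len(qids)
    train_q = set(qids[:int(0.8 * n)])
    test_q = set(qids[int(0.8 * n):int(0.9 * n)])
    splits = {"train": [], "test": [], "validate": []}
    for rec in records:
        key = ("train" if rec[1] in train_q
               else "test" if rec[1] in test_q else "validate")
        splits[key].append(rec)
    return splits

# Example: two queries, three result pages each, relevance labels 0-3.
data = [(3, 1, [0.9, 0.1]), (1, 1, [0.4, 0.3]), (0, 1, [0.1, 0.8]),
        (2, 2, [0.7, 0.2]), (0, 2, [0.2, 0.5]), (1, 2, [0.5, 0.4])]
for name, recs in split_by_query(data).items():
    write_qid_file(name + ".txt", recs)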
The high level goal is that we would have relatively simple python scripts
in our analytics repository that are called from oozie; those scripts would
know the appropriate locations to load/store data and hand off to this
library for the bulk of the processing. There will also be some script,
probably within the library, that combines many of these steps for feature
engineering purposes, taking some set of features and running the whole thing.
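The shape of one of those thin driver scripts might be roughly as
follows; every name here is a placeholder, not the real repository
layout or library API.

# Hypothetical driver script of the kind oozie would invoke; the paths
# and the run_pipeline entry point are placeholders for illustration.
import sys

def run_pipeline(input_path, output_path):
    # Stands in for the future library; the real version would hand the
    # paths to pyspark jobs implementing the steps listed above.
    print("processing %s -> %s" % (input_path, output_path))

def main(date):
    # The script only knows *where* data lives; *how* lives in the library.
    click_log = "hdfs:///wmf/data/click_logs/%s" % date           # assumed path
    labeled_out = "hdfs:///user/analytics/ltr/labeled/%s" % date  # assumed path
    run_pipeline(click_log, labeled_out)

if __name__ == "__main__":
    main(sys.argv[1])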
So, what do we call this thing? Horrible first attempts:
* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
tl;dr: Search continues to expand functionality by displaying more
information on the search results page
Ever started searching for something on Wikipedia and wondered—*really*, is
that all that there is? Does it feel like you’re somehow playing hide and
seek with all the knowledge that’s out there? And...wouldn’t it be great to
see articles or categories that are similar to your search query and maybe
some related images or links to other languages in which to read that
article? Or, maybe you just want to read and contribute to projects other
than Wikipedia but need a jump start with a few short summaries from sister
projects?
The Discovery Search team has been testing out some really cool new
features that will enable some fun and fascinating clicking—down the rabbit
hole of Wikipedia. But first, let’s recap what we’ve been doing recently.
We've been doing tons of work creating, updating, and finessing the search
back end to enhance search queries. A lot of complex work has gone in:
adding ASCII-folding and stemming, detecting when a visitor might be typing
in a language different from the Wikipedia they are on, switching from
tf-idf to BM25, dropping trailing question marks, and upgrading to
Elasticsearch 5.
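To make a couple of those concrete, here is a minimal, hypothetical
sketch (not the actual CirrusSearch configuration) of an Elasticsearch 5
index whose analyzer applies ASCII-folding and English stemming; the
index, type, and analyzer names are made up.

# Hypothetical index settings showing asciifolding + stemming; names are
# invented and this is not how CirrusSearch actually configures things.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster on localhost:9200

es.indices.create(index="demo_search", body={
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "folded_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase, strip diacritics ("café" -> "cafe"), then stem
                    "filter": ["lowercase", "asciifolding", "english_stemmer"]
                }
            }
        }
    },
    "mappings": {
        "page": {
            # Elasticsearch 5 scores text with BM25 by default, matching
            # the tf-idf -> BM25 switch mentioned above.
            "properties": {
                "text": {"type": "text", "analyzer": "folded_english"}
            }
        }
    }
})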
We have much more planned in the coming months—machine learning with
‘learning to rank’, investigating and deploying new language analyzers,
and, after exhaustive analysis, removing quotes within queries by
default. We’ll also be working closely with the new
Structured Data team in their brand new work on Commons.
We also want to improve the part that our readers and editors interface
with: the search results page! We started brainstorming during the late
summer of 2016 on what we could do to make search results better—to easily
find interesting, relevant content and to create a more intuitive viewing
experience. We designed and refined numerous ideas on how to improve
the search results page and received lots of good feedback from the
community.
Empowered by the feedback, we began testing starting with a display of
results from the Wikimedia sister projects next to the regular search
results. The idea for this test was to enable discovery into other
projects—projects that our visitors might not have known about—by
displaying interesting results in small snippets. The sidebar display of
the sister projects borrows from a similar feature in use on the Italian,
Catalan and French Wikipedias. We've run two A/B tests on the sister
project search results, analyzed them in detail, and, after some final
touches to the code, we will release the new functionality into production
on all Wikipedias near the end of April 2017.
Our next A/B test will add additional information and related results
for each search query. This will take the form of an ‘explore similar’
link: when someone interacts with it, an expanded display will appear
with related pages, categories and links to the article in other
languages, all of which might lead to further knowledge discovery. We
know that not every search query will return exactly what folks were
looking for, but we feel that adding links to similar, but related
information would be helpful and, possibly, super interesting!
We also plan on doing a few more A/B tests in the coming year:
* Test a new display that will show the pronunciation of a word with its
definition and part of speech—all from existing data in Wiktionary.
Initially this will be in English only.
* Test placing a small image (from the article) next to each search result
that is displayed on the page.
* Test an additional feature using a new autocompletion metadata display in
the search box that is located at the top right of most pages in Wikipedia,
similar to what happens on the Wikipedia.org portal.
For the more technically minded, there is a way to test out these new
features in your own browser. Displaying the sister project search results
requires a bit of URL manipulation, but for the explore similar and
Wiktionary widgets, you can modify your common.js file to test an early
version of the features.
version of the features. Detailed information is available on
Once the testing, analysis and feedback cycle is done for each new feature,
we’d like to slowly implement them into production on all Wikipedias
throughout the rest of the year. We’re really hoping that these
enhancements to how search works will further the usefulness of search and
make our readers and editors more productive.
Cheers from the Discovery Search team!
Product Manager, Discovery
On Wed, Apr 5, 2017 at 12:55 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
> Link to code?
> No code yet, although there is proof-of-concept code which will inform
this work at
> "ltr" means "left to right" to me. Maybe you could do something like
> Sounds like LTR is out as the term is already used elsewhere and is more
widely known. LTRank isn't a bad compromise with spelling out the whole
name.
> On Wed, Apr 5, 2017 at 2:28 PM, Erik Bernhardson <
> ebernhardson(a)wikimedia.org> wrote:
>> We seem to have some consensus that for the upcoming learning to rank
>> work we will build out a python library to handle the bulk of the backend
>> data plumbing work. The library will primarily be code integrating with
>> pyspark to do various pieces such as:
>> # Sampling from the click logs to generate the set of queries + pages
>> that will be labeled with click models
>> # Distributing the work of running click models against those sampled
>> data sets
>> # Pushing queries we use for feature generation into kafka, and reading
>> back the resulting feature vectors (the other end of this will run those
>> generated queries against either the hot-spare elasticsearch cluster or the
>> relforge cluster to get feature scores)
>> # Merging feature vectors with labeled data, splitting into
>> test/train/validate sets, and writing out files formatted for whichever
>> training library we decide on (xgboost, lightgbm and ranklib are in the
>> running currently)
>> # Whatever plumbing is necessary to run the actual model training and do
>> hyperparameter optimization
>> # Converting the resulting models into a format suitable for use with the
>> elasticsearch learn to rank plugin
>> # Reporting on the quality of models vs some baseline
>> The high level goal is that we would have relatively simple python
>> scripts in our analytics repository that are called from oozie, those
>> scripts would know the appropriate locations to load/store data and pass
>> into this library for the bulk of the processing. There will also be some
>> script, probably within the library, that combines many of these steps for
>> feature engineering purposes to take some set of features and run the whole
>> thing.
>> So, what do we call this thing? Horrible first attempts:
>> * ltr-pipeline
>> * learn-to-rank-pipeline
>> * bob
>> * cirrussearch-ltr
>> * ???
After some discussion with David, we realised that the Cirrus /
Elasticsearch switch is already more automated than we thought.
Cirrus is configured to talk to the local Elasticsearch cluster, so if
we start serving MediaWiki traffic from codfw, those MediaWiki
instances should contact the Elasticsearch codfw cluster.
We do have the ability to change that configuration and force the use
of a specific cluster. That's what we did during the previous datacenter
switch, and what we already do for some maintenance operations (yes,
major upgrades of Elasticsearch do require downtime, so we use codfw
during those upgrades).
Since we have already tested a manual DC switch quite a few times, it
is time to check that this automatic switch works as it should. The
only downside is that it increases the number of moving parts during
the MediaWiki switch.
On a last note, lots of thanks and praise to David and Erik, who
thought ahead much more than I did and implemented those nice features!
Operations Engineer, Discovery
UTC+2 / CEST