Howdy,
Happy to report that production[1] and development[2] sets of Discovery
Dashboards are up and running again, this time managed by Puppet. (There
was a bug with web proxies and DNS settings that delayed this
announcement.) Theoretically they should be snappier to use now because
there is no longer an extra virtualization (Vagrant) layer and they are
running directly on Labs instances.
R is a programming language and software environment mainly used for
statistical inference, machine learning, and data wrangling & visualization. RStudio's
Shiny[3] is a framework for developing web applications in R, and it's what
Discovery's dashboards are written in.
The Reading::Discovery::Analysis team (with guidance and help from
Guillaume Lederrey) is proud to announce a new module available in Ops'
Puppet repo: shiny_server[4], which installs & configures RStudio's Shiny
Server[5] for serving R/Shiny applications. The module also provides
resources for installing R packages from CRAN, GitHub, and other remote git
repositories like Gerrit. For a practical example, refer to Discovery
Dashboards base[6] and production[7] profiles.
Cheers,
Mikhail on behalf of Discovery Analysts
[1] https://discovery.wmflabs.org
[2] https://discovery-beta.wmflabs.org/
[3] https://shiny.rstudio.com/
[4] https://github.com/wikimedia/puppet/tree/production/modules/shiny_server
[5] https://www.rstudio.com/products/shiny/shiny-server/
[6]
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifes…
[7]
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifes…
Hello!
We've had a significant slowdown of elasticsearch today (see Grafana
for exact timing [1]). The impact was low enough that it probably does
not require a full incident report (the number of errors did not rise
significantly [2]), but understanding what happened and sharing that
understanding is important. This is going to be a long and technical
email; if you get bored, feel free to close it and delete it right
now.
TL;DR: elastic1019 was overloaded because it had too many heavy
shards; banning all shards from elastic1019 to force a reshuffle
allowed it to recover.
In more detail:
elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
which are all high-load shards. elastic1019 is one of our older
servers, which are less powerful, and it might also suffer from CPU
overheating [3].
The obvious question: "why do we even allow multiple heavy shards to
be allocated on the same node?". The answer is obvious as well: "it's
complicated...".
One of the very interesting features of elasticsearch is its ability to
automatically balance shards. This allows the cluster to rebalance
automatically when nodes are lost, and to spread resource usage across
all nodes in the cluster [4]. Constraints can be added to account for
available disk space [5], rack awareness [6], or even specific
filtering for specific indices [7]. It does not, however, directly
allow constraining allocation based on the load of a specific shard.
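As a concrete illustration, the "ban" from the TL;DR above is just a cluster-level allocation filter. A minimal sketch follows; the endpoint and the commented-out curl call are illustrative, but the setting name is the standard Elasticsearch one:

```shell
# Transient cluster setting that disallows shard allocation on
# elastic1019; Elasticsearch reacts by moving its shards to other nodes.
ES="${ES:-http://localhost:9200}"   # point this at your own cluster
BODY='{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic1019"
  }
}'
echo "$BODY"
# To actually apply it:
# curl -H 'Content-Type: application/json' -XPUT "$ES/_cluster/settings" -d "$BODY"
```

Resetting the exclude list to an empty value later lets shards be allocated on the node again once it has recovered.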
We do have a few mechanisms to ensure that load is as uniform as
possible across the cluster:
An index is split into multiple shards, and each shard is replicated
multiple times to provide redundancy and to spread load. Both numbers
are configured per index.
We know which indices are heavy (commons, enwiki, frwiki, ...),
both in terms of size and in terms of traffic. Those indices are split
into a number of shards+replicas close to the number of nodes in the
cluster, to ensure that the shards are spread evenly across the
cluster, with only a few shards of the same index on the same node,
while still allowing us to lose a few nodes and keep all shards
allocated. For example, enwiki_content has 8 shards, with 2 replicas
each, so a total of 24 shards, with a maximum of 2 shards on the same
node. This approach works well most of the time.
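The enwiki_content arithmetic above can be sketched as follows (numbers taken from the example; the "minimum nodes" line only shows why the shard count is chosen close to the cluster size):

```shell
# enwiki_content: 8 primary shards, each with 2 replicas.
primaries=8
replicas=2
total=$((primaries * (1 + replicas)))   # every primary plus its copies
echo "total shard copies: $total"       # 24
# With at most 2 copies of this index per node, we need at least:
min_nodes=$(( (total + 1) / 2 ))        # ceiling of total / 2
echo "minimum nodes needed: $min_nodes" # 12
```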
The limitation is that a shard is a "scalability unit": you can't move
around anything smaller than a shard. In the case of enwiki, a single
shard is ~40 GB and serves a fairly large number of requests per
second. If a node has just one more of those shards, that's already a
significant amount of additional load.
The solution could be to split large indices into many more shards;
the scalability unit would be much smaller, and it would be much
easier to achieve a uniform load. Of course, there are also
limitations. The total number of shards in the cluster has a
significant cost: increasing it will add load to cluster operations
(which are already quite expensive with the total number of shards we
have at this point). There are also functional issues: ranking (BM25)
uses statistics calculated per shard, and with smaller shards the
stats might at some point no longer be representative of the whole
corpus.
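To make that last point concrete with toy, made-up numbers: Lucene's BM25 computes idf per shard as log(1 + (N - df + 0.5) / (df + 0.5)), where N is the shard's document count and df the number of its documents containing the term. A small shard that happens to concentrate documents containing a term scores it very differently from the corpus as a whole:

```shell
# Corpus-wide: term appears in 50 of 1000 docs. One tiny shard: 5 of 10.
awk 'BEGIN {
  corpus = log((1000 - 50 + 0.5) / (50 + 0.5) + 1)  # idf over the corpus
  shard  = log((10 - 5 + 0.5) / (5 + 0.5) + 1)      # idf on the small shard
  printf "corpus idf: %.3f\n", corpus   # ~2.987
  printf "shard idf:  %.3f\n", shard    # ~0.693
}'
```

The same term looks roughly four times less "rare" on the small shard, so documents from that shard are ranked on different statistics.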
There are probably a lot more details we could get into; feel free to
ask more questions and we can continue the conversation. And I'm sure
David and Erik have a lot to add!
Thanks for reading to the end!
Guillaume
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[3] https://phabricator.wikimedia.org/T168816
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allo…
[5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-alloca…
[6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-…
[7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-alloc…
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
Hi!
> But, no results for Wikidata, the site that covers more topics than all our
> other sites?
Wikidata search, I think, may not be ready for this yet. It's
way more complicated than regular wiki search because it's a)
multilingual and b) data rather than text. We're working on it though :)
--
Stas Malyshev
smalyshev(a)wikimedia.org
Cross-posting this to the Discovery mailing list with hopes that someone
from WMF Discovery can shed some light on this situation.
Pine
On Mon, May 15, 2017 at 2:08 PM, Tom <tom(a)hutch4.us> wrote:
> I actually think there is a drop in page content results too. Searching,
> for example, for pages using a tag <FooBar>text</FooBar> would report
> content found in x pages. Now a search for <FooBar> finds no content in
> pages. A search for <FooBar no > is found on 3 pages, but I expect 50.
>
> I do want to do more testing. Rebuilding the index seems to be super fast,
> unlike before, when it would take up to a few minutes to complete.
>
> Tom
>
> > On May 15, 2017, at 10:02 AM, [[kgh]] <mediawiki(a)kghoffmeyer.de> wrote:
> >
> > Heiya,
> >
> > it's me again. :) Does somebody at least see the issue? Probably a bug
> > that should be reported?
> >
> > Thanks and cheers
> >
> > Karsten
> >
> >
> >> Am 09.05.2017 um 16:32 schrieb [[kgh]]:
> >> Heiya,
> >>
> >> I have upgraded from 1.23 to 1.27 which was now possible since the
> >> latest release.
> >>
> >> After the process I observe a changed behavior regarding the rudimentary
> >> full-text search MediaWiki provides out of the box, i.e. I am not
> >> talking about the Cirrus/Elastica duo available as an extra.
> >>
> >> When adding a search term to the search field on MW 1.27, e.g.
> >> "Lorem ipsum" (note: including the "), then only the page names of the
> >> findings are shown, and not the page names plus some text extract
> >> wrapping the searched term, as MW 1.23 did. When adding just Lorem ipsum
> >> (note: excluding the ") I get the page names and some text extract
> >> wrapping the searched term, as I did with 1.23. The results for Lorem
> >> ipsum, however, are a much worse fit than for "Lorem ipsum", so that's
> >> why I am here.
> >>
> >> Perhaps I missed some setting I now have to make or perhaps there is
> >> some script I overlooked to get things running. I'd like to get the
> >> wrapping text back. Pointers highly appreciated.
> >>
> >> Thanks for your time
> >>
> >> Karsten
> >>
> >>
> >> _______________________________________________
> >> MediaWiki-l mailing list
> >> To unsubscribe, go to:
> >> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
> >
> >
>
>
>
Hi everybody,
(With apologies for cross-posting...)
You may have seen the recent communication [1] about the product and
tech tune-up which went live the week of June 5th, 2017. In that
communication, we promised an update on the future of Discovery
projects; this email provides that update.
The Discovery team structure has now changed, but the new teams will still
work together to complete the goals as listed in the draft annual plan.[2]
A summary of their anticipated work, as we finalize these changes, is
below. We plan on doing a check-in at the end of the calendar year to see
how our goals are progressing with the new smaller and separated team
structure.
Here is a list of the various projects under the Discovery umbrella, along
with the goals that they will be working on:
Search Backend
Improve search capabilities:
- Implement ‘learning to rank’ [3] and other advanced machine learning
methodologies
- Improve support for languages using new analyzers
- Maintain and expand power user search functionality
Search Frontend
Improve user interface of the search results page with new functionality:
- Implement explore similar [4]
- Update the completion suggester box [5]
- Investigate the usage of a Wiktionary widget for English Wikipedia [6]
Wikidata Query Service
Expand and scale:
- Improve ability to support power features on-wiki for readers
- Improve full text search functionality
- Implement SPARQL federation support
Portal
Create and implement automated language statistics and translation updates
for Wikipedia.org
Analysis
Provide in-depth analytics support:
- Perform experimental design, data collection, and data analysis
- Perform ad-hoc analyses of Discovery-domain data
- Maintain and augment the Discovery Dashboards,[7] which allow the teams
to track their KPIs and other metrics
Maps
Map support:
- Implement new map style
- Increase frequency of OSM data replication
- As needed, assist with individual language Wikipedias' implementation of
mapframe [8]
Note: There is a possibility that we can do more with maps in the coming
year; we are currently evaluating strategic, partnership, and resourcing
options.
Structured Data on Commons
Extend structured data search on Commons, as part of the structured data
grant [9] via:
- Research and implement advanced search capabilities
- Implement new elements, filters, and relationships
Graphs and Tabular Data on Commons
We will be re-evaluating this functionality against other Commons
initiatives such as the structured data grant. As with maps, we will
provide updates when we know more.
We are still working out all the details of the new team structure and
there might be some turbulence; let us know if you have any concerns
and we will do our best to address them.
Best regards,
Deborah Tankersley, Product Manager, Discovery
Erika Bjune, Engineering Manager, Search Platform
Jon Katz, Reading Product Lead
Toby Negrin, Interim Vice President of Product
Victoria Coleman, Chief Technology Officer
[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/June_2017_changes
[2]
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/…
[3] https://en.wikipedia.org/wiki/Learning_to_rank
[4]
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testin…
[5]
https://www.mediawiki.org/wiki/Extension:CirrusSearch/CompletionSuggester
[6]
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testin…
[7] https://discovery.wmflabs.org/
[8] https://www.mediawiki.org/wiki/Maps/how_to:_embedded_maps
[9] https://commons.wikimedia.org/wiki/Commons:Structured_data