We recently experienced two episodes of instability on our
Elasticsearch cluster in codfw: the first on 2016-04-27, identified
around 10am UTC, the second on 2016-05-02, identified around 11pm UTC.
In both cases the symptoms were similar:
* a cluster restart was in progress to modify cluster discovery
strategy (moving from multicast to unicast [1]).
* cluster wide operations (get list of nodes, shards, changing cluster
settings) were extremely slow
* number of pending tasks (`curl -s
localhost:9200/_cluster/health?pretty | jq .number_of_pending_tasks`)
was high
* most of those tasks were deletion of shards (`curl -s
localhost:9200/_cat/pending_tasks | grep indices_store | wc -l`)
* a few of them were deletion and creation of the
"mediawiki_cirrussearch_frozen_indexes" index
* response time of client requests did not seem to be affected
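
To make the breakdown above easier to repeat, here is a minimal sketch that groups pending tasks by the first word of their source, instead of one `grep` per task type. It assumes the `_cat/pending_tasks` output has four whitespace-separated columns (insert order, time in queue, priority, source). The sample lines, including the wiki index names and the exact source strings, are made up for illustration and are not taken from the incident.

```python
from collections import Counter

# Hypothetical sample shaped like `_cat/pending_tasks` output.
# Columns: insertOrder, timeInQueue, priority, source.
# The source strings below are invented for illustration.
SAMPLE = """\
1685 12.3s NORMAL indices_store ([[enwiki_content][3]] active fully on other nodes)
1686 12.1s NORMAL indices_store ([[enwiki_content][5]] active fully on other nodes)
1687 11.9s URGENT delete-index [mediawiki_cirrussearch_frozen_indexes]
1688 11.8s URGENT create-index [mediawiki_cirrussearch_frozen_indexes], cause [api]
1689 11.0s NORMAL indices_store ([[dewiki_general][0]] active fully on other nodes)
"""


def count_by_source(cat_output: str) -> Counter:
    """Count pending tasks by the first word of their source column."""
    counts = Counter()
    for line in cat_output.splitlines():
        # Split into at most 4 fields so the source keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        kind = fields[3].split(None, 1)[0]
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    for kind, count in count_by_source(SAMPLE).most_common():
        print(kind, count)
```

In practice the input would be the text returned by `curl -s localhost:9200/_cat/pending_tasks` rather than the sample string.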
Only "more like" traffic is going to codfw. On the 27th, we switched
this traffic to eqiad to buy time for investigation and recovery [2].
We found a copy/paste error in the cluster discovery configuration
[3]. After a full cluster restart the situation stabilized.
On the 2nd, we did further investigation by capturing traffic and
looking for the cause of the deletion of the
"mediawiki_cirrussearch_frozen_indexes" index. We saw a high number of
deletion and creation requests for this index, coming from mediawiki
job runners.
Further investigation showed that Elastica (the library used by
CirrusSearch to communicate with Elasticsearch) does a recreate
(deletion followed by creation) to create a new index [4]. This was
fixed quickly by disabling index creation [5][6]. A more permanent fix
is tracked in [7].
We suspect that this issue was seen only on codfw because of the
latency increasing the probability of a race condition between 2
re-create operations.
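
To illustrate why a recreate on every call is expensive, here is a small sketch using an in-memory stand-in for the cluster. `FakeCluster`, `recreate`, `create_if_missing` and `run` are all hypothetical names; this is neither the Elastica API nor the fix that was deployed. It only counts how many cluster-state changes each approach would queue when many callers ask for the same index one after the other, and it does not reproduce the race itself.

```python
class FakeCluster:
    """In-memory stand-in for a cluster; not a real Elasticsearch client."""

    def __init__(self):
        self.indices = set()
        self.state_updates = []  # each entry models one cluster-state task

    def delete_index(self, name):
        self.state_updates.append(("delete", name))
        self.indices.discard(name)

    def create_index(self, name):
        self.state_updates.append(("create", name))
        self.indices.add(name)


def recreate(cluster, name):
    """Deletion followed by creation, as in the recreate path described above."""
    cluster.delete_index(name)
    cluster.create_index(name)


def create_if_missing(cluster, name):
    """Hypothetical alternative: only change cluster state when needed."""
    if name not in cluster.indices:
        cluster.create_index(name)


def run(ensure, callers=50):
    """Let `callers` sequential callers ensure the index exists; count state changes."""
    cluster = FakeCluster()
    for _ in range(callers):
        ensure(cluster, "mediawiki_cirrussearch_frozen_indexes")
    return len(cluster.state_updates)


if __name__ == "__main__":
    print("recreate:", run(recreate))
    print("create_if_missing:", run(create_if_missing))
```

Note that a check-then-create approach would have its own race between concurrent callers on a real cluster, so this should be read as an illustration of the churn, not as a proposed fix.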
Lessons learned:
* index creation / deletion can bring the cluster to its knees
* despite that, Elasticsearch is robust: client requests do not seem to
have been affected
* a rising number of pending tasks seems to be a good indication of
cluster issues [8]
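
As a starting point for alerting on pending tasks, a check could look roughly like the sketch below. It reads the `number_of_pending_tasks` field from the `_cluster/health` response and maps it to Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function name and both thresholds are assumptions chosen for illustration; sensible values would need to come from what the cluster looks like when healthy.

```python
import json

# Hypothetical thresholds; tune against normal cluster behaviour.
WARN_THRESHOLD = 50
CRIT_THRESHOLD = 200


def check_pending_tasks(health_json, warn=WARN_THRESHOLD, crit=CRIT_THRESHOLD):
    """Return (exit_code, message) from a `_cluster/health` JSON document."""
    health = json.loads(health_json)
    pending = health["number_of_pending_tasks"]
    if pending >= crit:
        return 2, f"CRITICAL: {pending} pending cluster tasks"
    if pending >= warn:
        return 1, f"WARNING: {pending} pending cluster tasks"
    return 0, f"OK: {pending} pending cluster tasks"


if __name__ == "__main__":
    # Made-up health document containing only the field the check needs.
    code, message = check_pending_tasks('{"number_of_pending_tasks": 75}')
    print(code, message)
```

The JSON would normally be the body returned by `curl -s localhost:9200/_cluster/health`.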
Big thanks to David and Erik for their support in this issue!
[1] https://phabricator.wikimedia.org/T110236
[2] https://gerrit.wikimedia.org/r/#/c/285620/
[3] https://gerrit.wikimedia.org/r/#/c/285612/
[4] https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Index.php#L238
- not entirely sure about this reference
[5] https://gerrit.wikimedia.org/r/#/c/286541/
[6] https://gerrit.wikimedia.org/r/#/c/286542/
[7] https://phabricator.wikimedia.org/T133793
[8] https://phabricator.wikimedia.org/T134240
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
Hello,
The Maps team at the Wikimedia Foundation is getting closer to making it
possible to add interactive maps <https://www.mediawiki.org/wiki/Maps> to
Wikipedia. If you've ever used services like Google Maps or Mapquest you
may be familiar with interactive maps. We’d like to invite editors to have
a conversation on how these maps might be used within articles. We've put
together information on how these maps and their styles work from a
technical perspective
<https://www.mediawiki.org/wiki/Maps/Conversation_about_interactive_map_use>
– where the data comes from, how maps are styled, how to add an interactive
map, and a few example use cases.
In particular we would like to focus the discussion around three key
questions (open discussion outside these questions is welcome too).
* What types of articles would use interactive maps?
* How do these articles differ in their requirements?
* Are there any classes of articles whose map styling requirement is
fundamentally in conflict with other article classes, thus requiring
multiple styles?
If you are interested, please visit
https://www.mediawiki.org/wiki/Maps/Conversation_about_interactive_map_use
to learn more and get involved.
--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
As promised, I started to dig into the Maps documentation [1] and
started to draw some diagrams. The source of those diagrams is at the
end of this email; you can process that source online if you want the
pretty pictures [2] (well, not that pretty).
I should probably post those diagrams somewhere for discussion, but
I'm not sure where it makes sense...
[1] https://wikitech.wikimedia.org/wiki/Maps
[2] http://www.planttext.com/planttext
**Maps component diagram:**
@startuml
[Varnish]
[Kartotherian]
[Tilerator]
[TileratorUI]
[OSM]
database Cassandra
database Redis
database Postgres
OSM -> Postgres: import maps data
Tilerator -left-> Redis: consume job queue
Tilerator --> Postgres: get data to pre-generate\nvector tiles
Tilerator -> Cassandra: stores pre-generated\nvector tiles
TileratorUI ..> Redis: schedule jobs to\npre-generate tiles
Kartotherian --> Cassandra: serve tiles\nto end user
Varnish --> Kartotherian
@enduml
**Maps deployment diagram:**
@startuml
() "Maps\n(public)" as mapsP
package codfw {
package "Maps cache cluster" as cache {
node cp2003 {
[varnish-frontend] as vfe2003
[varnish-backend] as vbe2003
}
node cp2009 {
[varnish-frontend] as vfe2009
[varnish-backend] as vbe2009
}
node cp2015 {
[varnish-frontend] as vfe2015
[varnish-backend] as vbe2015
}
node cp2021 {
[varnish-frontend] as vfe2021
[varnish-backend] as vbe2021
}
}
() "Maps\n(internal)" as mapsI
node "maps-test2001\n(master)" as maps2001 {
[Kartotherian] as Kartotherian2001
[Tilerator] as Tilerator2001
[TileratorUI] as TileratorUI2001
database Cassandra as Cassandra2001
database Redis as Redis2001
database "Postgres\nmaster" as Postgres2001
Tilerator2001 -left-> Redis2001
Tilerator2001 --> Postgres2001
Tilerator2001 -> Cassandra2001
TileratorUI2001 --> Redis2001
Kartotherian2001 --> Cassandra2001
}
node "maps-test2002-4\n(slaves)" as maps2002 {
[Kartotherian] as Kartotherian20xx
[Tilerator] as Tilerator20xx
[TileratorUI] as TileratorUI20xx
database Cassandra as Cassandra20xx
database Redis as Redis20xx
database "Postgres\nslaves" as Postgres20xx
Tilerator20xx -left-> Redis20xx
Tilerator20xx --> Postgres20xx
Tilerator20xx --> Cassandra20xx
TileratorUI20xx -> Redis20xx
Kartotherian20xx --> Cassandra20xx
}
mapsI - Kartotherian2001
mapsI - Kartotherian20xx
vbe2003 -> mapsI
vbe2009 -> mapsI
vbe2015 -> mapsI
vbe2021 -> mapsI
vfe2003 --> vbe2003
vfe2009 --> vbe2009
vfe2015 --> vbe2015
vfe2021 --> vbe2021
' uncomment the block below to have the mostly complete Varnish connections
' vfe2003 --> vbe2003
' vfe2003 --> vbe2009
' vfe2003 --> vbe2015
' vfe2003 --> vbe2021
'
' vfe2009 --> vbe2003
' vfe2009 --> vbe2009
' vfe2009 --> vbe2015
' vfe2009 --> vbe2021
'
' vfe2015 --> vbe2003
' vfe2015 --> vbe2009
' vfe2015 --> vbe2015
' vfe2015 --> vbe2021
'
' vfe2021 --> vbe2003
' vfe2021 --> vbe2009
' vfe2021 --> vbe2015
' vfe2021 --> vbe2021
mapsP -- vfe2003
mapsP -- vfe2009
mapsP -- vfe2015
mapsP -- vfe2021
Postgres20xx <- Postgres2001
Cassandra20xx <-> Cassandra2001
Redis20xx <-> Redis2001
note right of vfe2003
interconnections between
Varnish frontend and backend
are more complex, not showing
all this here.
end note
note right of mapsI
Need to check if this is a
LVS endpoint or if Varnish talks
directly to Maps
end note
}
note as n1
- unsure about what communication
there is between maps-test nodes
end note
@enduml
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation