Hi!
my view of what was happening may not be accurate, as it was the
first time I was touching this part of the cluster ;-)
1. ms1 had just 128 web threads configured, each of which could be
   occupied by looking up a file, serving it (and blocking if squid
   doesn't consume it fast enough), or blocking on a fastcgi handler
2. even though ms1's I/O was loaded, it wasn't loaded enough to justify
   20s waits on empty-file creations via NFS
3. read operations via NFS were relatively fast compared with write
   operations
4. pybal was quite aggressive in depooling scalers
5. scalers were in one of two states, blocked on NFS writes or
   depooled, due to the stampede of requests hitting whichever server
   was still pooled while all the others were depooled
6. if a scaler was depooled, not only would it not get new requests,
   it also wouldn't be able to write its output back (my assumption,
   based on their frozen states, not a verified fact)
7. due to 4-6, ms1 HTTP threads would get stuck in the 404-handler
   fastcgi (waiting for scalers to respond), thus also blocking the way
   out for existing files
8. ms1 was spending lots of CPU in the zfs`metaslab_alloc() tree (the
   full tree, if anyone is interested, is at
   http://flack.defau.lt/ms1.svg -
   open it in FF3.5 and you can search for metaslab_alloc in it)
9. some digging around the internals (it is amazing how I've forgotten
   pretty much everything I learned in 6h of ZFS sessions last
   November ;-) showed that these costs could have been inflated by the
   number of snapshots we keep - see the quick check right after this list
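A quick way to sanity-check 9 (the pool name below is made up, and I'm
not sure every zdb build has the -m flag, so treat this as a sketch):

  # how many snapshots are we carrying around?
  zfs list -t snapshot | wc -l

  # per-metaslab space map summary, if this zdb supports it
  zdb -m ms1pool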
What was done:
1. Increased the number of worker threads on ms1 (why the heck was it
   that small anyway?)
2. Made balancing way less eager to depool servers (thanks sir Mark)
3. Disabled the ZIL (didn't give much of the expected effect though, as
   the problem was elsewhere)
4. Dropped a few of the oldest snapshots, thus targeting the
   metaslab_alloc() issue - a sketch of the ZFS commands follows below.
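For 3 and 4, the ZFS side roughly looks like this (the zil_disable
tunable matches that vintage of (Open)Solaris, but the pool/snapshot
names are made up, and as far as I know the live write only affects
datasets mounted afterwards):

  # disable the ZIL persistently via /etc/system:
  #   set zfs:zil_disable = 1
  # or poke the running kernel:
  echo zil_disable/W0t1 | mdb -kw

  # list snapshots oldest-first, then drop a few of the oldest
  zfs list -t snapshot -o name,used,creation -s creation
  zfs destroy ms1pool/upload@some-old-snapshot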
Cheers,
Domas
P.S. For anyone Solaris-savvy (I am not, despite where I work), you
know what this means:
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x99
genunix`cv_wait+0x70
zfs`space_map_load_wait+0x20
zfs`space_map_load+0x36
zfs`metaslab_activate+0x6f
zfs`metaslab_group_alloc+0x18d
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_dva_allocate+0x62
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
3816
unix`mutex_delay_default+0xa
unix`mutex_vector_enter+0x99
genunix`cv_wait+0x70
zfs`space_map_load_wait+0x20
zfs`space_map_load+0x36
zfs`metaslab_activate+0x6f
zfs`metaslab_group_alloc+0x18d
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_dva_allocate+0x62
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
4068
unix`mutex_delay_default+0xa
unix`mutex_vector_enter+0x99
zfs`metaslab_group_alloc+0x136
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_write_allocate_gang_members+0x171
zfs`zio_dva_allocate+0xcc
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
4615
7500
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x99
zfs`metaslab_group_alloc+0x136
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_dva_allocate+0x62
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
7785
unix`mutex_delay_default+0xa
unix`mutex_vector_enter+0x99
zfs`metaslab_group_alloc+0x136
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_dva_allocate+0x62
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
10487
genunix`avl_walk+0x39
zfs`space_map_alloc+0x21
zfs`metaslab_group_alloc+0x1a2
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_write_allocate_gang_members+0x171
zfs`zio_dva_allocate+0xcc
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
16297
genunix`avl_walk+0x39
zfs`space_map_alloc+0x21
zfs`metaslab_group_alloc+0x1a2
zfs`metaslab_alloc_dva+0xdb
zfs`metaslab_alloc+0x6d
zfs`zio_dva_allocate+0x62
zfs`zio_execute+0x60
genunix`taskq_thread+0xbc
unix`thread_start+0x8
26149
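These look like on-CPU kernel stacks aggregated by sample count. If
anyone wants to gather the same thing, something along these lines
should do it (the probe rate and duration are arbitrary guesses, not
necessarily what was used here):

  dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }'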