Hi!
My view of what was happening may not be accurate, as it was the first time I was touching this part of the cluster ;-)
1. ms1 had just 128 web threads configured, which could be occupied by looking up a file, serving it (and blocking if squid doesn't consume it fast enough), or blocking on fastcgi handlers.
2. Even though ms1's I/O was loaded, it wasn't loaded enough to justify 20s waits on empty file creations via NFS.
3. Read operations via NFS were relatively fast compared with write operations.
4. pybal was quite aggressive in depooling scalers.
5. Scalers were in one of two states, blocked on NFS writes or depooled, due to a stampede of requests hitting one server while all the others were depooled.
6. If a scaler was depooled, not only would it not get requests, it also wouldn't be able to write its output back (my assumption, based on their frozen states, not a verified fact).
7. Due to 4-6, ms1 http threads would be stuck in the 404-handler fastcgi (waiting for scalers to respond), thus also blocking the way out for existing files.
8. ms1 was spending lots of CPU in the zfs`metaslab_alloc() tree (the full tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg - use FF3.5, then you can search for metaslab_alloc in it). A rough sketch of how such a profile can be gathered follows this list.
9. Some digging around the internals (it is amazing how I forgot pretty much everything I learned in 6h of ZFS sessions last November ;-) showed that the costs could have been increased by the number of snapshots we keep.
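For the curious, a kernel CPU profile like the one in item 8 (and the raw stacks in the P.S.) can be gathered with DTrace's profile provider. This is only a minimal sketch, assuming root on the Solaris box and that a ~30s sampling window under load is acceptable; the output filename is made up, and it is not necessarily the exact invocation used on ms1:

    # Sample on-CPU kernel stacks at ~997Hz for 30 seconds.
    # arg0 is the kernel PC, so /arg0/ keeps only samples taken in the kernel.
    dtrace -n '
        profile-997
        /arg0/
        {
            @[stack()] = count();
        }
        tick-30s { exit(0); }
    ' -o ms1-kernel-stacks.txt

The stacks in the P.S. look like the output of exactly that kind of aggregation: the frames first, then how many samples landed in that stack.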
What was done:
1. Increased the number of worker threads on ms1 (why the heck was it that small anyway?).
2. Made balancing way less eager to depool servers (thanks, sir Mark).
3. Disabled the ZIL (didn't give much of the expected effect though, as the problem was elsewhere).
4. Dropped a few of the oldest snapshots, thus targeting the metaslab_alloc() issue. (A sketch of the ZFS side of 3 and 4 follows this list.)
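For reference, the ZFS side of items 3 and 4 amounts to something like the sketch below. The pool and snapshot names are invented for illustration, and on this Solaris vintage the ZIL is disabled via the global zil_disable tunable rather than per dataset; treat this as a sketch, not a runbook.

    # 3. Disable the ZIL. It is a global kernel tunable here, not a
    #    per-dataset property; the live change may only take effect for
    #    datasets mounted after it is set.
    echo zil_disable/W0t1 | mdb -kw
    echo "set zfs:zil_disable = 1" >> /etc/system   # persist across reboots

    # 4. Find the oldest snapshots and drop a few of them.
    #    Pool/snapshot names below are hypothetical.
    zfs list -t snapshot -o name,used -s creation | head
    zfs destroy export/upload@2008-11-01
    zfs destroy export/upload@2008-12-01

The idea behind 4 being that freeing the space pinned by old snapshots gives the allocator more and larger free regions to work with, which is what the metaslab_alloc() costs seemed to hinge on.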
Cheers, Domas
P.S. For anyone Solaris-savvy (I am not, despite where I work), you know what this means:
              unix`mutex_delay_default+0x7
              unix`mutex_vector_enter+0x99
              genunix`cv_wait+0x70
              zfs`space_map_load_wait+0x20
              zfs`space_map_load+0x36
              zfs`metaslab_activate+0x6f
              zfs`metaslab_group_alloc+0x18d
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_dva_allocate+0x62
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
             3816

              unix`mutex_delay_default+0xa
              unix`mutex_vector_enter+0x99
              genunix`cv_wait+0x70
              zfs`space_map_load_wait+0x20
              zfs`space_map_load+0x36
              zfs`metaslab_activate+0x6f
              zfs`metaslab_group_alloc+0x18d
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_dva_allocate+0x62
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
             4068

              unix`mutex_delay_default+0xa
              unix`mutex_vector_enter+0x99
              zfs`metaslab_group_alloc+0x136
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_write_allocate_gang_members+0x171
              zfs`zio_dva_allocate+0xcc
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
             4615

             7500

              unix`mutex_delay_default+0x7
              unix`mutex_vector_enter+0x99
              zfs`metaslab_group_alloc+0x136
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_dva_allocate+0x62
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
             7785

              unix`mutex_delay_default+0xa
              unix`mutex_vector_enter+0x99
              zfs`metaslab_group_alloc+0x136
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_dva_allocate+0x62
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
            10487

              genunix`avl_walk+0x39
              zfs`space_map_alloc+0x21
              zfs`metaslab_group_alloc+0x1a2
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_write_allocate_gang_members+0x171
              zfs`zio_dva_allocate+0xcc
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
            16297

              genunix`avl_walk+0x39
              zfs`space_map_alloc+0x21
              zfs`metaslab_group_alloc+0x1a2
              zfs`metaslab_alloc_dva+0xdb
              zfs`metaslab_alloc+0x6d
              zfs`zio_dva_allocate+0x62
              zfs`zio_execute+0x60
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
            26149