We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
The kernel patch is applied, the machine's rebooted, and uploads & math are re-enabled. So far all looks well; this'll keep things running smoothly while we continue work on rearranging the files (moving thumbs out to a separate server will simplify things and keep the main filesystem backup snapshots from filling up with short-lived files, for instance).
-- brion
Well, the Sun patch didn't hold us. So we are going to do a gradual move of directories, some data, and the actual thumbnail regeneration over to another server, ms4. How much data we move depends on how fast ms1 can push it over to ms4, currently very slow indeed. Even removing files on ms1 is painfully slow. So...
In order to get ms1 out of its hole, we're turning off uploads once again, so that it can free up space at a reasonable pace. Once we have enough free space to make ZFS happy, we'll turn them back on. We expect that the rest of the migration of data and/or empty directories should take place over the next several days without further outages.
Ariel
On Wed, 15-07-2009 at 14:28 -0700, Brion Vibber wrote:
Brion Vibber wrote:
We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
The kernel patch is applied, the machine's rebooted, and uploads & math are re-enabled. So far all looks well; this'll keep things running smoothly while we continue work on rearranging the files (moving thumbs out to a separate server will simplify things and keep the main filesystem backup snapshots from filling up with short-lived files, for instance).
-- brion
2009/7/17 Ariel T. Glenn ariel@wikimedia.org:
Well, the Sun patch didn't hold us. So we are going to do a gradual move of directories, some data, and the actual thumbnail regeneration over to another server, ms4. How much data we move depends on how fast ms1 can push it over to ms4, currently very slow indeed. Even removing files on ms1 is painfully slow. So...
So the ZFS bug came back?
The hard part is thinking of a file system that does snapshots as nicely, under any OS.
In order to get ms1 out of its hole, we're turning off uploads once again, so that it can free up space at a reasonable pace. Once we have enough free space to make ZFS happy, we'll turn them back on. We expect that the rest of the migration of data and/or empty directories should take place over the next several days without further outages.
Honestly, a reboot will be quicker. I fully understand not even being able to, of course.
- d.
On Thu, Jul 16, 2009 at 10:18 PM, David Gerard dgerard@gmail.com wrote:
Honestly, a reboot will be quicker.
Why would a reboot help? The system would still have too little disk space (if that's what's actually causing the problem).
2009/7/17 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Thu, Jul 16, 2009 at 10:18 PM, David Gerard dgerard@gmail.com wrote:
Honestly, a reboot will be quicker.
Why would a reboot help? The system would still have too little disk space (if that's what's actually causing the problem).
When you beat the crap out of Solaris, it needs rebooting way more than a Unix should. We don't tell the NT admins about this; they get snarky.
The ZFS bug manifests when the file system is (a) very full (b) getting lots of writes. The block allocation algorithm uses up all the CPU trying for perfection rather than adequacy. So system CPU goes through the roof and the system turns to molasses. Only way out: reboot - stopping writes or severely reducing the disk usage didn't work for us on Solaris 10.
After a reboot, don't write to the file system, just read the data off it.
Then start over with a lot less data on that FS. 70% or less.
Hard part: being able to take the machine out of service at all. Harder part: moving services off the box while keeping disk under 70%.
- d.
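For anyone who wants to catch this before it bites, here is a minimal monitoring sketch along the lines of the 70% guideline above. It assumes the zpool CLI is available on the host; the pool name and threshold are illustrative, not ms1's actual configuration.

#!/usr/bin/env python
# Minimal sketch: warn before a ZFS pool gets full enough to hit the
# pathological block-allocation behaviour described above.
# Assumes the zpool CLI is on PATH; pool name and threshold are examples.
import subprocess
import sys

POOL = "tank"       # hypothetical pool name
THRESHOLD = 70      # warn once the pool is this full (percent)

def pool_capacity(pool):
    """Return the pool's used capacity as an integer percentage."""
    out = subprocess.check_output(["zpool", "list", "-H", "-o", "capacity", pool])
    return int(out.decode().strip().rstrip("%"))

if __name__ == "__main__":
    used = pool_capacity(POOL)
    if used >= THRESHOLD:
        print("WARNING: %s is %d%% full (threshold %d%%)" % (POOL, used, THRESHOLD))
        sys.exit(1)
    print("%s is %d%% full, OK" % (POOL, used))

Run it from cron and page someone well before the pool crosses into the misbehaviour region.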
2009/7/17 David Gerard dgerard@gmail.com:
The ZFS bug manifests when the file system is (a) very full (b) getting lots of writes. The block allocation algorithm uses up all the CPU trying for perfection rather than adequacy. So system CPU goes through the roof and the system turns to molasses. Only way out: reboot - stopping writes or severely reducing the disk usage didn't work for us on Solaris 10.
And today I happened to be chatting to another Solaris 10 administrator who has seen precisely the same bug manifest itself on a terabyte zpool getting lots of I/O. So that's three cases. I certainly hope Sun get to the bottom of this one sooner rather than later ... and that btrfs gets out of alpha sooner than later.
- d.
David Gerard wrote:
And today I happened to be chatting to another Solaris 10 administrator who has seen precisely the same bug manifest itself on a terabyte zpool getting lots of I/O. So that's three cases. I certainly hope Sun get to the bottom of this one sooner rather than later ... and that btrfs gets out of alpha sooner than later.
Here's an anonymized extract of what I sent Brion privately back then:
... experts on AFS, IFS, NFS, and Linux (as in, official Linux kernel commit/maintainers).... Keep in mind that I'm an Internet guy, not an application or storage guy, I'm just passing along....
[http://wikitech.wikimedia.org/view/Ms1_troubles]
This is typical for that type of file system. NetApp filers have similar behavior after exceeding a certain relative capacity (maybe 80%). This is the trade-off for the advanced features like snapshots.
If you provision for keeping your primary storage at 60% capacity or less, you should continue to be just fine. You can use more storage capacity on secondary storage (such as backup storage), since lower tiers in the storage hierarchy are generally less latency sensitive.
[http://wikitech.wikimedia.org/view/User:River/Storage]
Sounds reasonable. Aligns well with what others do.
If acquisition cost is a concern, you might also consider building the hardware from scratch. Solaris 10 runs on most x86 hardware, but you have to be careful about selecting hardware (especially network and storage) that is on the Solaris HCL.
...
If the single server is still working well enough, migration to the two server version is probably your best bet for rolling your own. One to write and the other to distribute files and snapshots. You can even do it over Gb ethernet without FC at much lower cost. Network is still faster than disk access.
...
If you have to buy now, and are unlikely to upgrade for years, the current gold plated performance version is Sun ZFS over NetApp filers.
...
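A rough sketch of the "one writes, one distributes" step mentioned above, done as incremental zfs send/receive over plain ssh and Gb ethernet. The host name, dataset, and snapshot labels here are placeholders, not the actual ms1/ms4 layout.

#!/usr/bin/env python
# Rough sketch: push an incremental snapshot stream from the writing server
# to the distribution server with zfs send | ssh zfs receive.
# Dataset, remote host, and snapshot names are illustrative only.
import subprocess

DATASET = "tank/media"        # hypothetical dataset
REMOTE = "ms4.example.org"    # hypothetical receiving host

def snapshot(name):
    subprocess.check_call(["zfs", "snapshot", "%s@%s" % (DATASET, name)])

def replicate(prev, curr):
    """Send the incremental stream prev -> curr to the remote dataset."""
    send = subprocess.Popen(
        ["zfs", "send", "-i", "%s@%s" % (DATASET, prev), "%s@%s" % (DATASET, curr)],
        stdout=subprocess.PIPE)
    subprocess.check_call(
        ["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
        stdin=send.stdout)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    # In practice you would record the last snapshot that replicated cleanly;
    # these names are just placeholders.
    snapshot("repl-new")
    replicate("repl-old", "repl-new")

The nice property is that only the blocks changed since the previous snapshot cross the wire, so a gigabit link is usually plenty.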
2009/8/4 William Allen Simpson william.allen.simpson@gmail.com:
If acquisition cost is a concern, you might also consider building the hardware from scratch. Solaris 10 runs on most x86 hardware, but you have to be careful about selecting hardware (especially network and storage) that is on the Solaris HCL.
Any generic x86 server hardware should be fine in general. Dells certainly are; I believe HP is actively supporting Solaris too.
If you have to buy now, and are unlikely to upgrade for years, the current gold plated performance version is Sun ZFS over NetApp filers.
Rilly? I thought they were comparable in performance but Sun was way cheaper (hence the patent kerfuffle).
- d.
David Gerard wrote:
2009/8/4 William Allen Simpson william.allen.simpson@gmail.com:
If you have to buy now, and are unlikely to upgrade for years, the current gold plated performance version is Sun ZFS over NetApp filers.
Rilly? I thought they were comparable in performance but Sun was way cheaper (hence the patent kerfuffle).
No idea myself. I'm just passing along comments verbatim. I'm pretty sure "gold plated" means expensive. And I'm pretty sure NetApp wouldn't stay in business long for "comparable" performance. So, a whole bunch of somebodies out there think that NetApp performance exceeds others. YMMV.
Meanwhile, back at the ranch, the team needs to decide whether the 2 server, shared, pushme-pullyou variant, without Fiber Channel (or iSCSI or whatever), would perform well enough to meet current needs.
On Tue, Aug 4, 2009 at 1:30 PM, William Allen Simpson william.allen.simpson@gmail.com wrote:
David Gerard wrote:
2009/8/4 William Allen Simpson william.allen.simpson@gmail.com:
If you have to buy now, and are unlikely to upgrade for years, the current gold plated performance version is Sun ZFS over NetApp filers.
Rilly? I thought they were comparable in performance but Sun was way cheaper (hence the patent kerfuffle).
No idea myself. I'm just passing along comments verbatim. I'm pretty sure "gold plated" means expensive. And I'm pretty sure NetApp wouldn't stay in business long for "comparable" performance. So, a whole bunch of somebodies out there think that NetApp performance exceeds others. YMMV.
Meanwhile, back at the ranch, the team needs to decide whether the 2 server, shared, pushme-pullyou variant, without Fiber Channel (or iSCSI or whatever), would perform well enough to meet current needs.
All the modern filesystems (WAFL, ZFS) have odd behavior and slowdowns as you approach full on the disk. I've got a bunch of multi-TB pools on Sun X4500s serving NFS and local storage, with ZFS, and have seen consistent stable performance if we keep them less than 70-80% full.
If you want more consistent behavior near the edges plus snapshots, you probably want to go buy Veritas / Symantec Foundation Suite - the Volume Manager gives you multi-disk RAID and snapshots, and the VxFS filesystem gives you growable and high-scale filesystems.
More disks and WAFL or ZFS (just accepting a quota limit of 70% or something, to keep it from entering the misbehavior region) is probably cost-competitive, though. We're in the process of buying a bunch of X4540s with 48x500 GB drives; we expect to get something like 18 TB usable after RAID and hot spares and OS disks, and not to load them up past about 14 TB. About $28k each; less if you're an educational or charitable institution. The 750 GB and 1 TB drive options look attractive too.
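As a back-of-the-envelope check on those figures, here is a small sketch. The exact layout is my assumption (the message only gives the end numbers), e.g. a mirrored OS pair, two hot spares, and four raidz2 groups; adjust to taste.

# Back-of-the-envelope sizing for a 48 x 500 GB box and the matching quota.
# The RAID layout below is an assumption, not the actual X4540 config.
DRIVES = 48
DRIVE_TB = 0.5          # 500 GB drives
OS_DISKS = 2            # assumed mirrored boot pair
HOT_SPARES = 2          # assumed
RAIDZ_GROUPS = 4        # assumed: four raidz2 groups of 11 disks each
PARITY_PER_GROUP = 2

data_disks = DRIVES - OS_DISKS - HOT_SPARES - RAIDZ_GROUPS * PARITY_PER_GROUP
usable_tb = data_disks * DRIVE_TB          # ~18 TB, matching the figure above
target_tb = usable_tb * 0.78               # stay inside the 70-80% comfort zone

print("usable ~%.1f TB, cap usage at ~%.1f TB" % (usable_tb, target_tb))
print("e.g.: zfs set quota=%dT tank/media" % int(target_tb))

That lands at roughly 18 TB usable and a 14 TB working limit, which is consistent with the numbers quoted.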
On Tue, Aug 4, 2009 at 3:42 PM, George Herbert george.herbert@gmail.com wrote:
On Tue, Aug 4, 2009 at 1:30 PM, William Allen Simpson william.allen.simpson@gmail.com wrote:
David Gerard wrote:
2009/8/4 William Allen Simpson william.allen.simpson@gmail.com:
If you have to buy now, and are unlikely to upgrade for years, the current gold plated performance version is Sun ZFS over NetApp filers.
Rilly? I thought they were comparable in performance but Sun was way cheaper (hence the patent kerfuffle).
No idea myself. I'm just passing along comments verbatim. I'm pretty sure "gold plated" means expensive. And I'm pretty sure NetApp wouldn't stay in business long for "comparable" performance. So, a whole bunch of somebodies out there think that NetApp performance exceeds others. YMMV.
Meanwhile, back at the ranch, the team needs to decide whether the 2 server, shared, pushme-pullyou variant, without Fiber Channel (or iSCSI or whatever), would perform well enough to meet current needs.
All the modern filesystems (WAFL, ZFS) have odd behavior and slowdowns as you approach full on the disk. I've got a bunch of multi-TB pools on Sun X4500s serving NFS and local storage, with ZFS, and have seen consistent stable performance if we keep them less than 70-80% full.
If you want more consistent behavior near the edges plus snapshots, you probably want to go buy Veritas / Symantec Foundation Suite - the Volume Manager gives you multi-disk RAID and snapshots, and the VxFS filesystem gives you growable and high-scale filesystems.
Another option for a shared file system with tiered storage (and multiple-copy archive ability) is Sun SAM/QFS. It also has the nicety of being open sourced recently.
Unfortunately, the open source version isn't ready for use. Also, native Linux support has always been kind of poor; however, you could always do SAM/QFS for tiered storage, and do pNFS (or NFSv4) for data access.
V/r,
Ryan Lane
On Wed, Jul 15, 2009 at 1:43 PM, Brion Vibber brion@wikimedia.org wrote:
We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
-- brion vibber (brion @ wikimedia.org)
Have you considered using glusterfs for improved reliability and speed? The more bricks you add, the better the throughput and fail-safety, meaning that with hundreds of servers you could have a very impressive filesystem. Of course, it does stress your entire network and would only be feasible within a single datacenter, preferably on a single rack.
On Tue, Aug 4, 2009 at 1:37 PM, Brian Brian.Mingus@colorado.edu wrote:
On Wed, Jul 15, 2009 at 1:43 PM, Brion Vibber brion@wikimedia.org wrote:
We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
-- brion vibber (brion @ wikimedia.org)
Have you considered using glusterfs for improved reliability and speed? The more bricks you add, the better the throughput and fail-safety, meaning that with hundreds of servers you could have a very impressive filesystem. Of course, it does stress your entire network and would only be feasible within a single datacenter, preferably on a single rack.
Wasn't well received among the high end storage crowd, based on discussions at FAST this year.
But that's secondhand - I haven't used it myself. YMMV.
On Tue, Aug 4, 2009 at 2:48 PM, George Herbert george.herbert@gmail.com wrote:
On Tue, Aug 4, 2009 at 1:37 PM, Brian Brian.Mingus@colorado.edu wrote:
On Wed, Jul 15, 2009 at 1:43 PM, Brion Vibber brion@wikimedia.org wrote:
We're applying a Solaris kernel patch for the ZFS performance problem on ms1, our main media file server.
While in progress, uploads will be temporarily disabled. Cached images should still display, but you may see some missing images in the meantime...
-- brion vibber (brion @ wikimedia.org)
Have you considered using glusterfs for improved reliability and speed? The more bricks you add, the better the throughput and fail-safety, meaning that with hundreds of servers you could have a very impressive filesystem. Of course, it does stress your entire network and would only be feasible within a single datacenter, preferably on a single rack.
Wasn't well received among the high end storage crowd, based on discussions at FAST this year.
But that's secondhand - I haven't used it myself. YMMV.
We switched our cluster and lab from nfs to glusterfs and it's been a dream. There is also the possibility of getting better random read speeds than you can get out of a 15k rpm drive.
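If anyone wants to sanity-check the random-read claim, here is a crude micro-benchmark sketch: time a batch of random 4 KB reads from a big file on whichever mount you point it at (glusterfs, NFS, local disk). Paths are placeholders, and the test file needs to be much larger than the client's page cache for the numbers to mean anything.

#!/usr/bin/env python
# Crude random-read micro-benchmark; paths are placeholders.
import os
import random
import sys
import time

def random_reads(path, reads=2000, block=4096):
    """Return random reads per second against the given file."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    for _ in range(reads):
        os.lseek(fd, random.randrange(0, size - block), os.SEEK_SET)
        os.read(fd, block)
    elapsed = time.time() - start
    os.close(fd)
    return reads / elapsed

if __name__ == "__main__":
    # e.g. python randread.py /mnt/gluster/testfile /mnt/nfs/testfile
    for path in sys.argv[1:]:
        print("%s: %.0f random reads/sec" % (path, random_reads(path)))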