Hello folks,
lately we've again had quite a bit going on...
While Brion was implementing that anon-blocking stuff (yay, more blocking - faster performance!), we were targeting other performance issues as well. Tim rewrote the IP block code (cut 50ms or so ;-) and made lots of other nice improvements, and now we've implemented Mark's idea of running diskless squids (well, they have disks, just no cache on them).
Lots of our new servers have joined the object cache, running (hehe, again) Tugela instead of memcached. It will be interesting to see how it grows. Sadly, no expiration (memory->disk) of objects has happened yet in a week, so we can't measure anything. Standalone BerkeleyDB might be a bit faster than memcached, though benchmarks on the same hardware haven't been run.
~22G of data is cached in the object cache now - parser objects, image metadata, diffs, sessions, user objects, 'you have new messages' bits and language objects. So far we haven't noticed any of the glitches that forced us to remove Tugela from service before (some cosmetic patches were applied). Anyway, we have more RAM that didn't cost millions, so we use it.
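For the curious, from the application side Tugela looks just like memcached (it speaks the usual memcached wire protocol), so any memcached client can talk to it. A rough sketch, not our actual code - the hostname, keys and the re-render helper are invented for the example:

import memcache

# One of the Tugela-backed object cache nodes (hypothetical address).
mc = memcache.Client(["tugela1.example.wikimedia.org:11211"])

KEY = "parsercache:enwiki:Main_Page"   # made-up key naming

# Store a parser-cache-style object with a one-day expiry.
mc.set(KEY, "<rendered html...>", time=86400)

# Reads fall back to a re-render on a miss, as with memcached.
html = mc.get(KEY)
if html is None:
    html = render_page("Main_Page")    # hypothetical re-render helper
    mc.set(KEY, html, time=86400)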
Anyway, today with squids running from memory only we managed to achieve 0.09s average response times for logged-in users, at least those who go directly to Florida. Before that, Squid efficiency was really distorted by somewhat blocking async I/O (if it really existed there), poor sibling relations and a memory leak.
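For the record, that figure is just the mean of squid's elapsed-time field. Something along these lines reproduces the number from the logs - assuming the default native access.log format, where the second field is the service time in milliseconds (the log path is a guess):

total_ms = 0
count = 0
with open("/var/log/squid/access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            total_ms += int(fields[1])   # elapsed time in ms
            count += 1
        except ValueError:
            continue

if count:
    print("average response time: %.3fs over %d requests"
          % (total_ms / 1000.0 / count, count))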
We still have that memory leak and are somewhat lost with it... Squid 'accounts' for 1G of memory but uses >2G, and it keeps growing until restarted. We need to solve that, but nobody has ever really run valgrind at such loads (eh, today the squid servers were serving something like 700 requests per second each), and I'm not sure anyone has run valgrind properly at all ;-) We'll soon have a bunch of servers suitable for squid duty, but still, using them more efficiently would help. We will always lack resources somewhere :-)
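In the meantime we can at least quantify the growth before reaching for valgrind. A trivial sketch that logs squid's resident size over time (the PID file path is a guess, adjust as needed):

import time

PIDFILE = "/var/run/squid.pid"

def squid_rss_mb():
    pid = open(PIDFILE).read().strip()
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0   # kB -> MB
    return 0.0

while True:
    # print a timestamped sample; pipe to a file to watch the growth rate
    print("%s squid RSS: %.0f MB" % (time.strftime("%H:%M:%S"), squid_rss_mb()))
    time.sleep(300)   # every five minutes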
Guidelines could help, or we could simply provide our sources, a bit of configuration and some load documentation. *shrug*
Another troubling part is sibling relations - right now each proxy marks the others as siblings and proxy-only, that is, it shouldn't store their content in its own cache. Eventually they stop talking to each other at all and hit the backend, each with its own separate cache. I'm not sure whether that's related to equal object expiration times or to something else. If anyone has experience with squids in such setups - lots of objects, lots of servers, and efficiency actually under control - it would sure be nice to hear about it. It is still strange that it blocks quite a bit on some housekeeping I/O operations.
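For reference, the relations are plain cache_peer sibling + proxy-only lines. A sketch of how the relevant squid.conf fragment could be generated per host (hostnames and ports here are invented, not our real config; the directive follows squid's "cache_peer host type http_port icp_port [options]" syntax):

import socket

# hypothetical frontend list
frontends = ["sq1.example.org", "sq2.example.org", "sq3.example.org"]
me = socket.getfqdn()

for host in frontends:
    if host == me:
        continue
    # proxy-only: fetch from the sibling but do not store its objects locally
    print("cache_peer %s sibling 3128 3130 proxy-only" % host)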
BTW, it took a while today to detect serious packet loss at our upstream providers. It only slightly affects client network performance, but it practically stalls communication between our distributed clusters. Looking for such problems becomes a bit of a witch hunt :)
So much for today's experiences and joys ;-)
Cheers, Domas
Domas Mituzas wrote: <snip>
We still have that memory leak and are somewhat lost with it..
Can't you just have the squids killed every day?
I managed some huge DNS servers and we restarted bind every eight hours on the resolvers and every Sunday on the nameservers. That worked well and kept the leaks from building up.
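Something crude run from cron would do it. A sketch of the kind of watchdog I mean - paths, threshold and the restart command are guesses, adjust to your init scripts:

import subprocess

PIDFILE = "/var/run/squid.pid"
LIMIT_MB = 2048   # bounce squid once it grows past ~2G

def rss_mb(pid):
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0   # kB -> MB
    return 0.0

pid = open(PIDFILE).read().strip()
if rss_mb(pid) > LIMIT_MB:
    # hypothetical restart command -- use whatever your init scripts provide
    subprocess.call(["/etc/init.d/squid", "restart"])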
<snip>
BTW, it took a while today to detect serious packet loss at our upstream providers. It only slightly affects client network performance, but it practically stalls communication between our distributed clusters. Looking for such problems becomes a bit of a witch hunt :)
What about that Nagios box? :o) Get me a small server and I will flood you with 'packet loss > 0.2%' email & IRC notices :o)
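A sketch of the kind of check I mean - ping each remote cluster host and complain above 0.2% loss. Hostnames are invented, and the real thing would notify through Nagios/IRC/email rather than print:

import re
import subprocess

HOSTS = ["remote-cluster1.example.org", "remote-cluster2.example.org"]
THRESHOLD = 0.2   # percent

for host in HOSTS:
    out = subprocess.run(["ping", "-c", "50", "-q", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    if m and float(m.group(1)) > THRESHOLD:
        print("WARNING: %s%% packet loss to %s" % (m.group(1), host))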
cheers,