[Labs-l] NFS server network capacity upgrade

Petr Bena benapetr at gmail.com
Sat May 24 08:08:44 UTC 2014


I am not saying we should do this instead of the currently scheduled
improvement; it is just a way to prevent future issues.

There will never be a server that is "good enough". No matter how fast
the hardware is, I can create a tool that will kill the server anyway
just by using too many resources (in this case, I could simply create
a bunch of separate tools that would each run 15 tasks reading and
writing NFS as fast as they can).

Done by one person, this would be a violation of the TOS. Now imagine
a number of "uneducated" Tool Labs users who are doing EXACTLY this,
each with just one task and a few jobs. Together they have the same
effect as that one "evil" user, and of course none of them really
violates any rule, which is understandable.

Right now we do:
* limit the number of tasks per tool
* ensure there is enough RAM for each tool using SGE scheduling

What we don't do:
* monitor network usage per tool
* monitor I/O usage per tool (see the sketch below)
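
Even a crude report would be a start. Below is a minimal sketch of
what I mean, not a finished monitor: it walks /proc/<pid>/io on an
execution node, sums the rchar/wchar counters (all bytes passed
through read()/write(), local and NFS alike) and groups them by the
owning account, assuming jobs run under per-tool accounts and the
script can read other users' /proc entries. The counters are
cumulative since process start, so a real tool would sample deltas
over time and track network traffic separately:

#!/usr/bin/env python
# Minimal sketch of per-tool I/O accounting on an execution node: walk
# /proc/<pid>/io, sum rchar/wchar (bytes passed through read()/write(),
# whether the backing store is local or NFS) and group by process owner.
# Assumes jobs run under per-tool accounts and that the script has enough
# privileges to read other users' /proc/<pid>/io.
import os
import pwd
from collections import defaultdict

def io_by_user():
    totals = defaultdict(lambda: [0, 0])  # user -> [bytes_read, bytes_written]
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            user = pwd.getpwuid(os.stat('/proc/' + pid).st_uid).pw_name
            with open('/proc/%s/io' % pid) as f:
                fields = dict(line.strip().split(': ', 1) for line in f)
            totals[user][0] += int(fields['rchar'])
            totals[user][1] += int(fields['wchar'])
        except (EnvironmentError, KeyError, ValueError):
            continue  # process exited, entry unreadable, or unexpected format
    return totals

if __name__ == '__main__':
    ranked = sorted(io_by_user().items(),
                    key=lambda item: -(item[1][0] + item[1][1]))
    for user, (rd, wr) in ranked:
        print('%-25s read %14d B   written %14d B' % (user, rd, wr))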

This is not about being some kind of evil admin who slaps users whose
tools use too much, nor about setting up more stupid restrictions; I
myself hate restrictions of all kinds. I am aware that popular tools
that are accessed heavily will produce a lot of traffic even when
optimized, so these would be fine even at the top of the list. What I
think would be useful and important is to let the people who operate
tools that seem under-optimized know about it, and eventually help
them optimize those tools so that they eat fewer resources.

I believe that this would, in the long term, save a lot of resources
and money. There /are/ tools that need optimization right now, and
they are one of the reasons why other tools are dying: they die
because the systems are overloaded, and the systems are overloaded
because they are not being used effectively.

On Sat, May 24, 2014 at 9:56 AM, Gerard Meijssen
<gerard.meijssen at gmail.com> wrote:
> Hoi,
> Nice in theory. However tools DO die when others produce too much shit for
> the server to handle.
>
> In my mind the most important thing is for Labs to be operational. Worrying
> about dimes and cents is too expensive when it is at the cost of a
> diminished service to Labs users.
>
> Yes, even when performance is always ensured it pays to target bad practices
> because sure as hell, some things do need improvements and it pays to make
> sure that software gets optimised.
> Thanks,
>      GerardM
>
>
> On 24 May 2014 09:39, Petr Bena <benapetr at gmail.com> wrote:
>>
>> What about taking some steps to optimize current resource usage, so
>> that we don't need to put more and more money into increasing the HW
>> resources?
>>
>> For example, I believe there are a number of tools that use the NFS
>> servers in an insane way, e.g. generating tons of temporary data that
>> could be stored in /tmp instead of /data/project. Also, the static
>> binaries that live in /data/project could probably be cached in
>> memory somehow, so that they don't need to be loaded over the network
>> every time the task restarts.
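
To illustrate the first point roughly: keep the scratch file on the
node-local /tmp and copy only the finished result to /data/project in
one sequential write. The sketch below is only an illustration of the
pattern; the tool directory and data are made up:

#!/usr/bin/env python
# Minimal sketch of the "/tmp instead of /data/project" idea: build scratch
# data on the node-local disk and copy only the finished result to NFS in a
# single sequential write.  The tool directory and payload are hypothetical.
import os
import shutil
import tempfile

PROJECT_DIR = '/data/project/exampletool'  # hypothetical NFS-backed tool dir

def build_report(rows):
    # The intermediate file lives on local /tmp, so the many small writes
    # never cross the network.
    with tempfile.NamedTemporaryFile(mode='w', dir='/tmp', suffix='.csv',
                                     delete=False) as tmp:
        for row in rows:
            tmp.write(','.join(str(col) for col in row) + '\n')
        scratch = tmp.name
    try:
        # Single sequential copy of the final result onto NFS.
        shutil.move(scratch, os.path.join(PROJECT_DIR, 'report.csv'))
    finally:
        if os.path.exists(scratch):
            os.remove(scratch)

if __name__ == '__main__':
    build_report([(1, 'example'), (2, 'rows')])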
>>
>> Perhaps installing a sar-like monitoring tool on the NFS server would
>> help to discover which tools use NFS the most, and such a report could
>> help the developers of those tools figure out where optimization is
>> needed. I myself have some idea of how Labs works, so my own tools are
>> usually well optimized to use these network resources (and even disk
>> storage) as little as possible, but others might not be aware of that
>> and may need some help optimizing theirs.
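
As a starting point for such a tool, a Linux NFS server already exports
cumulative byte counters, and a tiny sampler like the sketch below
(assuming /proc/net/rpc/nfsd is available on the server) would give a
sar-style throughput log. It cannot attribute traffic to individual
tools on its own (the per-tool view would have to come from the
execution nodes, as in the sketch further up), but it shows when and
how hard the server is being hit:

#!/usr/bin/env python
# Minimal sar-style sampler: report NFS server read/write throughput by
# sampling the cumulative "io" counters in /proc/net/rpc/nfsd.
# Assumes a Linux kernel NFS server; run it on the server itself.
import time

STATS = '/proc/net/rpc/nfsd'

def read_io_counters():
    """Return cumulative (bytes_read, bytes_written) served by nfsd."""
    with open(STATS) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == 'io':
                return int(parts[1]), int(parts[2])
    raise RuntimeError('no "io" line found in %s' % STATS)

def sample(interval=10):
    prev_read, prev_write = read_io_counters()
    while True:
        time.sleep(interval)
        cur_read, cur_write = read_io_counters()
        print('%s  read %8.2f MB/s  write %8.2f MB/s' % (
            time.strftime('%H:%M:%S'),
            (cur_read - prev_read) / (interval * 1024.0 * 1024.0),
            (cur_write - prev_write) / (interval * 1024.0 * 1024.0)))
        prev_read, prev_write = cur_read, cur_write

if __name__ == '__main__':
    sample()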
>>
>> On Fri, May 23, 2014 at 6:28 PM, Marc A. Pelletier <marc at uberbox.org>
>> wrote:
>> > Hello everyone,
>> >
>> > In the following week or two, we are planning on adding another bonded
>> > network port to increase the NFS server's bandwidth (which is
>> > currently saturating at regular intervals).
>> >
>> > This will imply a short period of downtime (on the order of 10 minutes
>> > or so) during which no NFS service will be provided.  In theory, this
>> > will result in file access simply stalling and resuming at the end of
>> > the outage, but processes that have timeouts may be disrupted (in
>> > particular, web service access will likely report gateway issues during
>> > that interval).
>> >
>> > While this is not set in stone, I am aiming for Friday, May 30 at 18:00
>> > UTC for the downtime.  I will notify this list with a confirmation or a
>> > new schedule at least three days in advance.
>> >
>> > Thanks for your patience,
>> >
>> > -- Marc
>> >
>>
>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>


