[Labs-l] NFS server network capacity upgrade

Maximilian Doerr cybernet678 at yahoo.com
Sat May 24 13:47:36 UTC 2014


Well said.

Sent from Maximilian's iPhone.

> On May 24, 2014, at 4:08, Petr Bena <benapetr at gmail.com> wrote:
> 
> I'm not saying we should do this instead of the currently scheduled
> improvement; it would just prevent future issues.
> 
> There will never be a server that is "good enough": no matter how
> fast your hardware is, I can create a tool that will kill the server
> anyway just by using too many resources (in this case, I could simply
> create a bunch of separate tools that would each run 15 tasks reading
> and writing NFS as fast as they can).
> 
> Done by one person, this would be a violation of the TOS. Now imagine
> a number of "uneducated" Tool Labs users doing EXACTLY this, each with
> just one tool running a few jobs. Together they behave just like that
> one "evil" user, and of course none of them individually violates any
> rule, which is logical.
> 
> Right now we:
> * limit the number of tasks per tool
> * ensure there is enough RAM for each tool, using SGE scheduling
> 
> What we don't do:
> * monitor network usage per tool
> * monitor I/O usage per tool (see the sketch below)
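> 
> To give an idea of what per-tool accounting could even look like, here
> is a minimal sketch in Python (purely illustrative, not something that
> exists today: it just sums the rchar/wchar counters from /proc/<pid>/io
> per process owner, so it counts all read/write bytes rather than NFS
> traffic specifically, and it would have to run as root on the exec
> nodes):
> 
>     import collections
>     import glob
>     import os
>     import pwd
> 
>     def io_by_user():
>         """Sum rchar/wchar from /proc/<pid>/io per process owner."""
>         totals = collections.defaultdict(lambda: [0, 0])
>         for proc in glob.glob('/proc/[0-9]*'):
>             try:
>                 user = pwd.getpwuid(os.stat(proc).st_uid).pw_name
>                 with open(proc + '/io') as f:
>                     fields = dict(line.split(': ') for line in f)
>                 totals[user][0] += int(fields['rchar'])   # bytes read
>                 totals[user][1] += int(fields['wchar'])   # bytes written
>             except (OSError, IOError, KeyError, ValueError):
>                 continue  # process exited, or /io not readable
>         return totals
> 
>     for user, (r, w) in sorted(io_by_user().items(),
>                                key=lambda item: -sum(item[1])):
>         print('%-20s read=%d write=%d' % (user, r, w))
> 
> Run periodically and turned into a report, even something this crude
> would already show which tools stand out.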
> 
> This is not about being some kind of evil admin who slaps users whose
> tools use too much, nor about setting up more stupid restrictions; I
> myself hate restrictions of all kinds. I am aware that popular tools
> that are accessed heavily will produce a lot of traffic even when
> optimized, so those would be fine even at the top of the list. What I
> think would be useful and important is to let people who operate tools
> that seem under-optimized know about it, and eventually help them
> optimize those tools so that they eat fewer resources.
> 
> I believe that this would, in the long term, save a lot of resources
> and money. There /are/ tools that need optimization right now, and
> they are one of the reasons why other tools are dying. They are dying
> because the systems are overloaded, and the systems are overloaded
> because they are not being used efficiently.
> 
> On Sat, May 24, 2014 at 9:56 AM, Gerard Meijssen
> <gerard.meijssen at gmail.com> wrote:
>> Hoi,
>> Nice in theory. However, tools DO die when others produce too much shit
>> for the server to handle.
>> 
>> In my mind the most important thing is for Labs to be operational. Worrying
>> about dimes and cents is too expensive when it comes at the cost of
>> diminished service to Labs users.
>> 
>> Yes, even when performance is assured, it still pays to target bad
>> practices, because sure as hell some things do need improvement, and it
>> pays to make sure that software gets optimised.
>> Thanks,
>>     GerardM
>> 
>> 
>>> On 24 May 2014 09:39, Petr Bena <benapetr at gmail.com> wrote:
>>> 
>>> What about taking some steps to optimize current resource usage, so
>>> that we don't need to put more and more money into increasing the
>>> hardware resources?
>>> 
>>> For example, I believe there are a number of tools that use the NFS
>>> servers in an insane way, e.g. generating tons of temporary data that
>>> could be stored in /tmp instead of /data/project. Also, the static
>>> binaries that live in /data/project could probably be cached in memory
>>> somehow, so that they don't need to be loaded over the network every
>>> time the task restarts.
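>>> 
>>> As a trivial (made-up) example of what I mean, scratch data can go to
>>> local disk via tempfile, with only the final result being copied to
>>> the NFS-backed project directory; the tool name and file names here
>>> are hypothetical:
>>> 
>>>     import os
>>>     import shutil
>>>     import tempfile
>>> 
>>>     PROJECT_DIR = '/data/project/mytool'   # NFS-backed, hypothetical tool
>>> 
>>>     def generate_report():
>>>         """Placeholder for the tool's real work; yields output chunks."""
>>>         for i in range(1000):
>>>             yield ('line %d\n' % i).encode()
>>> 
>>>     # Do the heavy intermediate writing on local disk, not over NFS...
>>>     with tempfile.NamedTemporaryFile(dir='/tmp', delete=False) as scratch:
>>>         for chunk in generate_report():
>>>             scratch.write(chunk)
>>>         tmp_path = scratch.name
>>> 
>>>     # ...and ship only the finished file to the shared directory.
>>>     shutil.move(tmp_path, os.path.join(PROJECT_DIR, 'report.txt'))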
>>> 
>>> Perhaps installing a sar-like monitoring tool on the NFS server would
>>> help discover which tools use NFS the most, and such a report could
>>> help the developers of those tools figure out where optimization is
>>> needed. I myself have some idea of how Labs works, so my own tools are
>>> usually well optimized to use network resources (and even disk
>>> storage) as little as possible, but others might not be aware of this
>>> and may need some help optimizing theirs.
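>>> 
>>> Just to be concrete about the kind of report I mean (a rough sketch;
>>> this only gives totals for the whole NFS server from the nfsd "io"
>>> counters, so per-tool attribution would still need client-side
>>> accounting):
>>> 
>>>     import time
>>> 
>>>     def nfsd_io():
>>>         """Return (bytes_read, bytes_written) served by nfsd so far."""
>>>         with open('/proc/net/rpc/nfsd') as f:
>>>             for line in f:
>>>                 fields = line.split()
>>>                 if fields and fields[0] == 'io':
>>>                     return int(fields[1]), int(fields[2])
>>>         return 0, 0
>>> 
>>>     INTERVAL = 60  # seconds between samples, sar-style
>>>     prev_r, prev_w = nfsd_io()
>>>     while True:
>>>         time.sleep(INTERVAL)
>>>         r, w = nfsd_io()
>>>         print('%s  read %.1f MB/s  write %.1f MB/s' % (
>>>             time.strftime('%H:%M:%S'),
>>>             (r - prev_r) / 1e6 / INTERVAL,
>>>             (w - prev_w) / 1e6 / INTERVAL))
>>>         prev_r, prev_w = r, w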
>>> 
>>> On Fri, May 23, 2014 at 6:28 PM, Marc A. Pelletier <marc at uberbox.org>
>>> wrote:
>>>> Hello everyone,
>>>> 
>>>> In the next week or two, we are planning on adding another bonded
>>>> network port to increase the NFS server's bandwidth (which is
>>>> currently saturating at regular intervals).
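>>>> 
>>>> (For the curious: once the second port is in the bond, the link state
>>>> can be sanity-checked from the server itself.  A rough sketch, assuming
>>>> the bond device is called bond0, which may not match our actual setup:)
>>>> 
>>>>     def bond_slaves(bond='bond0'):
>>>>         """List (slave interface, MII status) pairs for a bond device."""
>>>>         slaves, current = [], None
>>>>         with open('/proc/net/bonding/%s' % bond) as f:
>>>>             for line in f:
>>>>                 if line.startswith('Slave Interface:'):
>>>>                     current = line.split(':', 1)[1].strip()
>>>>                 elif line.startswith('MII Status:') and current:
>>>>                     slaves.append((current, line.split(':', 1)[1].strip()))
>>>>                     current = None
>>>>         return slaves
>>>> 
>>>>     for iface, status in bond_slaves():
>>>>         print('%s: %s' % (iface, status))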
>>>> 
>>>> This will imply a short period of downtime (on the order of 10 minutes
>>>> or so) during which no NFS service will be provided.  In theory, this
>>>> will result in file access simply stalling and resuming at the end of
>>>> the outage, but processes that have timeouts may be disrupted (in
>>>> particular, web service access will likely report gateway issues during
>>>> that interval).
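>>>> 
>>>> (If your tool is sensitive to that, a simple retry around its NFS
>>>> accesses is usually enough to ride out a short stall.  A rough,
>>>> generic sketch, nothing Labs-specific, with a made-up file path:)
>>>> 
>>>>     import time
>>>> 
>>>>     def read_with_retry(path, attempts=5, delay=30):
>>>>         """Retry a read so a brief NFS outage doesn't kill the job."""
>>>>         for i in range(attempts):
>>>>             try:
>>>>                 with open(path) as f:
>>>>                     return f.read()
>>>>             except (IOError, OSError):
>>>>                 if i == attempts - 1:
>>>>                     raise
>>>>                 time.sleep(delay)  # wait out the stall and try again
>>>> 
>>>>     data = read_with_retry('/data/project/mytool/state.txt')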
>>>> 
>>>> While this is not set in stone, I am aiming for Friday, May 30 at 18:00
>>>> UTC for the downtime.  I will notify this list with a confirmation or a
>>>> new schedule at least three days in advance.
>>>> 
>>>> Thanks for your patience,
>>>> 
>>>> -- Marc