[Labs-admin] Fwd: Re: Servers with GPUs
Andrew Bogott
abogott at wikimedia.org
Wed Feb 22 17:11:37 UTC 2017
FYI, here is a thing that reading/ores people are daydreaming about.
I've made it clear that bare-metal-in-labs is a nonstarter; having
special hardware with special VM types is not necessarily all that much
work. I bet a labvirt with a GPU will be super expensive though, and I
don't know a damn thing about how GPU resource-contention would be handled.
-A
-------- Forwarded Message --------
From: Adam Baso <abaso at wikimedia.org>
Date: Wed, 22 Feb 2017 10:56:30 -0600
Message-ID:
<CAB74=Nq+aSspOt2bp1Hno58xjcERf3P8S_c_HPeaBgFWiY_NFw at mail.gmail.com>
Subject: Re: Servers with GPUs
To: Andrew Bogott <abogott at wikimedia.org>
Cc: Aaron Halfaker <ahalfaker at wikimedia.org>, Dario Taraborelli
<dtaraborelli at wikimedia.org>, Ellery Wulczyn <ewulczyn at wikimedia.org>,
Andrew Otto <otto at wikimedia.org>, Corey Floyd <cfloyd at wikimedia.org>,
Andrew Otto <acotto at gmail.com>
I feel comfortable saying we want one of these options. I'm trying to
gchat Ryan Lane to see if he has insight on the Nova support. I just
realized when I checked action=history he was one of the editors of
HeterogeneousGpuAcceleratorSupport
<https://wiki.openstack.org/wiki/HeterogeneousGpuAcceleratorSupport>!
-Adam
On Tue, Feb 21, 2017 at 2:16 PM, Andrew Bogott <abogott at wikimedia.org
<mailto:abogott at wikimedia.org>> wrote:
On 2/21/17 1:36 PM, Adam Baso wrote:
> I think either that, or if it's easier, OpenStack-provisioned
> physical servers. Do you think the latter is doable?
It's possible, although the last time we visited that issue we
swiftly determined that even though people were asking for it no one
actually wanted it. The conclusion from that process is at
https://wikitech.wikimedia.org/wiki/Labs_labs_labs/Bare_Metal
<https://wikitech.wikimedia.org/wiki/Labs_labs_labs/Bare_Metal>
If GPU instance support is in nova and actually maintained, that
might be worth a try.
-A
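For what it's worth, GPU passthrough in Nova (as of the Newton-era
releases) is driven by a PCI whitelist and alias on the compute host. A
rough sketch of the wiring, with placeholder vendor/product IDs and
alias name rather than any actual hardware we have:

```ini
# Hypothetical nova.conf fragment on a GPU-equipped labvirt.
# 10de is NVIDIA's PCI vendor ID; the product ID here is a placeholder.
[DEFAULT]
# Expose the physical GPU to Nova for passthrough.
pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "1b06" }
# Let flavors request the device by name.
pci_alias = { "name": "gpu", "vendor_id": "10de", "product_id": "1b06" }
```

A flavor extra spec such as `pci_passthrough:alias=gpu:1`, together
with the `PciPassthroughFilter` scheduler filter, would then steer GPU
instances onto the GPU host. Note that contention is handled only by
the scheduler counting whole devices per host, so one card means one
GPU instance at a time.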
> I gather the former would require different, if not deeper
> analysis (cf. HeterogeneousGpuAcceleratorSupport
> <https://wiki.openstack.org/wiki/HeterogeneousGpuAcceleratorSupport>).
>
>
> -Adam
>
> On Tue, Feb 21, 2017 at 12:08 PM, Andrew Bogott
> <abogott at wikimedia.org <mailto:abogott at wikimedia.org>> wrote:
>
> Can y'all tell me a bit more about how this would relate to
> labs? Is the idea that you want an option to create VMs with
> virtualized GPU hardware? Or... something else? (I'm not
> immediately clear on how that would work, but I'm largely
> ignorant on the subject.)
>
> -A
>
>
>
> On 2/21/17 12:02 PM, Aaron Halfaker wrote:
>> +9000 :D. (also + Andrew bogott)
>>
>> Would love to have this kind of resource in labs and openly
>> available. I'm personally stoked to start experimenting but
>> not ready to invest in production GPUs yet. A few of my
>> external collaborators have asked about GPU resources in labs
>> too.
>>
>>
>>
>> On Feb 21, 2017 11:52, "Adam Baso" <abaso at wikimedia.org
>> <mailto:abaso at wikimedia.org>> wrote:
>>
>> Following up on this here current thread, what do you all
>> think about doing the GPU acceleration in Labs first?
>>
>> I don't know if it was halfak or marktraceur who
>> suggested it first (although Aaron's brought it up a
>> couple times now), but it's /probably/ less up front
>> architectural overhead to start out with, even if in the
>> future we'd have a strict requirement on HA (Q1 FY 18-19
>> at the very latest under current guess). As Aaron has rightly
>> noted, doing this in Labs also lets us learn and gives community
>> members greater access to innovate early. The primary downside of
>> not getting this in
>> production with HA up front is if funding dries up for FY
>> 18-19 we're stuck or pegged to certain workflows later
>> on. But maybe it's not worth worrying about that too much.
>>
>> I think one very much open question, though, would be if
>> it's possible to have a machine with the GPU card
>> installed and specifically assigned in Labs. Does anyone
>> know if that's actually possible?
>>
>> -Adam
>>
>>
>>
>>
>>
>> On Thu, Feb 16, 2017 at 4:56 PM, Aaron Halfaker
>> <ahalfaker at wikimedia.org
>> <mailto:ahalfaker at wikimedia.org>> wrote:
>>
>> +1 for looping me in on that thread & discussion. I'd
>> like to loop in someone from Labs (probably
>> andrewbogott) about purchasing GPUs for Labs so that
>> we can (1) run prediction models in Beta and (2)
>> empower our volunteers/external researchers to
>> experiment with us.
>>
>> -Aaron
>>
>> On Thu, Feb 16, 2017 at 12:11 PM, Adam Baso
>> <abaso at wikimedia.org <mailto:abaso at wikimedia.org>> wrote:
>>
>> +Corey
>>
>> On Thu, Feb 16, 2017 at 11:15 AM, Adam Baso
>> <abaso at wikimedia.org
>> <mailto:abaso at wikimedia.org>> wrote:
>>
>> We have a need for push notification servers
>> already, so I've opened a thread with Mark
>> and Faidon about getting those servers and
>> putting in the Nvidia-recommended cards for
>> TensorFlow (ostensibly for machine vision),
>> for the sake of simplifying assumptions about
>> hardware. I'm awaiting their feedback about
>> whether we actually need to split the
>> servers. If we /do/ need to split the servers
>> for separate purposes, then I think that
>> means we'd push back the online computer
>> vision servers and GPUs purchase to early Q4
>> FY 17-18 rather than just getting it done Q1
>> FY 17-18 - which is when we need to be moving
>> aggressively on push notification so it would
>> be prudent to just get it done in one fell
>> swoop.
>>
>> Aaron, I know last week you had said you'd be
>> /open/ to collaborating on this...and I was
>> quite noncommittal!...But I appreciate your
>> saying you'd /like/ to collaborate here.
>> Would you like if I loop you on that thread
>> with Mark and Faidon? *Any others who should
>> / would like to join that thread?* Just be
>> prepared for the thread to be covering two
>> separate use cases - one on cross-platform
>> push notification and one on basic machine
>> vision.
>>
>> -Adam
>>
>> On Thu, Feb 16, 2017 at 10:57 AM, Aaron
>> Halfaker <ahalfaker at wikimedia.org
>> <mailto:ahalfaker at wikimedia.org>> wrote:
>>
>> OK so I think we'll want to (1) get a GPU
>> in the stat boxes ASAP and (2) decide
>> whether we want to plan GPU resources in
>> Prod for FY2018 or FY2019.
>>
>> For (2), I don't think my team's current
>> plans will bring us to using the GPU in
>> production in the next year, but I
>> suspect that Reading may want to push
>> some work re. image processing in that
>> time. If that's the case, I want my team
>> to be able to collaborate and support
>> getting that deployed in prod. To do
>> this well, I want GPU resources in
>> Wikimedia Labs too. That sounds like a
>> whole other can of worms.
>>
>>
>>
>> On Wed, Feb 15, 2017 at 4:40 PM, Ellery
>> Wulczyn <ewulczyn at wikimedia.org
>> <mailto:ewulczyn at wikimedia.org>> wrote:
>>
>> Having GPUs for training should be
>> sufficient for now, although if we
>> end up getting a ton of use, using
>> GPUs could be a lot faster and probably cheaper than provisioning
>> the same amount of CPU compute.
>>
>> On Thu, Feb 2, 2017 at 12:21 PM,
>> Aaron Halfaker
>> <ahalfaker at wikimedia.org
>> <mailto:ahalfaker at wikimedia.org>> wrote:
>>
>> If we only need the GPU for model
>> training, it'll be OK to use one
>> stat box. If we need the GPU for
>> scoring/predictions, we'll need a
>> whole new hardware plan.
>>
>> On Thu, Feb 2, 2017 at 1:06 PM, Andrew Otto <otto at wikimedia.org
>> <mailto:otto at wikimedia.org>> wrote:
>>
>> Hm, a good rule of thumb is: If it can be offline or not running and
>> not affect end users, then it is probably fine to use a stat box.
>>
>> On Thu, Feb 2, 2017 at 1:56 PM, Adam Baso <abaso at wikimedia.org
>> <mailto:abaso at wikimedia.org>> wrote:
>>
>> Got it - I take that to be the case whether it's batched operation
>> (e.g., on millions of files) or it's more of an in-the-user-flow
>> sort of thing. Is that right?
>>
>> On Thu, Feb 2, 2017 at 10:55 AM, Andrew Otto <acotto at gmail.com
>> <mailto:acotto at gmail.com>> wrote:
>>
>> I’d say, if you are going to just do analytics type stuff, then the
>> single stat machine will do. If you want to depend on a GPU for an
>> end-user production thing, then you’ll have to work with ops to find
>> another place to run it. :/ :)
>>
>> On Thu, Feb 2, 2017 at 11:40 AM, Aaron Halfaker
>> <ahalfaker at wikimedia.org <mailto:ahalfaker at wikimedia.org>> wrote:
>>
>> Ellery, will we need the GPUs in order to use a NN or will we only
>> need it for training models?
>>
>> On Thu, Feb 2, 2017 at 10:21 AM, Adam Baso <abaso at wikimedia.org
>> <mailto:abaso at wikimedia.org>> wrote:
>>
>> I envision two primary uses:
>>
>> 1) Large scale batch offline processing of existing media assets so
>> that the material is ready for curatorial flows.
>> 2) As part of an end user flow where multiple concurrent users are
>> uploading media and verifying and adding structured data on the fly
>> as part of production use.
>>
>> Can both of these be done on stats machines?
>>
>> Ought we have GPU acceleration in two machines instead of one machine?
>>
>> -Adam
>>
>>
>>
>> On Thu, Feb 2, 2017 at 8:15 AM, Andrew Otto <otto at wikimedia.org
>> <mailto:otto at wikimedia.org>> wrote:
>>
>> Oh ya! If you have a use case for this too, all the better!
>>
>> Do you need it for analytics type work? Or do you need it to process
>> stuff for a production feature?
>>
>> On Thu, Feb 2, 2017 at 9:10 AM, Aaron Halfaker
>> <ahalfaker at wikimedia.org <mailto:ahalfaker at wikimedia.org>> wrote:
>>
>> Hi Adam,
>>
>> + a bunch of CCs
>>
>> Last I heard, Dario thought we might be able to cover the cost with
>> Research budget. Otto thought that we could get a top of line GPU
>> and load it into an analytics machine some time in Q4 of this year.
>> Ellery was planning to use 3rd party GPU processing services until
>> it was ready.
>>
>> See https://phabricator.wikimedia.org/T148843
>> <https://phabricator.wikimedia.org/T148843>
>>
>> -Aaron
>>
>> On Wed, Feb 1, 2017 at 6:01 PM, Adam Baso <abaso at wikimedia.org
>> <mailto:abaso at wikimedia.org>> wrote:
>>
>> Aaron, okay if I schedule a 20 minute meeting with you to talk
>> servers with GPUs?
>>
>> Broadly, I'm trying to figure out what server CapEx I need to ask of
>> Mark (e.g., for TensorFlow object detection in anticipation of work
>> later in FY 17-18 / earlier FY 18-19). I had asked him the other day
>> about when he needs requests for next FY, and he basically said the
>> sooner the better.
>>
>> -Adam