Keynote takeaways:

- https://allisonrandal.com/ is really cool
- AT&T is doing: an Openstack shim (mostly Ironic integration) to manage bare metal, then k8s on top of that to deploy a full customer-facing Openstack, using k8s-native rolling deploys and management of Openstack components running in containers, and then Openstack-deployed instances that can themselves run k8s and other container orchestration platforms.  It's turtles all the way down, people.  This isn't unique either.  Uses openstack-helm.  A lot of talk this year around the integration and layering of Openstack and k8s platforms.  Really interesting stuff.
- Ironic is a huge winner in the last 6m - 1y: 9% => 20% adoption across surveyed customers.  It seems to have gone from a thing you had to stretch to fit your use case with a lot of time and attention to a popular deployment pattern for all hardware (including the hardware underpinning the Openstack clouds themselves).  That's really interesting considering it's on our roadmap to visit in the PN[0] age.
- Zuulv3 is being touted as a community part of the O[1] ecosystem.  They said these words "Now that Zuul v3 is released it's actually ready to be used by other people".  Should we laugh or cry?
- Good description of Openstack components as "programmable infrastructure".
- The O[1] foundation views its work as an integration engine for the full stack of components.

-- Sessions --

* O[1] Passports

- https://www.openstack.org/passport is a cool initiative that makes it possible to get trial credentials for all participating public clouds running O[1].  Really curious to see how folks have put their custom stamp on 'vanilla O[1]'.

* O[1] CephFS now fully awesome

- Manila is a project that aims to manage file servers (usually shared file servers) with several backends, including NFS and the new-hotness CephFS.  That means in theory we could replace some of our homebrew things with a combination of these components down the line.  We could even, in theory, use Manila on top of NFS.  It's pretty interesting and could open up better deployment and management models, like presenting these shares to instances from the host level the way block storage is, which could mean better management via flavor constraints and quota tie-in.
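
For concreteness, driving Manila from Python looks roughly like the sketch below.  This is my own minimal sketch, assuming python-manilaclient and keystoneauth1 are installed; the Keystone endpoint, credentials, and share name are placeholders, not anything from the session.

    from keystoneauth1.identity import v3
    from keystoneauth1.session import Session
    from manilaclient import client as manila_client

    # Placeholder auth details; swap in a real Keystone endpoint and credentials.
    auth = v3.Password(auth_url='https://keystone.example.org:5000/v3',
                       username='demo', password='secret', project_name='demo',
                       user_domain_name='Default', project_domain_name='Default')
    manila = manila_client.Client('2', session=Session(auth=auth))

    # Ask Manila for a 10 GB NFS share; with a CephFS backend the
    # share_proto would be 'CEPHFS' instead.
    share = manila.shares.create(share_proto='NFS', size=10, name='scratch-share')
    print(share.id, share.status)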

- CephFS seems super awesome.  I'm skeptical, though; my impression is that historically new Ceph features reach a fever pitch of adoption well before maturity.

* Overview and Status of Ceph File System

- More CephFS talk and magic.  It's new, it's here.  It integrates with Manila, and you can now deploy Ceph as a backend to Glance, Cinder, Nova (yeah..), and Manila.
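
The "one cluster, many services" idea is easy to poke at with the Ceph Python bindings.  A small sketch of my own, assuming the conventional pool names (images for Glance, volumes for Cinder, vms for Nova); those are just the common defaults, not something from the talk.

    import rados
    import rbd

    # Connect using the local ceph.conf and default keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Each Openstack service typically gets its own RBD pool in the same cluster.
    for pool in ('images', 'volumes', 'vms'):
        ioctx = cluster.open_ioctx(pool)
        print(pool, rbd.RBD().list(ioctx))
        ioctx.close()

    cluster.shutdown()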

* How does Neutron Do That?

- Pretty misleading title, let me say!  We never touched Neutron, nor did they show any actual Neutron internals as I'd expected.  It was, however, a pretty decent hands-on with the components that underpin Neutron: namespaces, veth interfaces, Linux bridging, etc.  I did learn a few tidbits, but it was sort of a pitch for Rackspace training.  Worth going to; could have been 30m and not 1h30m.
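
The plumbing they walked through is reproducible in a handful of commands.  A rough sketch of the namespace-plus-veth pattern Neutron builds on; the names ('demo', the veth names) and the 10.0.0.0/24 addresses are made up, and it needs root.

    import subprocess

    def sh(cmd):
        """Run one ip/ping command, failing loudly if it errors."""
        subprocess.run(cmd.split(), check=True)

    # A network namespace with one end of a veth pair inside it and the
    # other end left on the host -- the building blocks Neutron wires up.
    sh('ip netns add demo')
    sh('ip link add veth-host type veth peer name veth-ns')
    sh('ip link set veth-ns netns demo')
    sh('ip addr add 10.0.0.1/24 dev veth-host')
    sh('ip link set veth-host up')
    sh('ip netns exec demo ip addr add 10.0.0.2/24 dev veth-ns')
    sh('ip netns exec demo ip link set veth-ns up')

    # Traffic now crosses the veth pair between host and namespace.
    sh('ping -c 1 10.0.0.2')

    # Cleanup removes the veth pair along with the namespace.
    sh('ip netns delete demo')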

* Dos and Don'ts of Ceph and O[1]

This guy finally addressed specifically our problem of overzealous tenants' IO affecting other tenants.  The reason I haven't been able to find a baked-in Ceph QoS component is that there isn't one :)  The workarounds discussed were generic and won't work for us atm, but could in a future deployment.  Lots of discussion of configuration gotchas and when to use defaults and when not to.  I would go back and watch this before a trial Ceph deployment.  Straightforward stuff.  Much of it is too in the weeds given our current lack of Ceph, but food for thought.  He talked about when Ceph is appropriate, and the age-old wisdom that fast network storage is 10x faster than local IO and how that has fared with the growth in local storage solutions.  Ironically, he talked for a bit about when not to use Ceph and to go instead for colocated LVM block allocations local to the compute node.  Convinced me we should be doing this with Cinder even without a real Ceph solution.  Cinder volume types can be used to make QoS profiles at the host level, assuming the correct IO scheduler etc. (a sketch follows the numbers below).

Cinder backends:
- Ceph RBD 65% of deployments
- LVM 24%
- NFS 11% (we are not alone!)
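
On the volume-type idea above: the sketch below is roughly how I understand the wiring, assuming python-cinderclient, placeholder credentials, and that a 'front-end' consumer QoS spec is what gets enforced on the compute host.  The spec names are the commonly documented ones, not something lifted from the talk.

    from keystoneauth1.identity import v3
    from keystoneauth1.session import Session
    from cinderclient import client as cinder_client

    # Placeholder Keystone endpoint and credentials.
    auth = v3.Password(auth_url='https://keystone.example.org:5000/v3',
                       username='demo', password='secret', project_name='demo',
                       user_domain_name='Default', project_domain_name='Default')
    cinder = cinder_client.Client('3', session=Session(auth=auth))

    # 'front-end' consumer means the limits are applied by the hypervisor when
    # the volume is attached, i.e. the host-level enforcement mentioned above.
    qos = cinder.qos_specs.create('capped-io',
                                  {'consumer': 'front-end',
                                   'total_iops_sec': '500',
                                   'total_bytes_sec': str(100 * 1024 * 1024)})
    vtype = cinder.volume_types.create('general-capped')
    cinder.qos_specs.associate(qos, vtype.id)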

* Kubernetes on O[1]

This was mostly about the k8s-side cloud-provider-native tie-ins -- none of which we use, and honestly none of which were that interesting for us atm.  But the ability to use Neutron LBaaS in place of the provider-agnostic intra-k8s load balancing in the future could be a win.  I wonder how much of our ingress problem we could sidestep.  A lot of assumptions are made for cloud provider integration, much of which we could work through pretty easily.  He got into container security a bit, especially colocated container customers on bare metal, and said this quote that is funny when you consider our model: "You should have a k8s deployment per hostile tenant..."
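
For reference, the tie-in is just the normal Kubernetes LoadBalancer service type.  With the Openstack cloud provider enabled, something like the sketch below (kubernetes Python client; the service name and selector are made up) would end up provisioning a Neutron LBaaS load balancer instead of whatever the cluster does internally.

    from kubernetes import client, config

    # Uses whatever kubeconfig the local environment has.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # A plain LoadBalancer service; the Openstack cloud provider (if
    # configured) is what turns this into a Neutron LBaaS load balancer.
    svc = client.V1Service(
        metadata=client.V1ObjectMeta(name='web'),
        spec=client.V1ServiceSpec(
            type='LoadBalancer',
            selector={'app': 'web'},
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )
    v1.create_namespaced_service(namespace='default', body=svc)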

* Neutron to Neutron routing

Basically a pitch to use BGP VPNs between unrelated Neutron l3 instances.  This is pretty relevant to us, and he talked a bit about the ambiguity of cellsv2 vs regions vs whatever strategy you choose.  Worth thinking about; he reviewed a few of the options with no big surprises.  There are a lot of assumptions about high-trust environments without the need for IPSec etc.  https://docs.openstack.org/networking-bgpvpn/latest/
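
I haven't tried it, but from my read of the networking-bgpvpn docs the API surface is small.  A sketch using raw REST via requests; the endpoint, token, route target, and router UUID are placeholders, and the paths are my understanding of the docs rather than something I've verified.

    import requests

    NEUTRON = 'https://neutron.example.org:9696'   # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'REPLACE_WITH_TOKEN',
               'Content-Type': 'application/json'}

    # Create an l3 BGPVPN keyed on a route target (value here is made up).
    resp = requests.post(NEUTRON + '/v2.0/bgpvpn/bgpvpns', headers=HEADERS,
                         json={'bgpvpn': {'name': 'interconnect',
                                          'type': 'l3',
                                          'route_targets': ['64512:1']}})
    bgpvpn_id = resp.json()['bgpvpn']['id']

    # Associate one of our existing Neutron routers with it.
    requests.post(NEUTRON + '/v2.0/bgpvpn/bgpvpns/%s/router_associations' % bgpvpn_id,
                  headers=HEADERS,
                  json={'router_association': {'router_id': 'ROUTER-UUID-HERE'}})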

* Inside Neutron routing architectures

Given by one of the early leaders in the Neutron (then Quantum) space.  Neutron was originally just a test harness, but enough features were added that folks started deploying it in production, and the rest is history.  That somewhat helps explain the early wild-west days of Neutron.  Neutron has grown and changed a ton because of the ongoing understanding of use cases and its almost entirely organic growth.  For instance, they stopped putting metadata servers in each namespace in Ocata and now have haproxy and one metadata server to rule them all, etc.

Neutron /only/ uses network namespace isolation.  Unspoken in the session, but that means that all processes for all tenants see the global process space.  I haven't figured out how that's exploitable yet but it sure seems like there could be interesting effects.
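
Easy to see for yourself: `ip netns exec` only swaps the network namespace, so a process list taken inside one of the qrouter/qdhcp namespaces is identical to the host's.  A quick sketch (grabs whatever namespace `ip netns list` shows first on a network node; needs root, and ignore the small race from processes starting or stopping between the two calls):

    import subprocess

    # Pick the first namespace the node knows about (e.g. a qrouter-...).
    ns = subprocess.check_output(['ip', 'netns', 'list'], text=True).split()[0]

    host_ps = subprocess.check_output(['ps', '-e', '-o', 'pid='], text=True)
    ns_ps = subprocess.check_output(['ip', 'netns', 'exec', ns,
                                     'ps', '-e', '-o', 'pid='], text=True)

    # Network isolation only: the PID view is the same inside and out.
    print('identical process list:', host_ps == ns_ps)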

The model of topology we are targeting is called "legacy", so at the end of the session I asked if this nomenclature meant it would go away at some point in the future.  Both the session leader and the chair of the Neutron subcommittee (who was sitting right in front of me) guaranteed it's not even a conversation.  The phrasing was "no one is even talking about it" and "There would be a riot, and it won't happen".  So that was worth going to the session right there.

Much work with OpenFlow integration and related tools: OVN, MidoNet, OpenDaylight, OpenContrail (a Juniper crossover project, I think?).

Acknowledged that with Neutron there is no right answer to any particular problem; it is so flexible you can shoot yourself in the foot.  Described it as a general trade-off between operational complexity and data-path efficiency, which is a nice summation of the general concerns.  There was no giant 'aha' moment here, which I take to be a good thing.  Neutron has moved on from some of the assumptions and models of Mitaka and earlier, but for the most part we are thinking in the right terms.

* Questions to make storage vendors squirm

Demonstration of the interplay between latency and throughput, i.e. moving payload x with latency y will only ever demonstrate throughput z, and that's not an indictment of the storage system but of the testing methodology.
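
The arithmetic is worth writing down because it's so unforgiving.  A tiny worked example with made-up but typical numbers:

    # One outstanding 4 KiB request at a time with a 1 ms round trip: the math
    # caps throughput regardless of the backend's headline numbers.
    block_size = 4 * 1024      # bytes per request
    latency = 0.001            # seconds per round trip
    queue_depth = 1            # requests in flight

    iops = queue_depth / latency
    throughput_mb_s = iops * block_size / 1e6
    print('%d IOPS -> %.1f MB/s' % (iops, throughput_mb_s))   # 1000 IOPS -> 4.1 MB/s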

- Lots of talk along the lines of: don't be fooled by caching in demos, ask about the IO cost of fancy features, talk about backups early.

"Storage benchmarking is mostly a lie"

I talked to him a bit afterward about the problem before this problem: profiling load to know what parameters to actually look for in a solution, and my difficulties getting a good picture of live load on old systems that need to be replaced.  My approach was based on anecdotal pattern capture and a lot of supposition, and surely that's bad.  He acknowledged that's pretty much the only solution.  There is no codified way to profile a system in place, and the general difficulty of the problem plus its cross-vendor/discipline nature means no one wants to tackle creating one.  So that's both mildly validating and exhausting.  Nice guy, talked a mile a minute, and was clearly excited about storage.

* Running Openshift on Openstack

Based on http://www.bu.edu/hic/research/highlighted-sponsored-projects/massachusetts-open-cloud/

I went to this for obvious reasons :)  It was mostly a features pitch from Red Hat.  The guy who spoke most of the time was a Red Hat employee who knew the product inside and out, intermixed with example use cases.

Learned that Openshift FaaS is settling on OpenWhisk.  Kuryr is an integration project coming to tie Openshift (and vanilla k8s) more directly into Neutron, to avoid flannel and overlays in general, and (in some cases) the double-encapsulation insanity of VXLAN inside VXLAN that would otherwise happen.  That's been on my mind, so it was an interesting note.
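
The double-encapsulation cost is mostly an MTU/overhead story.  Back-of-envelope, with the standard ~50 bytes VXLAN adds per packet:

    # Outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) ~= 50 bytes per layer.
    vxlan_overhead = 14 + 20 + 8 + 8
    mtu = 1500

    payload_single = mtu - vxlan_overhead        # overlay once: 1450 bytes usable
    payload_double = mtu - 2 * vxlan_overhead    # overlay-in-overlay: 1400 bytes

    print(payload_single, payload_double)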

This setup is so incredibly similar to ours in ideology and deployment that we could have given a mirror presentation. 

Afterward I spoke with Robert, who was the person from Boston U there to represent the project, and a few things were clarified:
* This was only really deployed over the summer and has had mostly trivial usage.  A large 300-person use case was meant to happen but did not.
* They have some awesome open data partnerships with MIT
* I got his contact info and I think we should follow up.  I want to know in 6 months if they are as high on Openshift as they are now.

Openshift is the only thing operating in its space that has real maturity right now.  Red Hat is basically killing it here, partly through happenstance: a well-executed project plus every alternative being gobbled up and mismanaged, AFAICT.  Kudos to them.

* Move to Cellsv2

(This was a followup to a talk given in Boston 6m ago; I need to track that down.)

In Ocata Nova splits into multiple DBs for its components, and every deployment inherently becomes a cells deployment even if there is only the implicit cell0.  So everyone is moving to cellsv2 no matter what.  Cells all the way down.

Reasons to spin off a cell:
- message queue is melting from load
- isolate failure domains
- remove cells of compute

Noted that clustering rabbit, and rabbit load generally, is tough.

A cellv2 is basically a distinct nova-api instance, associated DB, message queue (rabbit), and compute farm.  
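
For my own reference, registering a new cell is basically telling the API database where that cell's message queue and database live.  A sketch wrapping nova-manage; the rabbit/MySQL URLs are placeholders, and the flag spellings should be checked against the installed release.

    import subprocess

    # Register cell1's message queue and database with the API database.
    subprocess.run([
        'nova-manage', 'cell_v2', 'create_cell',
        '--name', 'cell1',
        '--transport-url', 'rabbit://nova:secret@rabbit-cell1:5672/',
        '--database_connection', 'mysql+pymysql://nova:secret@db-cell1/nova_cell1',
    ], check=True)

    # Show everything the API database knows about, including the implicit cell0.
    subprocess.run(['nova-manage', 'cell_v2', 'list_cells'], check=True)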

Lots of weaknesses even up to and including Pike:
- no migration between cells
- sorting and reporting is weird
- no scheduler retries
- anti-affinity does not work

3/4 of that is in theory fixed in Queens.  They made an appeal for more engaged users, especially at scale.

Summary (because I found cellsv2 a bit hard to reason about before this session): cellsv2 is entirely a Nova sharding scheme.  Cells are ignorant of any and all associated Neutron topology.  Many of the examples I have seen had implicit relationships between Neutron scaling and cellsv2, but that is only a function of the operator's outline.  Modern-ish Neutron has "routed network segments" that can be aligned with a cellv2, providing l3 segmentation that overlaps the cellv2 architecture, but management-wise it's all decoupled in the control plane.  Neutron network segmentation is a thing we should look at in more detail, but it's not there in Mitaka.  It seems odd to me that the scaling methodologies for Nova and Neutron are completely ships in the night, and most examples I have seen do not address this well, so this was a good primer.

* Thoughts

Big bold letters on my notepad from day 1: Ceph has won.  It's over.  -- Cinder has something like 20+ backends (most of them proprietary), and even the proprietary backends are trying to figure out how to play nice with Ceph as non-commodity raw storage.  It's not a total sweep, though, as Ceph is big and complex and really not one-size-fits-all.  Ceph has no native QoS-style bindings or management at this time.  [combo notes from other days] Both Andrew and I talked with different folks, or were in different presentations, where this was acknowledged separately.  The workarounds I saw were to use flavor integration with cgroup IOPS limiting at the host level.  This requires several assumptions to work, most of which we don't satisfy atm, and comes with many of the downsides of our current low-brow scheme with TC for NFS.  But it's on the "roadmap" and in the minds of the Ceph devs for sure.
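
The flavor-side workaround, as I understand it, is just extra specs in the quota: namespace that the hypervisor turns into per-disk IO limits.  A sketch with python-novaclient; the Keystone endpoint, credentials, flavor name, and limits are placeholders.

    from keystoneauth1.identity import v3
    from keystoneauth1.session import Session
    from novaclient import client as nova_client

    # Placeholder Keystone endpoint and credentials.
    auth = v3.Password(auth_url='https://keystone.example.org:5000/v3',
                       username='demo', password='secret', project_name='demo',
                       user_domain_name='Default', project_domain_name='Default')
    nova = nova_client.Client('2.1', session=Session(auth=auth))

    # quota:* extra specs become per-disk IO throttles on the compute host for
    # every instance booted from this flavor.
    flavor = nova.flavors.create(name='m1.throttled', ram=4096, vcpus=2, disk=40)
    flavor.set_keys({'quota:disk_total_iops_sec': '500',
                     'quota:disk_total_bytes_sec': str(100 * 1024 * 1024)})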

Openshift is killing it.

This conference was really useful for me, and I got to ask Neutron, Ceph, and Cinder developers pointed questions about things that are not at all clear from the documentation (esp regarding roadmaps).  There was a really great dev-and-operator interaction vibe, with onboarding for project contribution alongside sessions purely about operating at scale.  Well run and very dense with engaging content.  I mainly stayed with storage and networking sessions, but that's self-selection based on the deep problem sets we have, where I was trying to gather as much insight as I could.  I think these two arenas are also some of the most complex, with a high rate of change over the last 6 releases.  We /need/ to find a way to come to these on an ongoing basis.  I think Andrew came away feeling very much the same.


[0] Post Neutron
[1] Openstack

--
Chase Pettet
chasemp on phabricator and IRC