It Started with a Simple Customer Question
Should I oversubscribe memory in our OpenStack POC environment?
Then another customer asked the exact same question later the same day.
I told both customers that they should not do that with OpenStack because it’s not a mature feature. I also needed to validate each customer’s priorities for their journey to cloud and, ultimately, to influence those priorities to ensure success with their first customers: the developers.
Many IaaS initiatives are born out of an interest in increasing efficiency and optimizing the utilization of hardware, and these two customers had this priority. What concerned me was that the question of optimization came up as part of the first step in their journey: their IaaS Proof of Concept. Having had little face time with these customers up to this point, and realizing they were very busy, I answered their specific question about why oversubscribing memory is not recommended in OpenStack, and then I elaborated. Here is my response.
One reason not to oversubscribe memory in OpenStack is that it’s not a mature feature. VMware has demonstrated competency with memory oversubscription for years, but it’s a long way off for Nova. FWIW, AWS doesn’t do this in EC2, and I don’t know of any public cloud providers that do. Moreover, from an operational best practice perspective, I would defer ever trying this (even with a mature capability in the underlying software) until I had really successful adoption of the IaaS platform, with well-known metrics about its performance and behavior under production workloads. Memory oversubscription, even if implemented robustly by an IaaS technology, presents greater risk to an environment when the workloads begin to fluctuate or, more likely, when a memory leak in an application code push creates an edge case. Compounding memory leaks across multiple instances pose a tangible risk to the fleet of instances running in the IaaS platform.
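To make the compounding-leak risk concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (host size, flavor size, starting usage, leak rate) is a made-up assumption for illustration, not a measurement from any real environment, and the `allocation_ratio` parameter is only meant to play the same role as an overcommit setting such as Nova’s `ram_allocation_ratio`. It does not model ballooning, swap, or anything else a hypervisor might do under pressure.

```python
from typing import Optional


def max_instances(physical_mb: int, flavor_mb: int, allocation_ratio: float) -> int:
    """How many instances of one flavor the scheduler would place on a host.

    allocation_ratio is hypothetical here: 1.0 means no oversubscription,
    1.5 means the host promises 50% more memory than it physically has.
    """
    return int(physical_mb * allocation_ratio // flavor_mb)


def hours_until_exhaustion(
    physical_mb: int,
    flavor_mb: int,
    allocation_ratio: float,
    start_used_mb: int,
    leak_mb_per_hour: int,
    max_hours: int = 24 * 30,
) -> Optional[int]:
    """First hour at which total guest usage exceeds physical memory.

    Assumes every instance leaks at the same rate and that a guest can
    never use more than its flavor size. Returns None if the host never
    runs out within max_hours.
    """
    count = max_instances(physical_mb, flavor_mb, allocation_ratio)
    for hour in range(max_hours + 1):
        per_instance = min(flavor_mb, start_used_mb + leak_mb_per_hour * hour)
        if count * per_instance > physical_mb:
            return hour
    return None


# Illustrative assumptions: a 64 GiB host, 8 GiB flavors, each guest
# starting at 4 GiB used and leaking 256 MiB per hour after a bad code push.
HOST_MB = 64 * 1024
FLAVOR_MB = 8 * 1024

if __name__ == "__main__":
    for ratio in (1.0, 1.5):
        print(
            ratio,
            max_instances(HOST_MB, FLAVOR_MB, ratio),
            hours_until_exhaustion(HOST_MB, FLAVOR_MB, ratio, 4096, 256),
        )
```

Under these assumptions, the host without oversubscription never promises more than it has, so even if every guest leaks up to its full flavor size the host is not exhausted. The oversubscribed host looks healthy at first, then runs out of physical memory within hours once all of its guests leak at the same time.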
For organizations that are still adopting cloud/IaaS and embracing what it is to treat infrastructure as code, such optimizations should be considered relative to investments in automation that make your people (operators and infrastructure consumers) more efficient. Having a stable platform that is simple and least at risk of exposing everyone to edge cases is key in the early adoption era for your cloud. Problem-free, successful use will be important to the earliest users (your developers). Their accolades will come if you deliver what matters to them, which is accessibility: the ability to self-serve, automate/codify, and repeat with consistency. The traditional values of cost reduction that are key to IT matter less to them, so it’s important to set expectations with upper management that investing in IaaS (whether public or private) means prioritizing faster access to infrastructure, so the business can innovate and make its people more efficient first. Optimization can be prioritized later, once predictable and efficient access is in place.
Oversubscription in any capacity is also a tricky early initiative because your first workloads may be something you have running elsewhere, and any time you move a workload, its performance and behavior will be different in its new home. Having as few differences as possible between environments is ideal, so applying the KISS principle to the architecture itself (and not introducing optimizations or other changes) is important in step 1.
An important framework to align to in your cloud journey is the Cloud Maturity Model.
The table below illustrates a bit of what I’m talking about in terms of when to optimize (or at least in what relative order). Optimizing (for cost) is typically at the end of the journey. The rest of the model uses these stages of maturity to illustrate how the business, the technology, and various IaaS/PaaS/SaaS platforms need to mature through stages of adoption.
Here is another illustration that contrasts the values and perceptions of your customers (developers) with what IT’s values and perceptions often are…
Illustration credit to James Staten from Microsoft who presented this (and more) at OpenStack Silicon Valley 2015. Here’s his whole presentation.
I learned the lessons above the hard way by drinking from a fire hose at Zynga while running its Infrastructure Operations & Engineering team through the growth from 2008-2012. My comrades were very capable and it was our focus on simplicity wherever possible that made our private cloud a success. Artifacts like the Cloud Maturity Model are a great resource to respect through everyone’s journey.