Redux: Patching the Cloud
Back in 2008 I wrote a piece titled “Patching the Cloud” in which I highlighted the issues associated with the black-box ubiquity of Cloud and what that means for patching/upgrading processes:
Your application is sitting atop an operating system and underlying infrastructure that is managed by the cloud operator. This “datacenter OS” may not be virtualized or could actually be sitting atop a hypervisor which is integrated into the operating system (Xen, Hyper-V, KVM) or perhaps reliant upon a third party solution such as VMware. The notion of cloud implies shared infrastructure and hosting platforms, although it does not imply virtualization.
A patch affecting any one of the infrastructure elements could cause a ripple effect on your hosted applications. Without understanding the underlying infrastructure dependencies in this model, how does one assess risk and determine what any patch might do up or down the stack? How does an enterprise that has no insight into the “black box” model of the cloud operator set up a dev/test/staging environment that acceptably mimics the operating environment?
What happens when the underlying CloudOS gets patched (or needs to be) and blows your applications/VMs sky-high (in the PaaS/IaaS models)?
How does one negotiate the process for determining when and how a patch is deployed? Where does the cloud operator draw the line? If the cloud fabric is democratized across constituent enterprise customers, however isolated, how does a cloud provider ensure consistent distributed service? If an application can be dynamically provisioned anywhere in the fabric, consistency of the platform is critical.
I followed this up with a practical example when Microsoft’s Azure services experienced a hiccup due to this very thing. We also see wholesale changes, instantiated on a whim by Cloud providers, that can alter service functionality and availability, such as this one from Google (Published Google Documents to appear in Google search). Have you thought this through?
So now, as we witness ISPs starting to build Cloud service offerings from common Cloud OS platforms and espouse the portability of workloads (*ahem* VMs) from “internal” Clouds to Cloud Providers (and potentially multiple Cloud providers), what happens when the enterprise is at v3.1 of the Cloud OS, ISP A is at v2.1a, and ISP B is at v2.9? Portability is a cruel mistress.
Pair that little nugget with the fact that even “global” Cloud providers such as Amazon Web Services have not maintained parity in functionality/services across their regions*. The US region has long had features/functions that the European region has not. Today, in fact, AWS announced that it is bringing infrastructure capabilities to parity for things like elastic load balancing and auto-scaling…
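As a rough illustration (using today’s boto3 SDK, which obviously postdates this post), here’s a minimal sketch of how one might check which regions actually expose a given service. The service identifiers are boto3’s own, and the answer comes from the SDK’s bundled endpoint metadata rather than a live API call:

```python
# Minimal sketch: which AWS regions currently expose a given service?
# Uses boto3's bundled endpoint metadata, so no credentials are needed.
import boto3

session = boto3.Session()

for service in ("elasticloadbalancing", "autoscaling"):
    regions = session.get_available_regions(service)
    print(f"{service}: {', '.join(sorted(regions))}")
```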
It’s important to understand what happens when we squeeze the balloon.
/Hoff
*Corrected – I originally said “availability zones,” which was in error, as pointed out by Shlomo in the comments. Thanks!
I appreciate your well thought out and prescient articles!
A small nit: you confuse Amazon "Availability Zones" with "Regions".
Regions: US or EU (us-east-1, eu-west-1)
Availability Zones: "Data Center" within a region (us-east-1a, us-east-1b)
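To make that hierarchy concrete, here is a minimal sketch (again with today’s boto3 SDK, purely for illustration, and assuming AWS credentials are configured) that lists each region and the availability zones inside it:

```python
# Minimal sketch: enumerate regions and the availability zones inside each.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
region_names = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in region_names:
    zones = boto3.client("ec2", region_name=region).describe_availability_zones()
    print(region, "->", [z["ZoneName"] for z in zones["AvailabilityZones"]])
```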
Thanks! I spaced. Noted the correction.
/Hoff
Eh, a continuation of the (nothing we can do about it) hostage situation major providers have us in when it comes to patch management. Whether they patch there or we patch here, no one really has a good idea of what they’re implementing (in fact, the developer who made the change can’t offer any guarantees), and the same old testing approach applies (“Look, I installed the patch on twenty machines and nothing broke; roll it out”).
We’re left with reasonable levels of testing balanced against the need to upgrade/fix technology systems.
And yeah, one company that can’t keep its production systems consistent smacks of disorganization. Time to bring back the gold build.