Rogue VM Sprawl? Really?
I keep hearing about the impending doom of (specifically) rogue VM sprawl — our infrastructure overrun with the unchecked proliferation of virtual machines running amok across our enterprises. Oh the horror!
Most of the examples point to the consolidation of server VMs onto hosts, as delivered by virtualization.
I have to ask you though, given what it takes to spin up a VM on a platform such as VMware, how can you have a "rogue" VM sprawling its way across your enterprise!?
Someone — an authorized administrator — had to have loaded it into inventory, configured its placement on a virtual switch, and spun it up via VirtualCenter or some facsimile thereof depending upon platform.
That's the definition of a rogue? I can see where this may be a definitional issue, but the marketeers are getting frothed up over this very issue, whispering in your ear constantly about the impending demise of your infrastructure…and undetectable hypervisor rootkits, too. 🙂
It may be that the ease with which a VM *can* be spun up legitimately can lead to the overly exuberant deployment of VMs without understanding the impact this might have on the infrastructure, but can we please stop grouping stupidity and poor capacity planning/impact analysis with rogueness? They're two different things.
If administrators are firing off VMs that are unauthorized, unhardened, and unaccounted for, you have bigger problems than that of virtualization and you ought to consider firing them off.
The inventory of active VMs is a reasonably easy thing to keep track of; if it's running, I can see it.
I know "where" it is and I can turn it off. To me, the bigger problem is represented by the offline VMs which can live outside that inventory window, just waiting to be reactivated from their hypervisorial hibernation.
But does that represent "rogue?"
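To make the "outside that inventory window" point concrete, here is a minimal sketch of one way to look for offline VMs: compare the VM configuration files sitting on disk against whatever the management platform says is registered. The function names and the idea of feeding in a plain list of registered paths are my own assumptions for illustration; how you export that list depends on your platform. The only VMware-specific detail is that VM configuration files use the `.vmx` extension.

```python
from pathlib import Path


def scan_for_vm_configs(root):
    """Return every .vmx file found under root, as resolved path strings."""
    return {str(p.resolve()) for p in Path(root).rglob("*.vmx")}


def outside_inventory(found_on_disk, registered):
    """Return VM config paths that exist on disk but are not registered.

    Both arguments are iterables of path strings. `registered` is assumed
    to come from an inventory export (hypothetical; platform-specific).
    """
    return sorted(set(found_on_disk) - set(registered))
```

Anything this returns is not necessarily hostile. It is simply a VM that could be powered on without ever having shown up in the list of things you thought you were managing.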
You want an example of a threat which represents truly rogue VM "sprawl" that people ought to be afraid of? OK, here's one, and it happened to me. I talk about it all the time and people usually say "Oh, man, I never thought of that…", mostly because we're focused on server virtualization and not the client side.
We take distributed sniffer traces, track back through firewall, IDS and IPS logs, and isolate the MAC address in the CAM tables of the 96-port switch to which the offending DHCP server appears to be plugged, although we can't ping it.
My analyst is now on a mission to unplug the port, so he undocks his laptop and the alarms silence.
I look over at him. He has a strange look on his face. He docks his laptop again. Seconds later the alarms go off again.
The Culprit: Turns out said analyst was doing research at home on our W2K AD/DHCP server hardening scripts. He took our standard W2K server image, loaded it as a VM in VMware Workstation and used it at home to validate functionality.
The image he used had AD/DHCP services enabled.
When he was done at home the night before, he minimized VMware and closed his laptop.
When he came in to work the next morning, he simply docked and went about reading email, forgetting the VMW instance was still running. Doing what it does, it started responding to DHCP requests on the network.
Because he was using shared IP addresses for his VM and was "behind" the personal firewall on his machine, which by policy prohibits ICMP requests (but obviously not bootp/DHCP), we couldn't ping it or profile the workstation…
Now, that's a rogue VM. An accidental rogue VM. Imagine if it were an attacker. Perhaps he/she was a legitimate user but disgruntled. Perhaps he/she decided to use wireless instead of wired. How much fun would that be?
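For what it's worth, the hunt described above boils down to a check you can automate. Here is a rough sketch, assuming you already have DHCP offers captured from your sniffer traces and parsed into simple records. The `DhcpOffer` record and the function names are hypothetical, made up for illustration. The OUI prefixes are ones registered to VMware, and a match is only a hint: it helps when the VM is bridged onto the wire, whereas with shared/NAT networking the frames carry the host's MAC instead.

```python
from dataclasses import dataclass

# OUI prefixes registered to VMware. A match is a hint, not proof.
VMWARE_OUIS = {"00:05:69", "00:0c:29", "00:1c:14", "00:50:56"}


@dataclass(frozen=True)
class DhcpOffer:
    """Hypothetical record for one captured DHCP offer."""
    server_ip: str  # DHCP server identifier (option 54)
    src_mac: str    # source MAC address of the frame


def normalize_mac(mac):
    """Lowercase the MAC and use colons as separators."""
    return mac.strip().lower().replace("-", ":")


def looks_virtual(mac):
    """True if the MAC's first three octets match a known VMware OUI."""
    return normalize_mac(mac)[:8] in VMWARE_OUIS


def rogue_offers(offers, authorized_servers):
    """Return (server_ip, mac, looks_virtual) for each unauthorized responder.

    `authorized_servers` is the set of IPs you expect to answer DHCP.
    Duplicate sightings of the same responder are collapsed.
    """
    seen = {}
    for offer in offers:
        if offer.server_ip in authorized_servers:
            continue
        mac = normalize_mac(offer.src_mac)
        seen[(offer.server_ip, mac)] = looks_virtual(mac)
    return [(ip, mac, virt) for (ip, mac), virt in sorted(seen.items())]
```

It won't tell you whose laptop it is, but it turns "the alarms are going off" into a short list of server IPs and MACs to chase through the CAM tables.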
Stop with the "rogue (server) VM" FUD, wouldya?
Kthxbye.
/Hoff
I dunno, Hoff. With some of the things I have seen in some IT environments, I'm to the point now I believe that pigs can fly.
Backwards.
JamesNT
Hoff, you are taking some things for granted in your analysis, for example:
"The inventory of active VMs is a reasonably easy thing to keep track of; if it's running, I can see it."
is not necessarily correct… We have at least 350 *different* VMware images, not counting snapshots or those that may be stored on individual desktop systems or downloaded from the vmappliance site on short notice for quick tests. These 350+ images are *required* to have unpatched vulnerabilities, but even if they weren't, it is hard to imagine how to keep them constantly updated and remove the ability to revert to an unpatched state.
Every single one of those images is automatically spun up and shut down several times a day, and the same is done repeatedly during the night for automated regression testing almost every night of the year.
Our case may be an extremely pathological example of how some uses of virtualization can compound security problems already known and bring in a few new ones.
@Ivan:
To be clear, what I mean by "active VMs" are those that are running, as I said.
That's why the sentence that followed read "To me, the bigger problem is represented by the offline VMs which can live outside that inventory window, just waiting to be reactivated from their hypervisorial hibernation."
This, I believe, is the problem you were referring to, no? Although your spin up/down cycle is really extreme. That's a great example though! It's one that will ultimately BECOME a problem with RTI and Cloud if the governance and provisioning layers associated with the automation remain unchecked.
As came up in my discussions with Lori on Twitter, I believe definitions are important here. Rogue and sprawl impart meanings to me that are different from what they mean to others.
What I found is that what many people mean when they talk about sprawl of VMs is "unmanaged" VMs. When I ask what unmanaged means, I get definitions like:
"Unmanaged = no monitoring and no checks and balances. Nothing saying 'Hey, this VM is consuming 30% of resource cluster A.'" <– This notion of unseen/unknown versus unmanaged is important for me, especially since we're talking about "security."
You're clearly "managing" your VMs.
I just want to draw out the distinction between, and the interconnectedness of, "sprawl" and "rogue" from the management and security perspectives, which are different.
Does this make any sense?
/Hoff
I’ve heard of rogue VMs, which is really more a function of client-side (type 2) virtualization, and VM sprawl, which I agree can be a bit strange given this stuff is deterministic (though a problem nonetheless)… but I’ve never heard anyone talk about “rogue vm sprawl.” A google of “rogue vm sprawl” only results in this post and references to this post.
The only similar reference I could find was on VMBlog where the assertion was that trying to minimize sprawl could cause users to go rogue – an assumption that I think is reasonable.
Hoff,
I think you're being too pedantic in your obsession with the "managed" keyword. In Ivan's case, he states that they have 350 VMs that, as you correctly address, are managed. However, sprawl comes from those VMs that aren't part of his 350 VM pool or are run outside his specific vCenter domain. He doesn't mention if/how he deals with new VMs, known or unknown.
One of the larger problems with sprawl and management is the segmentation of vCenter management responsibilities. It is extremely common to see different groups managing their own vCenter domains: the server team may manage the live webserver VMs; the QA team may manage a cluster of VMs that are spun up and down for regression testing (as Ivan mentions with intentionally old code); the Dev team manages a cluster of build servers; etc. These will 1) be completely independent management domains with no coordination between teams and 2) each be open to localized sprawl. I picture a good ol' fashioned "star of stars" network design picture with VMs starring off like bunnies from each management domain hub. 🙂
But even if each domain is tightly managed, let's take the build farm example. Barring network-level inspection and protection, there's very little to stop a developer with root on an ESX box from dropping a "rogue" VM on the ESX server (or even with VMware Server on his own box), giving it a free IP, and then pushing builds to that new VM. If that VM relies on build data from other, registered VMs, then you have a rogue VM that's participating in the cluster w/o being managed. To me, that's a completely realistic scenario, and that's the fear. It's no different than users throwing up rogue DHCP servers by mistake; a real threat that still happens on enterprise networks every day.
But as always, love that you're thinking about this stuff. 🙂
-Alan
Hoff,
The other thing to bear in mind is that VMs do consume resources such as CAM table space and FC world wide names, and add to things like STP complexity. When adding physical servers, there is some natural throttling of the rate of increase of the consumption of resources (or growth in complexity) because most organizations are limited in the rate at which they can physically deploy new servers. The concern I hear chatting with network and storage folks is that, with VMs, that natural throttling goes away, so it could become very easy to outpace their ability to respond and adapt.
Omar