Archive

Archive for the ‘Cloud Computing’ Category

Amazon Web Services: It’s Not The Size Of the Ship, But Rather The Motion Of the…

October 16th, 2009 3 comments
From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

Carl Brooks (@eekygeeky) gets some fantastic, thought-provoking interviews.  His recent article wherein he interviewed Peter DeSantis, VP of EC2, Amazon Web Services, was titled: “Amazon would like to remind you where the hype started” is another great example.

However, this article left a bad taste in my mouth and ultimately invites more questions than it answers. Frankly I felt like there was a large amount of hand-waving in DeSantis’ points that glossed over some very important issues related to security issues of late.

DeSantis’ remarks implied, per the title of the article, that to explain the poor handling and continuing lack of AWS’ transparency related to the issues people like me raise,  the customer is to blame due to hype and overly aggressive, misaligned expectations.

In short, it’s not AWS’ fault they’re so awesome, it’s ours.  However, please don’t remind them they said that when they don’t live up to the hype they help perpetuate.

You can read more about that here “Transparency: I Do Not Think That Means What You Think That Means…

I’m going to skip around the article because I do agree with Peter DeSantis on the points he made about the value proposition of AWS which ultimately appear at the end of the article:

“A customer can come into EC2 today and if they have a website that’s designed in a way that’s horizontally scalable, they can run that thing on a single instance; they can use [CloudWatch[] to monitor the various resource constraints and the performance of their site overall; they can use that data with our autoscaling service to automatically scale the number of hosts up or down based on demand so they don’t have to run those things 24/7; they can use our Elastic Load Balancer service to scale the traffic coming into their service and only deliver valid requests.”

“All of which can be done self-service, without talking to anybody, without provisioning large amounts of capacity, without committing to large bandwidth contracts, without reserving large amounts of space in a co-lo facility and to me, that’s a tremendously compelling story over what could be done a couple years ago.”

Completely fair.  Excellent way of communicating the AWS value proposition.  I totally agree.  Let’s keep this definitional firmly in mind as we go on.

Here’s where the story turns into something like a confessional that implies AWS is sadly a victim of their own success:

DeSantis said that the reason that stories like the DDOS on Bitbucket.org (and the non-cloud Sidekick story) is because people have come to expect always-on, easily consumable services.

“People’s expectations have been raised in terms of what they can do with something like EC2. I think people rightfully look at the potential of an environment like this and see the tools, the multi- availability zone, the large inbound transit, the ability to scale out and up and fundamentally assume things should be better. “ he said.

That’s absolutely true. We look at what you offer (and how you offered/described it above) and we set our expectations accordingly.

We do assume that things should be better as that’s how AWS has consistently marketed the service.

You can’t reasonably expect to bitch about people’s perception of the service based on how it’s “sold” and then turn around when something negative happens and suggest that it’s the consumers’ fault for setting their expectational compass with the course you set.

It *is* absolutely fair to suggest that there is no release from not using common sense, not applying good architectural logic to deployment of services on AWS, but it’s also disingenuous to expect much of the target market to whom you are selling understands the caveats here when so much is obfuscated by design.  I understand AWS doesn’t say they protect against every threat, but they also do not say they do not…until something happens where that becomes readily apparent 😉

When everything is great AWS doesn’t go around reminding people that bad things can happen, but when bad things happen it’s because of incorrectly-set expectations?

Here’s where the discussion turns to an interesting example —  the BitBucket DDoS issue.

For instance, DeSantis said it would be trivial to wash out standard DDOS attacks by using clustered server instances in different availability zones.

Okay, but four things come to mind:

  1. Why did it take 15 hours for AWS to recognize the DDoS in the first place? (They didn’t actually “detect” it, the customer did)
  2. Why did the “vulnerability” continue to exist for days afterward?
  3. While using different availability zones makes sense, it’s been suggested that this DDoS attack was internal to EC2, not externally-generated
  4. While it *is* good practice and *does* make sense, “clustered server instances in different avail. zones, costs money

Keep those things in the back of your mind for a moment…

“One of the best defenses against any sort of unanticipated spike is simply having available bandwidth. We have a tremendous amount on inbound transit to each of our regions. We have multiple regions which are geographically distributed and connected to the internet in different ways. As a result of that it doesn’t really take too many instances (in terms of hits) to have a tremendous amount of availability – 2,3,4 instances can really start getting you up to where you can handle 2,3,4,5 Gigabytes per second. Twenty instances is a phenomenal amount of bandwidth transit for a customer.” he said.

So again, here’s where I take issue with this “bandwidth solves all” answer. The solution being proposed by DeSantis here is that a customer should be prepared to launch/scale multiple instances in response to a DoS/DDoS, in effect making it the customers’ problem instead of AWS detecting and squelching it in the first place?

Further, when you think of it, the trickle-down effect of DDoS is potentially good for AWS’ business. If they can absorb massive amounts of traffic, then the more instances you have to scale, the better for them given how they charge.  Also, per my point #3 above, it looks as though the attack was INTERNAL to EC2, so ingress transit bandwidth per region might not have done anything to help here.  It’s unclear to me whether this was a distributed DoS attack at all.

Lori MacVittie wrote a great post on this very thing titled “Putting a Price on Uptime” which basically asks who pays for the results of an attack like this:

A lack of ability in the cloud to distinguish illegitimate from legitimate requests could lead to unanticipated costs in the wake of an attack. How do you put a price on uptime and more importantly, who should pay for it?

This is exactly the point I was raising when I first spoke of Economic Denial Of Sustainability (EDoS) here.  All the things AWS speaks to as solutions cost more money…money which many customers based upon their expectations of AWS’ service, may be unprepared to spend.  They wouldn’t have much better options (if any) if they were hosting it somewhere else, but that’s hardly the point.

I quote back to something I tweeted earlier “The beauty of cloud and infinite scale is that you get the benefits of infinite FAIL”

The largest DDOS attacks now exceed 40Gbps. DeSantis wouldn’t say what AWS’s bandwidth ceiling was but indicated that a shrewd guesser could look at current bandwidth and hosting costs and what AWS made available, and make a good guess.

The tests done here showed the capability  to generate 650 Mbps from a single medium instance that attacked another instance which, per Radim Marek, was using another AWS account in another availability zone.  So if the “largest” DDoS attacks now exceed 40 Gbps” and five EC2 instances can handle 5Gb/s, I’d need 8 instances to absorb an attack of this scale (unknown if this represents a small or large instance.)  Seems simple, right?

Again, this about absorbing bandwidth against these attacks, not preventing them or defending against them.  This is about not only passing the buck by squeezing more of them out of you, the customer.

“ I don’t want to challenge anyone out there, but we are very, very large environment and I think there’s a lot of data out there that will help you make that case.” he said.

Of course you wish to challenge people, that’s the whole point of your arguments, Peter.

How much bandwidth AWS has is only one part of the issue here.  The other is AWS’ ability to respond to such attacks in reasonable timeframes and prevent them in the first place as part of the service.  That’s a huge part of what I expect from a cloud service.

So let’s do what DeSantis says and set our expectations accordingly.

/Hoff

Transparency: I Do Not Think That Means What You Think That Means…

October 12th, 2009 7 comments

vizziniHa ha! You fool! You fell victim to one of the classic blunders – The most famous of which is “never get involved in a cloud war in Asia” – but only slightly less well-known is this: “Never go against Werner when availability is on the line!”

As an outsider, it’s easy to play armchair quarterback, point fingers and criticize something as mind-bogglingly marvelous as something the size and scope of Amazon Web Services.  After all, they make all that complexity disappear under the guise of a simple web interface to deliver value, innovation and computing wonderment the likes of which are really unmatched.

There’s an awful lot riding on Amazon’s success.  They set the pace by which an evolving industry is now measured in terms of features, functionality, service levels, security, cost and the way in which they interact with customers and the community of ecosystem partners.

An interesting set of observations and explanations have come out of recent events related to degraded performance, availability and how these events have been handled.

When something bad happens, there’s really two ways to play things:

  1. Be as open as possible, as quickly as possible and with as much detail as possible, or
  2. Release information only as needed, when pressured and keep root causes and their resolutions as guarded as possible

This, of course, is an over-simplification of the options, complicated by the need for privacy, protection of intellectual property, legal issues, compliance or security requirements.  That’s not really any different than any other sort of service provider or IT department, but then again, Amazon’s Web Services aren’t like any other sort of service provider or IT department.

So when something bad happens, it’s been my experience as a customer (and one that admittedly does not pay for their “extra service”) that sometimes notifications take longer than I’d like, status updates are not as detailed as I might like and root causes sometimes cloaked in the air of the mysterious “network connectivity problem” — a replacement for the old corporate stand-by of “blame the firewall.”  There’s an entire industry cropping up to help you with these sorts of things.

Something like the BitBucket DDoS issue however, is not a simple “network connectivity problem.”  It is, however, a problem which highlights an oft-played pantomime of problem resolution involving any “managed” service being provided by a third party to which you as the customer have limited access at various critical points in the stack.

This outage represents a disconnect in experience versus expectation with how customers perceive the operational underpinnings of AWS’ operations and architecture and forces customers to reconsider how all that abstracted infrastructure actually functions in order to deliver what — regardless of what the ToS say — they want to believe it delivers.  This is that perception versus reality gap I mentioned earlier.  It’s not the redonkulous “end-of-cloud” scenarios parroted by the masses of the great un(cloud)washed, but it’s serious nonetheless.

As an example, BitBucket’s woes of over 20+ hours of downtime due to UDP (and later TCP) DDoS floods led to the well-documented realization that support was inadequate, monitoring insufficient and security defenses lacking — from the perspective of both the customer and AWS*.  The reality is that based on what we *thought* we knew about how AWS functioned to protect against these sorts of things, these attacks should never have wrought the damage they did.  It seems AWS was equally as surprised.

It’s important to note that these were revelations made in near real-time by the customer, not AWS.

Now, this wasn’t a widespread problem, so it’s understandable to a point as to why we didn’t hear a lot from AWS with regards to this issue, but after this all played out, when we look at what has been disclosed publicly by AWS, it appears the issue is still not remedied and despite the promise to do better, a follow-on study seems to suggest that the problem may not yet be well understood or solved by AWS (See: Amazon EC2 vulnerable to UDP flood attacks) (Ed: After I wrote this, I got a notification that this particular issue has been fixed which is indeed, good news.)

Now, releasing details about any vulnerability like this could put many many customers at risk from similar attack, but the lack of transparency  of service and architecture means that we’re left with more questions than answers. How can a customer (like me) today defend themselves against an attack like this in the lurch of not knowing what causes it or how to defend against it? What happens when the next one surfaces?

Can AWS even reliably detect this sort of thing given the “socialist security” implementation of good enough security spread across its constituent customers?

Security by obscurity in cloud cannot last as the gold standard.

This is the interesting part about the black-box abstraction that is Cloud, not just for Amazon, but any massively-scaled service provider; the more abstracted the service, the more dependent upon the provider or third parties we will become to troubleshoot issues and protect our assets.  In many cases, however, it will simply take much more time to resolve issues since visibility and transparency are limited to what the provider chooses or is able to provide.

We’re in the early days still of what customers know to ask about how security is managed in these massively scaled multi-tenant environments and since in some cases we are contractually prevented from exercising tests designed to understand the limits, we’re back to trusting that the provider has it handled…until we determine they don’t.

Put that in your risk management pipe and smoke it.

The network and systems that make up our cloud providers offerings must do a better job in stopping bad things from occurring before they reach our instances and workloads or customers should simply expect that they get what they pay for.  If the provider capabilities do not improve, combined with less visibility and an inability to deploy compensating controls, we’re potentially in a much worse spot than having no protection at all.

This is another opportunity to quietly remind folks about the Audit, Assertion, Assessment and Assurance API (A6) API that is being brought to life; there will hopefully be some exciting news here shortly about this project, but I see A6 as playing a very important role in providing a solution to some of the issues I mention here.  Ready when you are, Amazon.

If only it were so simple and transparent:

Inigo Montoya: You are using Bonetti’s Defense against me, ah?
Man in Black: I thought it fitting considering the rocky terrain.
Inigo Montoya: Naturally, you must suspect me to attack with Capa Ferro?
Man in Black: Naturally… but I find that Thibault cancels out Capa Ferro. Don’t you?
Inigo Montoya: Unless the enemy has studied his Agrippa… which I have.

/Hoff

*It’s only fair to mention that depending upon a single provider for service, no matter how good they may be and not taking advantage of monitoring services (at an extra cost,) is a risk decision that comes with consequences, one of them being longer time to resolution.

Cloud: The Other White Meat…On Service Failures & Hysterics

October 12th, 2009 6 comments

Cloud: the other white meat…

To me, cloud is the “other white meat” to the Internet’s array of widely-available chicken parts.  Both are tasty and if I order parmigiana made with either, they may even look or taste the same.  If someone orders it in a restaurant, all they say they care about is how it tastes and how much they paid for it.  They simply trust that it’s prepared properly and hygienically.   The cook, on the other hand, cares about the ingredients that went into making it, its preparation and delivery.  Expectations are critical on both sides of the table.

It’s all a matter of perspective.

Over the last few days I have engaged in spirited debate regarding cloud computing with really smart people whose opinions I value but wholeheartedly disagree with.

The genesis of these debates stem from enduring yet another in what seems like a never-ending series of “XYZ Fails: End of Cloud Computing” stories, endlessly retweeted and regurgitated by the “press” and people who frankly wouldn’t know cloud from a hole in the (fire)wall.

When I (and others) have pointed out that a particular offering is not cloud-based for the purpose of dampening the madness and restoring calm, I have been surprised by people attempting to suggest that basically anything connected to the Internet that a “consumer” can outsource operations to is cloud computing.

In many cases, examples are raised in which set of offerings that were quite literally yesterday based upon traditional IT operations and architecture and aren’t changed at all are today magically “cloud” based.  God, I love marketing.

I’m not trying to be discordant, but there are services that are cloud-based and there are those that aren’t, there are even SaaS applications that are not cloud services because they lack certain essential characteristics that differentiate them as such.  It’s a battle of semantics — ones that to me are quite important.

Ultimately, issues with any highly-visible service cause us to take a closer look at issues like DR/BCP, privacy, resiliency, etc.  This is a good thing.  It only takes a left turn when non-cloud failure causality gets pinned on the donkey that is cloud.

The recent T-Mobile/Danger data loss incident is a classic example; it’s being touted over and over as a cloudtastrophe of epic proportions.  Hundreds of blog posts, tweets and mainstream press articles proclaiming the end of days. In light of service failures lately that truly are cloud issues, this is hysterical.  I’m simply out of breath in regards to debating this specific incident, so I won’t bother rehashing it here.

Besides, I would think that Miley Cyrus leaving Twitter is a far more profound cloudtastophe than this…

When I point out that T-Mobile/Danger isn’t a cloud service, I get pushback from folks that argue vehemently that it is.  When I ask these folks what the essential differentiating characteristics of this (or any) cloud service are from an architectural, technology and operations perspective, what I find is that the answers I get back are generally marketing ones, and these people are not in marketing.

It occurs to me that the explanation for this arises from two main perspectives that frame the way in which people discuss cloud computing:

  1. The experiential consumer’s view where anything past or present connected via the Internet to someone/thing where data and services are provided and managed remotely on infrastructure by a third party is cloud, or
  2. The operational provider’s view where the service architecture, infrastructure, automation and delivery models matter and fitting within a taxonomic box for the purpose of service description and delivery is important.

The consumer’s view is emotive and perceptive: “I just put my data in The Cloud” without regard to what powers it or how it’s operated.  This is a good thing. Consumers shouldn’t have to care *how* it’s operated. They should ultimately just know it works, as advertised, and that their content is well handled.  Fair enough.

The provider’s view, however, is much more technical, clinical, operationally-focused and defined by architecture and characteristics that consumers don’t care about: infrastructure, provisioning, automation, governance, orchestration, scale, programmatic models, etc…this is the stuff that makes the magical cloud tick but is ultimately abstracted from view.  Fair enough.

However, context switching between “marketing” and “architecture” is folly; it’s an invalid argument, as is speaking from the consumer’s perspective to represent that of a provider and vice-versa.

So when a service fails, those with a consumer’s perspective simply see something that no longer works as it used to.  They think of these — and just about anything else based on Internet connectivity — as cloud.  Thus, it becomes a cloud failure. Those with a provider’s view want to know which part of the machine failed and how to fix it, so understanding if this is truly a cloud problem matters.

If the consumer sees the service as cloud, the folks that I’m debating with claim then, that it is cloud, even if the provider does not.  This is the disconnect. That’s really what the folks I’m debating with want to tell me; don’t bang my head against the wall saying “this is cloud, that isn’t cloud” because the popular view (the consumer’s) will win and all I’m doing is making things more complex.

As I mentioned, I understand their point, I just disagree with it. I’m an architect/security wonk first and a consumer second. I’ll always be in conflict with myself, but I’m simply not willing to be cloudwashed into simply accepting that everything is cloud.  It’s not.

It’s all a matter of perspective.  Now, Miley, please come back to Twitter, the cloud’s just not the same without you… 😉

/Hoff

AMI Secure? (Or: Shared AMIs/Virtual Appliances – Bot or Not?)

October 8th, 2009 6 comments

angel-devilTo some of you, this is going to sound like obvious and remedial advice that you would consider common sense.  This post is not for you.

Some of you — and you know who you are — are going to walk away from this post with a scratching sound coming from inside your skull.

The convenience of pre-built virtual appliances offered up for use in virtualized environments such as VMware’s Virtual Appliance marketplace or shared/community AMIs on AWS EC2 make for a tempting reduction of time spent getting your virtualized/cloud environments up to speed; the images are there just waiting for a a quick download and then a point and click activation.  These juicy marketplaces will continue to sprout up with offerings of bundled virtual machines for every conceivable need: LAMP stacks, databases, web servers, firewalls…you name it.  Some are free, some cost money.

There’s a darkside to this convenience. You have no idea as to the trustworthiness of the underlying operating systems or applications contained within these tidy bundles of cloudy joy.  The same could be said for much of the software in use today, but cloud simply exacerbates this situation by adding abstraction, scale and the elastic version of the snuggie that convinces people nothing goes wrong in the cloud…until it does

While trust in mankind is noble, trust in software is a palm-head-slapper.  Amazon even tells you so:

AMIs are launched at the user’s own risk. Amazon cannot vouch for the integrity or security of AMIs shared by other users. Therefore, you should treat shared AMIs as you would any foreign code that you might consider deploying in your own data center and perform the appropriate due diligence.

Ideally, you should get the AMI ID from a trusted source (a web site, another user, etc). If you do not know the source of an AMI, we recommended that you search the forums for comments on the AMI before launching it. Conversely, if you have questions or observations about a shared AMI, feel free to use the AWS forums to ask or comment.

Remember that in IaaS-based service offerings, YOU are responsible for the security of your instances.  Do you really know where an AMI/VM/VA came from, what’s running on it and why?  Do you have the skills to be able to answer this question?  How would you detect if something was wrong? Are you using hardening tools?  Logging tools?  Does any of this matter if the “box” is rooted anyway?

As I talk about in my Frogs and Cloudifornication presentations — and as the guys from Sensepost have shown — there’s very little to stop someone from introducing a trojaned/rootkitted AMI or virtual appliance that gets utilized by potentially thousands of people.  Instead of having to compromise clients on the Internet, why not just pwn system images that have the use of elastic cloud resources instead?

Imagine someone using auto-scaling and using a common image to spool up hundreds (more?) instances — infected instances.  Two words: instant Botnet.

There’s no outbound filtering (via security groups) via AWS, so exfiltrating your data would be easy. Registering C&C botnet channels would be trivial, especially over common ports.  Oh, don’t forget that in most IaaS offerings, resource consumption is charged incrementally, so the “owner” gets to pay doubly for the fun — CPU, storage and network traffic could be driven sky high.  Another form of EDoS (economic denial of sustainability.)

Given the fact that we’ve seen even basic DDoS attacks go undetected by these large providers despite their claims, the potential is frightening.

As the AWS admonishment above suggests, apply the same (more, actually) common sense regarding using these shared AMIs and virtual machines as you would were you to download and execute applications on your workstation or visit a website, or…oh, man…this is just a losing proposition. ;(

If you can avoid it, please build your own AMIs or virtual machines or consider trusted sources that can be vetted and for which the provenance and relative integrity can be derived. Please don’t use shared images if you can avoid it.  Please ensure that you know what you’re getting yourself into.

Play safe.

/Hoff

* P.S. William Vambenepe (@vambenepe) reminded me of the other half of this problem when he said (on Twitter) “…it’s not just using someone’s AMI that’s risky. Sharing your AMI can be too http://bit.ly/1qMxgN ” < A great post on what happens when people build AMIs/VMs/VAs with, um, unintended residue left over…check out his great post.

Cloud Providers and Security “Edge” Services – Where’s The Beef?

September 30th, 2009 16 comments

usbhamburgerPreviously I wrote a post titled “Oh Great Security Spirit In the Cloud: Have You Seen My WAF, IPS, IDS, Firewall…” in which I described the challenges for enterprises moving applications and services to the Cloud while trying to ensure parity in compensating controls, some of which are either not available or suffer from the “virtual appliance” conundrum (see the Four Horsemen presentation on issues surrounding virtual appliances.)

Yesterday I had a lively discussion with Lori MacVittie about the notion of what she described as “edge” service placement of network-based WebApp firewalls in Cloud deployments.  I was curious about the notion of where the “edge” is in Cloud, but assuming it’s at the provider’s connection to the Internet as was suggested by Lori, this brought up the arguments in the post
above: how does one roll out compensating controls in Cloud?

The level of difficulty and need to integrate controls (or any “infrastructure” enhancement) definitely depends upon the Cloud delivery model (SaaS, PaaS, and IaaS) chosen and the business problem trying to be solved; SaaS offers the least amount of extensibility from the perspective of deploying controls (you don’t generally have any access to do so) whilst IaaS allows a lot of freedom at the guest level.  PaaS is somewhere in the middle.  None of the models are especially friendly to integrating network-based controls not otherwise supplied by the provider due to what should be pretty obvious reasons — the network is abstracted.

So here’s the rub, if MSSP’s/ISP’s/ASP’s-cum-Cloud operators want to woo mature enterprise customers to use their services, they are leaving money on the table and not fulfilling customer needs by failing to roll out complimentary security capabilities which lessen the compliance and security burdens of their prospective customers.

While many provide commoditized solutions such as anti-spam and anti-virus capabilities, more complex (but profoundly important) security services such as DLP (data loss/leakage prevention,) WAF, Intrusion Detection and Prevention (IDP,) XML Security, Application Delivery Controllers, VPN’s, etc. should also be considered for roadmaps by these suppliers.

Think about it, if the chief concern in Cloud environments is security around multi-tenancy and isolation, giving customers more comfort besides “trust us” has to be a good thing.  If I knew where and by whom my data is being accessed or used, I would feel more comfortable.

Yes, it’s difficult to do properly and in many cases means the Cloud provider has to make a substantial investment in delivery platforms and management/support integration to get there.  This is why niche players who target specific verticals (especially those heavily regulated) will ultimately have the upper hand in some of these scenarios – it’s not socialist security where “good enough” is spread around evenly.  Services like these need to be configurable (SELF-SERVICE!) by the consumer.

An example? How about Google: where’s DLP integrated into the messaging/apps platforms?  Amazon AWS: where’s IDP integrated into the VMM for introspection?

I wrote a couple of interesting posts about this (that may show up in the automated related posts lists below):

My customers in the Fortune 500 complain constantly that the biggest providers they are being pressured to consider for Cloud services aren’t listening to these requests — or aren’t in a position to respond.

That’s bad for everyone.

So how about it? Are services like DLP, IDP, WAF integrated into your Cloud providers’ offerings something you’d like to see rather than having to add additional providers as brokers and add complexity and cost back into Cloud?

/Hoff

The Emotion of VMotion…

September 29th, 2009 8 comments
VMotion - Here's Where We Are Today

VMotion - Here's Where We Are Today

A lot has been said about the wonders of workload VM portability.

Within the construct of virtualization, and especially VMware, an awful lot of time is spent on VM Mobility but as numerous polls and direct customer engagements have shown, the majority (50% and higher) do not use VMotion.  I talked about this in a post titled “The VM Mobility Myth:

…the capability to provide for integrated networking and virtualization coupled with governance and autonomics simply isn’t mature at this point. Most people are simply replicating existing zoned/perimertized non-virtualized network topologies in their consolidated virtualized environments and waiting for the platforms to catch up. We’re really still seeing the effects of what virtualization is doing to the classical core/distribution/access design methodology as it relates to how shackled much of this mobility is to critical components like DNS and IP addressing and layer 2 VLANs.  See Greg Ness and Lori Macvittie’s scribblings.

Furthermore, Workload distribution (Ed: today) is simply impractical for anything other than monolithic stacks because the virtualization platforms, the applications and the networks aren’t at a point where from a policy or intelligence perspective they can easily and reliably self-orchestrate.

That last point about “monolithic stacks” described what I talked about in my last post “Virtual Machines Are the Problem, Not the Solution” in which I bemoaned the bloat associated with VM’s and general purpose OS’s included within them and the fact that VMs continue to hinder the notion of being able to achieve true workload portability within the construct of how programmatically one might architect a distributed application using an SOA approach of loosely coupled services.

Combined with the VM bloat — which simply makes these “workloads” too large to practically move in real time — if one couples the annoying laws of physics and current constraints of virtualization driving the return to big, flat layer 2 network architecture — collapsing core/distribution/access designs and dissolving classical n-tier application architectures — one might argue that the proposition of VMotion really is a move backward, not forward, as it relates to true agility.

That’s a little contentious, but in discussions with customers and other Social Media venues, it’s important to think about other designs and options; the fact is that the Metastructure (as it pertains to supporting protocols/services such as DNS which are needed to support this “infrastructure 2.0”) still isn’t where it needs to be in regards to mobility and even with emerging solutions like long-distance VMotion between datacenters, we’re butting up against laws of physics (and costs of the associated bandwidth and infrastructure.)

While we do see advancements in network-driven policy stickiness with the development of elements such as distributed virtual switching, port profiles, software-based vSwitches and virtual appliances (most of which are good solutions in their own right,) this is a network-centric approach.  The policies really ought to be defined by the VM’s themselves (similar to SOA service contracts — see here) and enforced by the network, not the other way around.

Further, what isn’t talked about much is something that @joe_shonk brought up, which is that the SAN volumes/storage from which most of these virtual machines boot, upon which their data is stored and in some cases against which they are archived, don’t move, many times for the same reasons.  In many cases we’re waiting on the maturation of converged networking and advances in networked storage to deliver solutions to some of these challenges.

In the long term, the promise of mobility will be delivered by a split into three four camps which have overlapping and potentially competitive approaches depending upon who is doing the design:

  1. The quasi-realtime chunking approach of VMotion via the virtualization platform [virtualization architect,]
  2. Integration distribution and “mobility” at the application/OS layer [application architect,] or
  3. The more traditional network-based load balancing of traffic to replicated/distributed images [network architect.]
  4. Moving or redirecting pointers to large pools of storage where all the images/data(bases) live [Ed. forgot to include this from above]

Depending upon the need and capability of your application(s), virtualization/Cloud platform, and network infrastructure, you’ll likely need a mash-up of all three four.  This model really mimics the differences today in architectural approach between SaaS and IaaS models in Cloud and further suggests that folks need to take a more focused look at PaaS.

Don’t get me wrong, I think VMotion is fantastic and the options it can ultimately delivery intensely useful, but we’re hamstrung by what is really the requirement to forklift — network design, network architecture and the laws of physics.  In many cases we’re fascinated by VM Mobility, but a lot of that romanticization plays on emotion rather than utilization.

So what of it?  How do you use VM mobility today?  Do you?

/Hoff

Incomplete Thought: Virtual Machines Are the Problem, Not the Solution…

September 25th, 2009 40 comments

simplicity_complexityI’m an infrastructure guy. A couple of days ago I had a lightbulb go on.  If you’re an Apps person, you’ve likely already had your share of illumination.  I’ve just never thought about things from this perspective.  Please don’t think any less of me 😉

You can bet I’m talking above my pay grade here, but bear with my ramblings for a minute and help me work through this (Update: I’m very happy to see that Surendra Reddy [@sureddy – follow him] did just that with his excellent post – cross-posted in the comments below, here. Also, check out Simon Crosby’s (Citrix CTO) post “Wither the venerable OS“)

It comes down to this:

Virtual machines (VMs) represent the symptoms of a set of legacy problems packaged up to provide a placebo effect as an answer that in some cases we have, until lately, appeared disinclined and not technologically empowered to solve.

If I had a wish, it would be that VM’s end up being the short-term gap-filler they deserve to be and ultimately become a legacy technology so we can solve some of our real architectural issues the way they ought to be solved.

That said, please don’t get me wrong, VMs have allowed us to take the first steps toward defining, compartmentalizing, and isolating some pretty nasty problems anchored on the sins of our fathers, but they don’t do a damned thing to fix them.

VMs have certainly allowed us to (literally) “think outside the box” about how we characterize “workloads” and have enabled us to begin talking about how we make them somewhat mobile, portable, interoperable, easy to describe, inventory and in some cases more secure. Cool.

There’s still a pile of crap inside ’em.

What do I mean?

There’s a bloated, parasitic resource-gobbling cancer inside every VM.  For the most part, it’s the real reason we even have mass market virtualization today.

It’s called the operating system:

Virtualization

If we didn’t have resource-inefficient operating systems, handicapped applications that were incestuously hooked to them, and tons of legacy networking stuff to deal with that unholy affinity, imagine the fun we could have.  Imagine how agile and flexible we could become.

But wait, isn’t server virtualization the answer to that?

Not really.  Server virtualization like that pictured in the diagram above is just the first stake we’re going to drive into the heart of the frankenmonster that is the OS.  The OS is like Cousin Eddie and his RV.

The approach we’ve taken today is that the VMM/Hypervisor abstracts the hardware from the OS.  The applications are still stuck on top of operating systems that don’t provide much in the way of any benefit given the emergence of development frameworks/languages such as J2EE, PHP, Ruby, .NET, etc. that were built around the notions of decoupled, distributed and mashable application “fabrics.”

Every ship travels with an anchor, in the case of the VM it’s the OS.

Imagine if these applications didn’t have to worry about the resource-hogging, control-freak, I/O limiting, protected mode schizophrenia and de-privileged ring spoofing of hypervisors associated with trying not conflict with or offend the OS’s sacred relationship with the hardware beneath it.

Imagine if these application constructs were instead distributed programmatically, could intercommunicate using secure protocols and didn’t have to deal with legacy problems. Imagine if the VMM/Hypervisor really was there to enable scale, isolation, security, and management.  We’d be getting rid of an entire layer.

If that crap in the middle of the sandwich makes for inefficiency, insecurity and added cost in virtualized enterprises, imagine what it does at the Infrastructure as a Service (IaaS) layer in Cloud deployments where VMs — in whatever form — are the basis for the operational models.  We have these fat packaged VMs with OS overhead and attack surfaces that really don’t need to be there.

For example, most of the pre-packaged AMIs found on AWS are bloated general purpose operating systems with some hardening applied (if at all) but there’s just all that code… sitting there…doing nothing except taking up storage, memory and compute resources.

Why do we need this?   Why don’t we at at least see more of a push towards JEOS (Just Enough OS) in the meantime?

I think most virtualization vendors today who are moving their virtualization offerings to adapt to Cloud, are asking themselves the same questions and answering them by realizing that the real win in the long term — once enterprises are done with consolidation and virtualization and hit the next “enterprise application modernization” cycle — will be  to develop and engineer applications directly around platforms which obviate the OS.

So these virtualization players are  making acquisitions to prepare them for this next wave — the real emergence of Platform as a Service (PaaS.)

Some like Microsoft with Azure are simply starting there.  Even SaaS vendors have gone down-stack and provided PaaS offerings to further allow for connectivity, integration and security in the place they think it belongs.

In the case of VMware and their acquisition of SpringSource, that piece of bloat in the middle can be seen as simply going away; whatever you call it, it’s about disintermediating the OS completely and it seems to me that the entire notion of vApps addresses this very thing.  I’m sure there are a ton of other offerings that I simply didn’t get before that are going to make me go “AHA!” now.

I’m not sure organizationally or operationally that most enterprises can get their arms around what it means to not have that OS layer in the middle in the short term, but this new platform-oriented Cloud is really interesting.  It makes those folks who may have made the conversion from server-hugger to VM-hugger and think they were done adapting, quite uncomfortable.

It makes me uncomfortable…and giddy.

All the things I know and understand about how things at the Infrastructure layer s interacts with applications and workloads at the Infostructure layer will drastically change.  The security models will change.  The solutions will change.  Even the notion of vMotion — moving VM’s around — will change.  In fact, in this model, vMotion isn’t really relevant.

Admittedly, I’ve had to call into question over the last few days just how relevant the notion of “Infrastructure 2.0” is within this model — at least how it’s described today.

Cloud v1.0 with all it’s froth and hype is going to be nothing compared to Cloud 2.0 — the revenge of SOA, web services, BPM, enterprise architecture and the developer.  Luckily for the sake of us infrastructure folks we still have time to catch up and see the light as the VMM buys us visibility and a management plane.  However, the protocols and models for how applications interact with the network are sure going to change and accelerate due to Cloud — at least they should.

Just look at how developments such as XMPP and LISP are going to play in a PaaS-centric world…

Like I said, it’s an incomplete thought and I’m not enlightened enough to frame a better discussion in written form, but I can’t wait until the next Infrastructure 2.0 Working Group to bring this up.

I have an appreciation for such a bigger piece of the conversation now.  I just need to get more edumacated.

My head ahsplodes.

Any of this crap make sense to you?

/Hoff

Google & AWS: Just Goes To Prove You Can Have Your Cloud and, um, Eat It Too…

September 25th, 2009 4 comments

…and by “eat it” I mean that how you think I mean that.  I feel for these guys, they have big targets on their backs, but that’s what happens when you’re a market leader.

To wit, there are two polarized views expressed every time Google or Amazon have an outage or service interruption given that both are constantly held up as the poster children for Cloud Computing:

  1. Cloud Computing isn’t ready for prime time; if Google or Amazon can go down, why/how can I trust them with my most critical assets!?
  2. Google and Amazon are just service providers; service providers have issues.  This isn’t a Cloud issue, it’s just a service issue.

The truth is somewhere in the middle.

Here’s my $0.02.  You may not like it.  Refunds will be processed by mail.

If you market yourself as the shit, you can expect some back when it hits the fan:

From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

Stop apologizing and live up to the hype you’re helping create.

/Hoff

Redux: Patching the Cloud

September 23rd, 2009 3 comments

Back in 2008 I wrote a piece titled “Patching the Cloud” in which I highlighted the issues associated with the black box ubiquity of Cloud and what that means to patching/upgrading processes:

Your application is sitting atop an operating system and underlying infrastructure that is managed by the cloud operator.  This “datacenter OS” may not be virtualized or could actually be sitting atop a hypervisor which is integrated into the operating system (Xen, Hyper-V, KVM) or perhaps reliant upon a third party solution such as VMware.  The notion of cloud implies shared infrastructure and hosting platforms, although it does not imply virtualization.

A patch affecting any one of the infrastructure elements could cause a ripple effect on your hosted applications.  Without understanding the underlying infrastructure dependencies in this model, how does one assess risk and determine what any patch might do up or down the stack?  How does an enterprise that has no insight into the “black box” model of the cloud operator, setup a dev/test/staging environment that acceptably mimics the operating environment?

What happens when the underlying CloudOS gets patched (or needs to be) and blows your applications/VMs sky-high (in the PaaS/IaaS models?)

How does one negotiate the process for determining when and how a patch is deployed?  Where does the cloud operator draw the line?   If the cloud fabric is democratized across constituent enterprise customers, however isolated, how does a cloud provider ensure consistent distributed service?  If an application can be dynamically provisioned anywhere in the fabric, consistency of the platform is critical.

I followed this up with a practical example when Microsoft’s Azure services experienced a hiccup due to this very thing.  We see wholesale changes that can be instantiated on a whim by Cloud providers that could alter service functionality and service availability such as this one from Google (Published Google Documents to appear in Google search) — have you thought this through?

So now as we witness ISP’s starting to build Cloud service offerings from common Cloud OS platforms and espouse the portability of workloads (*ahem* VM’s) from “internal” Clouds to Cloud Providers — and potentially multiple Cloud providers — what happens when the enterprise is at v3.1 of Cloud OS, ISP A is at version 2.1a and ISP B is at v2.9? Portability is a cruel mistress.

Pair that little nugget with the fact that even “global” Cloud providers such as Amazon Web Services have not maintained parity in terms of functionality/services across their regions*. The US has long had features/functions that the european region has not.  Today, in fact, AWS announced bringing infrastructure capabilities to parity for things like elastic load balancing and auto-scale…

It’s important to understand what happens when we squeeze the balloon.

/Hoff

*corrected – I originally said “availability zones” which was in error as pointed out by Shlomo in the comments. Thanks!

Incomplete Thought: Forget VM Sprawl, Worry More About SaaSprawl…

September 19th, 2009 17 comments

A lot of fuss has been made about run-away VM sprawl in enterprises who are heavily virtualized due to the ease with which a VM can constructed and operationalized.

I’m not convinced about the reality versus the potential of VM Sprawl, meaning that I have no evidence from anyone facing this issue to date.  I wrote about this a while ago here.

As virtualization and the attendant vendors push more from enterprise virtualization to enterprise Clouds, what I’m actually more concerned with is SaaSprawl.

This scenario describes how enterprises will deal with managing what could amount to dozens of “CloudSourced” SaaS vendors as companies edge toward Cloud adoption by cherry picking applications for externalization using SaaS as the platforms, technologies and standards catch up to allow those pesky workloads that used to run internally, to do the same externally…

Outsource email, security, CRM, ERP, Legal/HR, Purchasing, Desktop apps — all from different vendors, each with different contracts, SLA’s, data integration issues, security concerns, audit constraints, regulatory compliance hiccups.

What we likely could end up with is another illustration of a “squeezing the balloon” problem; trading off CapEx for what I call OopsEx — realizing what might amount to substituting one problem for another as you trade reduced upfront (and on-going) capital investment for what amounts to on-going management, security, compliance and service-level management issues in the long term.

Thoughts?