The A6 (Audit, Assertion, Assessment, and Assurance API) Working Group is Live. Please join & read the intro.

November 16th, 2009 No comments

For those of you following along at home, the A6 (Audit, Assertion, Assessment, and Assurance API) Working Group is Live.

I’ve setup the Google group so please join & read the introduction here.

Hope to see you there.

/Hoff

Categories: A6 Tags: , , ,

Silent Lucidity: IaaS — Already A Dinosaur? The Evolution of PaaSasaurus Rex…

November 12th, 2009 8 comments

dinosaurSitting in an impressive room at the Google campus in Mountain View last month, I asked the collective group of brainpower a slightly rhetorical question:

How much longer do you feel pure-play Infrastructure-As-A-Service will be a relevant service model within the spectrum of cloud services?

I couched the question with previous “incomplete thoughts*” relating to the move “up-stack” by IaaS providers — providing value-added, at-cost services to both differentiate and soften the market for what I call the “PaaSification” of the consumer.  I also highlighted the move “down-stack” by SaaS vendors building out platforms to support a broader ecosystem and value proposition.

In the long term, I think ultimately the trichotomy of the SPI model will dissolve thanks to commoditization and the need for providers to differentiate — even at mass scale.  We’ll ultimately just talk about service delivery and the platform(s) used to deliver them.  Infrastructure will enable these services, of course, but that’s not where the money will come from.

Just look at the approach of providers such as Amazon, Terremark and Savvis and how they are already clawing their way up the PaaS stack, adding more features and functions that either equalize public cloud capabilities with those of the enterprise or even differentiate from it.  Look at Microsoft’s Azure.  How about Heroku, Engine Yard, Joyent?  How about VMware and Springsource?  All platform plays. Develop, click, deploy.

As I mention in my Cloudifornication presentation, I think that from a security perspective, PaaS offers the potential of eliminating entire classes of vulnerabilities in the application development lifecycle by enforcing sanitary programmatic practices across the derivate works built upon them.  I look forward also to APIs and standards that allow for consistency across providers. I think PaaS has the greatest potential to deliver this.

There are clearly trade-offs here, but as we start to move toward the two key differentiators (at least for public clouds) — management and security — I think the value of PaaS will really start to shine.

Probably just another bout of obviousness, but if I were placing bets, this is where I’d sink my nickels.

You?

/Hoff

* The most relevant “incomplete thought” is the one titled “Incomplete Thought: Virtual Machines Are the Problem, Not the Solution…” in which I kicked around the notion that virtualization-enabled IaaS and the VM containers they enable are simply an ugly solution to an uglier problem…

Dear Santa: All I Want For Christmas On My Amazon Wishlist Is a Straight Answer…

October 31st, 2009 4 comments

A couple of weeks ago amidst another interesting Amazon Web Services announcement featuring the newly-arrived Relational Database Service, Werner Vogels (Amazon CTO) jokingly retweeted a remark that someone made suggesting he was like “…Santa for nerds.”

All I want for Christmas is my elastic IP...

All I want for Christmas is my elastic IP...

So, now that I have Werner following me on Twitter and a confirmed mailing address (clearly the North Pole) I thought I’d make my Christmas wish early this year.  I’ve put a lot of thought into this.

Just when I had settled on a shiny new gadget from the bookstore side of the house, I saw Amazon’s response to Eran Tromer’s (et al) research on Cloud Cartography featured in this Computerworld article written by my old friend Jaikumar Vijayan titled “Amazon downplays report highlighting vulnerabilities in its cloud service.”

I feature Eran and his team’s work in my Cloudifornication presentation.  You can read more about it on Craig’s blog here.

I quickly cast aside my yuletyde treasure list and instead decided to ask Santa (Werner/AWS) for a most important present: a straight answer from AWS that isn’t delivered by a PR spokeshole that instead speaks openly, transparently and in an engaging fashion with customers and the security community.

Here’s what torqued me (emphasis is mine):

In response, Amazon spokeswoman Kay Kinton said today that the report describes cloud cartography methods that could increase an attacker’s probability of launching a rogue virtual machine (VM) on the same physical server as another specific target VM.

What remains unclear, however, is how exactly attackers would be able to use that presence on the same physical server to then attack the target VM, Kinton told Computerworld via e-mail.

The research paper itself described how potential attackers could use so-called “side-channel” attacks to try and try and steal information from a target VM. The researchers had argued that a VM sitting on the same physical server as a target VM, could monitor shared resources on the server to make highly educated inferences about the target VM.

By monitoring CPU and memory cache utilization on the shared server, an attacker could determine periods of high-activity on the target servers, estimate high-traffic rates and even launch keystroke timing attacks to gather passwords and other data from the target server, the researchers had noted.

Such side-channel attacks have proved highly successful in non-cloud contexts, so there’s no reason why they shouldn’t work in a cloud environment, the researchers postulated.

However, Kinton characterized the attack described in the report as “hypothetical,” and one that would be “significantly more difficult in practice.”

“The side channel techniques presented are based on testing results from a carefully controlled lab environment with configurations that do not match the actual Amazon EC2 environment,” Kinton said.

“As the researchers point out, there are a number of factors that would make such an attack significantly more difficult in practice,” she said.

So while the Amazon spokesperson admits the vulnerability/capability exists, rather than rationally address that issue, thank the researchers for pointing this out and provide customers some level of detail regarding how this vulnerability is mitigated, we get handwaving that attempts to have us not focus on the vulnerability, but rather the difficulty of a hypothetical exploit.  That example isn’t the point of the paper. The fact that I could deliver a targeted attack is.

Earth to Amazon: this sort of thing doesn’t work. It’s a lousy tactic.  It simply says that either you think we’re all stupid or you’re suffering from a very bad case of incident handling immaturity. Take a look around you, there are plenty of companies doing this right. You’re not one of them.  Consistently.

Tromer and crew gave a single example of how this vulnerability might be exploited that was latched on to by the AWS spokesperson as a way of defusing the seriousness of the underlying vulnerability by downplaying this sample exploit.  There are potentially dozens of avenues to be explored here.  Craig talked about many of them in his blog (above.)  What we got instead was this:

At the same time, Amazon takes all reports of vulnerabilities in its cloud infrastructure very seriously, she said. The company will continue to investigate potential exploits thoroughly and continue to develop features bolster security for users of its cloud service, she said.

Amazon Web Services has already rolled out safeguards that prevent potential attackers from using the cartography techniques described in the paper, Kinton said without offering any details.

She also pointed to the recently launched Amazon Web Service Multi-Factor Authentication (AWS MFA) as another example of the company’s continuing effort to bolster cloud security. AWS MFA is designed to provide an extra layer access control to a customer’s Web services account, Kinton said.

Did you catch “…without offering any details” or were you simply overwhelmed by the fact that you can use a token to authenticate your single-key driven AWS console instead?

I’m not interested in getting into a “full disclosure” battle here, but being dismissive, not providing clear-cut answers and being evasive without regard for transparency about issues like this or the DDoS attacks we saw with Bitbucket, etc. are going to backfire.  I posted about this before in previous blogs here and here.

If you want to be taken seriously by large enterprises and government agencies that require real answers to issues like this, you can engage with the security community or ignore us and get focused on by it (and me) until you decide that it’s a much better idea to do the former.  You’ll gain much more credibility and an eagerness to work with you instead of against you if you choose to use the force wisely 😉

Until then, may I suggest this?  I found it in the Amazon.com bookstore:

beinghonest…you can download it to your Kindle in under a minute.

/Hoff

Cloud/Cloud Computing Definitions – Why they Do(n’t) Matter…

October 28th, 2009 No comments

A couple of weeks ago I wrote a piece titled Cloud: The Other White Meat…On Service Failures & Hysterics in which I summarized why Cloud/Cloud Computing (or what I now refer to as Cloudputing 😉 has become such a definitional Super-Fund clean up site:

To me, cloud is the “other white meat” to the Internet’s array of widely-available chicken parts.  Both are tasty and if I order parmigiana made with either, they may even look or taste the same.  If someone orders it in a restaurant, all they say they care about is how it tastes and how much they paid for it.  They simply trust that it’s prepared properly and hygienically.   The cook, on the other hand, cares about the ingredients that went into making it, its preparation and delivery.  Expectations are critical on both sides of the table.

It’s all a matter of perspective.

and

It occurs to me that the explanation for this arises from two main perspectives that frame the way in which people discuss cloud computing:

  1. The experiential consumer’s view where anything past or present connected via the Internet to someone/thing where data and services are provided and managed remotely on infrastructure by a third party is cloud, or
  2. The operational provider’s view where the service architecture, infrastructure, automation and delivery models matter and fitting within a taxonomic box for the purpose of service description and delivery is important.

The consumer’s view is emotive and perceptive: “I just put my data in The Cloud” without regard to what powers it or how it’s operated. This is a good thing. Consumers shouldn’t have to care *how* it’s operated. They should ultimately just know it works, as advertised, and that their content is well handled.  Fair enough.

The provider’s view, however, is much more technical, clinical, operationally-focused and defined by architecture and characteristics that consumers don’t care about: infrastructure, provisioning, automation, governance, orchestration, scale, programmatic models, etc…this is the stuff that makes the magical cloud tick but is ultimately abstracted from view.  Fair enough.

However, context switching between “marketing” and “architecture” is folly; it’s an invalid argument, as is speaking from the consumer’s perspective to represent that of a provider and vice-versa.

Here are the graphical representations of those statements from my Cloudifornication presentation:

Cloud-Provider's View

Cloud-Provider's View

Cloud-Consumer's View

Cloud-Consumer's View

Don’t Hassle the Hoff: Recent Press & Podcast Coverage & Upcoming Speaking Engagements

October 26th, 2009 1 comment

Microphone

Here is some of the recent coverage from the last month or so on topics relevant to content on my blog, presentations and speaking engagements.  No particular order or priority and I haven’t kept a good record, unfortunately.

Press/Technology & Security eZines/Website/Blog Coverage/Meaningful Links:

Podcasts/Webcasts/Video:

Recent Speaking Engagements/Confirmed to  speak at the following upcoming events:

  • Enterprise Architecture Conference, D.C.
  • Intel Security Summit 2009, Hillsboro OR
  • SecTor 2009, Toronto CA
  • EMC Innovation Forum, Franklin MA
  • NY Technology Forum, NY, NY
  • Microsoft Bluehat v9, Redmond WA
  • Office of the Comptroller & Currency, San Antonio TX
  • Intercloud Working Group, GooglePlex CA 😉
  • CSC Leading Edge Forum, VA
  • DojoCon, VA

I also forgot to thank Eric Siebert for putting together the VMware Top 20 blog list and putting me on it as well as the fact that Rational Survivability made the Datamation 2009 Top 200 Tech Blogs list.

/Hoff

Can We Secure Cloud Computing? Can We Afford Not To?

October 22nd, 2009 2 comments

[The following is a re-post from the Microsoft (Technet) blog I did as a lead up to my Cloudifornication presentation at Bluehat v9 I’ll be posting after I deliver the revised edition tomorrow.]

There have been many disruptive innovations in the history of modern computing, each of them in some way impacting how we create, interact with, deliver, and consume information. The platforms and mechanisms used to process, transport, and store our information likewise endure change, some in subtle ways and others profoundly.

Cloud computing is one such disruption whose impact is rippling across the many dimensions of our computing experience. Cloud – in its various forms and guises — represents the potential cauterization of wounds which run deep in IT; self-afflicted injuries of inflexibility, inefficiency, cost inequity, and poor responsiveness.

But cost savings, lessening the environmental footprint, and increased agility aren’t the only things cited as benefits. Some argue that cloud computing offers the potential for not only equalling what we have for security today, but bettering it. It’s an interesting argument, really, and one that deserves some attention.

To address it, it requires a shift in perspective relative to the status quo.

We’ve been at this game for nearly forty years. With each new (r)evolutionary period of technological advancement and the resultant punctuated equilibrium that follows, we’ve done relatively little to solve the security problems that plague us, including entire classes of problems we’ve known about, known how to fix, but have been unable or unwilling to fix for many reasons.

With each pendulum swing, we attempt to pay the tax for the sins of our past with technology of the future that never seems to arrive.

Here’s where the notion of doing better comes into play.

Cloud computing is an operational model that describes how combinations of technology can be utilized to better deliver service; it’s a platform shuffle that is enabling a fierce and contentious debate on the issues surrounding how we secure our information and instantiate trust in an increasingly open and assumed-hostile operating environment which is in many cases directly shared with others, including our adversaries.

Cloud computing is the natural progression of the reperimeterization, consumerization, and increasingly mobility of IT we’ve witnessed over the last ten years. Cloud computing is a forcing function that is causing us to shine light on the things we do and defend not only how we do them, but who does them, and why.

To set a little context and simplify discussion, if we break down cloud computing into a visual model that depicts bite-sized chunks, it looks like this:

Infostructure/Metastructure/Infrastructure

Infostructure/Metastructure/Infrastructure

At the foundation of this model is the infrastructure layer that represents the traditional computer, network and storage hardware, operating systems, and virtualization platforms familiar to us all.

Cresting the model is the infostructure layer that represents the programmatic components such as applications and service objects that produce, operate on, or interact with the content, information, and metadata.

Sitting in between infrastructure and infostructure is the metastructure layer. This layer represents the underlying set of protocols and functions such as DNS, BGP, and IP address management, which “glue” together and enable the applications and content at the infostructure layer to in turn be delivered by the infrastructure.

We’ve made incremental security progress at the infrastucture and infostructure layers, but the technology underpinnings at the metastructure layer have been weighed, measured, and found lacking. The protocols that provide the glue for our fragile Internet are showing their age; BGP, DNS, and SSL are good examples.

Ultimately the most serious cloud computing concern is presented by way of the “stacked turtles” analogy: layer upon layer of complex interdependencies predicated upon fragile trust models framed upon nothing more than politeness and with complexities and issues abstracted away with additional layers of indirection. This is “cloudifornication.”

The dynamism, agility and elasticity of cloud computing is, in all its glory, still predicated upon protocols and functions that were never intended to deal with these essential characteristics of cloud.

Without re-engineering these models and implementing secure protocols and the infrastructure needed to support them, we run the risk of cloud computing simply obfuscating the fragility of the supporting layers until the stack of turtles topples as something catastrophic occurs.

There are many challenges associated with the unique derivative security issues surrounding cloud computing, but we have the ability to remedy them should we so desire.

Cloud computing is a canary in the coal mine and it’s chirping wildly for now but that won’t last.  It’s time to solve the problems, not the symptoms.

/Hoff

[Edited the last sentence for clarity]

Reblog this post [with Zemanta]

Incomplete Thought: The Cloud Software vs. Hardware Value Battle & Why AWS Is Really A Grid…

October 18th, 2009 2 comments

Some suggest in discussing the role and long-term sustainable value of infrastructure versus software in cloud that software will marginalize bespoke infrastructure and the latter will simply commoditize.

I find that an interesting assertion, given that it tends to ignore the realities that both hardware and software ultimately suffer from a case of Moore’s Law — from speeds and feeds to the multi-core crisis, this will continue ad infinitum.  It’s important to keep that perspective.

In discussing this, proponents of software domination almost exclusively highlight Amazon Web Services as their lighthouse illustration.  For the purpose of simplicity, let’s focus on compute infrastructure.

Here’s why pointing to Amazon Web Services (AWS) as representative of all cloud offerings in general to anchor the hardware versus software value debate is not a reasonable assertion:

  1. AWS delivers a well-defined set of services designed to be deployed without modification across a massive number of customers; leveraging a common set of standardized capabilities across these customers differentiates the service and enables low cost
  2. AWS enjoys general non-variability in workload from *their* perspective since they offer fixed increments of compute and memory allocation per unit measure of exposed abstracted and virtualized infrastructure resources, so there’s a ceiling on what workloads per unit measure can do. It’s predictable.
  3. From AWS’ perspective (the lens of the provider) regardless of the “custom stuff” running within these fixed-sized containers, the main focus of their core “cloud” infrastructure actually functions like a grid — performing what amounts to a few tasks on a finely-tuned platform to deliver such
  4. This yields the ability for homogeneity in infrastructure and a focus on standardized and globalized power efficient, low cost, and easy-to-replicate components since the problem of expansion beyond a single unit measure of maximal workload capacity is simply a function of scaling out to more of them (or stepping up to one of the next few rungs on the scale-up ladder)

Yup, I just said that AWS is actually a grid whose derivative output is a set of cloud services.

Why does this matter?  Because not all IaaS cloud providers are architected to achieve this — by design — and this has dramatic impact on where hardware and software, leveraged independently or as a total solution, play in the debate.

This is because AWS built and own the entire “CloudOS” stack from customized hardware through to the VMM, management and security planes (much as Google does the same) versus other providers who use what amounts to more generic software offerings from the likes of VMware and lean on API’s and an ecosystem to extend it’s capabilities as well as big iron to power it.  This will yield more customizable offerings that likely won’t scale as highly as AWS.

That’s because they’re not “grids” and were never designed to be.

Many other IaaS providers that have evolved from hosting are building their next-generation offerings from unified fabric and unified computing platforms (so-called “big iron”) which are the furtherest thing from “commodity” hardware you can get.  Further, SaaS and PaaS providers generally tend to do the same based on design goals and business models.  Remember, IaaS is not representative of all things cloud — it’s only one of the service models.

Comparing AWS to most other IaaS cloud providers is a false argument upon which to anchor the hardware versus software debate.

/Hoff

“Open” means more than just an API…Google’s Data Liberation Project Ponies Up

October 16th, 2009 2 comments

datalibThis is chewy goodness.

Short and sweet from the Googleborg via a Webmonkey article titled “Pack Up Your Data and Leave Whenever You Want, It’s the New Rule of the Cloud:

Users should be able to control the data they store in any of Google’s products. Our team’s goal is to make it easier for them to move data in and out.

Bravo.  Brian Fitzpatrick and his team gains major street cred here; they’re up-front about the benefits to both end-users and Google.  Openness and transparency benefit everyone.

Read more about the project at dataliberation.org

/Hoff

Amazon Web Services: It’s Not The Size Of the Ship, But Rather The Motion Of the…

October 16th, 2009 3 comments
From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

From Hoff's Preso: Cloudifornication - Indiscriminate Information Intercourse Involving Internet Infrastructure

Carl Brooks (@eekygeeky) gets some fantastic, thought-provoking interviews.  His recent article wherein he interviewed Peter DeSantis, VP of EC2, Amazon Web Services, was titled: “Amazon would like to remind you where the hype started” is another great example.

However, this article left a bad taste in my mouth and ultimately invites more questions than it answers. Frankly I felt like there was a large amount of hand-waving in DeSantis’ points that glossed over some very important issues related to security issues of late.

DeSantis’ remarks implied, per the title of the article, that to explain the poor handling and continuing lack of AWS’ transparency related to the issues people like me raise,  the customer is to blame due to hype and overly aggressive, misaligned expectations.

In short, it’s not AWS’ fault they’re so awesome, it’s ours.  However, please don’t remind them they said that when they don’t live up to the hype they help perpetuate.

You can read more about that here “Transparency: I Do Not Think That Means What You Think That Means…

I’m going to skip around the article because I do agree with Peter DeSantis on the points he made about the value proposition of AWS which ultimately appear at the end of the article:

“A customer can come into EC2 today and if they have a website that’s designed in a way that’s horizontally scalable, they can run that thing on a single instance; they can use [CloudWatch[] to monitor the various resource constraints and the performance of their site overall; they can use that data with our autoscaling service to automatically scale the number of hosts up or down based on demand so they don’t have to run those things 24/7; they can use our Elastic Load Balancer service to scale the traffic coming into their service and only deliver valid requests.”

“All of which can be done self-service, without talking to anybody, without provisioning large amounts of capacity, without committing to large bandwidth contracts, without reserving large amounts of space in a co-lo facility and to me, that’s a tremendously compelling story over what could be done a couple years ago.”

Completely fair.  Excellent way of communicating the AWS value proposition.  I totally agree.  Let’s keep this definitional firmly in mind as we go on.

Here’s where the story turns into something like a confessional that implies AWS is sadly a victim of their own success:

DeSantis said that the reason that stories like the DDOS on Bitbucket.org (and the non-cloud Sidekick story) is because people have come to expect always-on, easily consumable services.

“People’s expectations have been raised in terms of what they can do with something like EC2. I think people rightfully look at the potential of an environment like this and see the tools, the multi- availability zone, the large inbound transit, the ability to scale out and up and fundamentally assume things should be better. “ he said.

That’s absolutely true. We look at what you offer (and how you offered/described it above) and we set our expectations accordingly.

We do assume that things should be better as that’s how AWS has consistently marketed the service.

You can’t reasonably expect to bitch about people’s perception of the service based on how it’s “sold” and then turn around when something negative happens and suggest that it’s the consumers’ fault for setting their expectational compass with the course you set.

It *is* absolutely fair to suggest that there is no release from not using common sense, not applying good architectural logic to deployment of services on AWS, but it’s also disingenuous to expect much of the target market to whom you are selling understands the caveats here when so much is obfuscated by design.  I understand AWS doesn’t say they protect against every threat, but they also do not say they do not…until something happens where that becomes readily apparent 😉

When everything is great AWS doesn’t go around reminding people that bad things can happen, but when bad things happen it’s because of incorrectly-set expectations?

Here’s where the discussion turns to an interesting example —  the BitBucket DDoS issue.

For instance, DeSantis said it would be trivial to wash out standard DDOS attacks by using clustered server instances in different availability zones.

Okay, but four things come to mind:

  1. Why did it take 15 hours for AWS to recognize the DDoS in the first place? (They didn’t actually “detect” it, the customer did)
  2. Why did the “vulnerability” continue to exist for days afterward?
  3. While using different availability zones makes sense, it’s been suggested that this DDoS attack was internal to EC2, not externally-generated
  4. While it *is* good practice and *does* make sense, “clustered server instances in different avail. zones, costs money

Keep those things in the back of your mind for a moment…

“One of the best defenses against any sort of unanticipated spike is simply having available bandwidth. We have a tremendous amount on inbound transit to each of our regions. We have multiple regions which are geographically distributed and connected to the internet in different ways. As a result of that it doesn’t really take too many instances (in terms of hits) to have a tremendous amount of availability – 2,3,4 instances can really start getting you up to where you can handle 2,3,4,5 Gigabytes per second. Twenty instances is a phenomenal amount of bandwidth transit for a customer.” he said.

So again, here’s where I take issue with this “bandwidth solves all” answer. The solution being proposed by DeSantis here is that a customer should be prepared to launch/scale multiple instances in response to a DoS/DDoS, in effect making it the customers’ problem instead of AWS detecting and squelching it in the first place?

Further, when you think of it, the trickle-down effect of DDoS is potentially good for AWS’ business. If they can absorb massive amounts of traffic, then the more instances you have to scale, the better for them given how they charge.  Also, per my point #3 above, it looks as though the attack was INTERNAL to EC2, so ingress transit bandwidth per region might not have done anything to help here.  It’s unclear to me whether this was a distributed DoS attack at all.

Lori MacVittie wrote a great post on this very thing titled “Putting a Price on Uptime” which basically asks who pays for the results of an attack like this:

A lack of ability in the cloud to distinguish illegitimate from legitimate requests could lead to unanticipated costs in the wake of an attack. How do you put a price on uptime and more importantly, who should pay for it?

This is exactly the point I was raising when I first spoke of Economic Denial Of Sustainability (EDoS) here.  All the things AWS speaks to as solutions cost more money…money which many customers based upon their expectations of AWS’ service, may be unprepared to spend.  They wouldn’t have much better options (if any) if they were hosting it somewhere else, but that’s hardly the point.

I quote back to something I tweeted earlier “The beauty of cloud and infinite scale is that you get the benefits of infinite FAIL”

The largest DDOS attacks now exceed 40Gbps. DeSantis wouldn’t say what AWS’s bandwidth ceiling was but indicated that a shrewd guesser could look at current bandwidth and hosting costs and what AWS made available, and make a good guess.

The tests done here showed the capability  to generate 650 Mbps from a single medium instance that attacked another instance which, per Radim Marek, was using another AWS account in another availability zone.  So if the “largest” DDoS attacks now exceed 40 Gbps” and five EC2 instances can handle 5Gb/s, I’d need 8 instances to absorb an attack of this scale (unknown if this represents a small or large instance.)  Seems simple, right?

Again, this about absorbing bandwidth against these attacks, not preventing them or defending against them.  This is about not only passing the buck by squeezing more of them out of you, the customer.

“ I don’t want to challenge anyone out there, but we are very, very large environment and I think there’s a lot of data out there that will help you make that case.” he said.

Of course you wish to challenge people, that’s the whole point of your arguments, Peter.

How much bandwidth AWS has is only one part of the issue here.  The other is AWS’ ability to respond to such attacks in reasonable timeframes and prevent them in the first place as part of the service.  That’s a huge part of what I expect from a cloud service.

So let’s do what DeSantis says and set our expectations accordingly.

/Hoff

Transparency: I Do Not Think That Means What You Think That Means…

October 12th, 2009 7 comments

vizziniHa ha! You fool! You fell victim to one of the classic blunders – The most famous of which is “never get involved in a cloud war in Asia” – but only slightly less well-known is this: “Never go against Werner when availability is on the line!”

As an outsider, it’s easy to play armchair quarterback, point fingers and criticize something as mind-bogglingly marvelous as something the size and scope of Amazon Web Services.  After all, they make all that complexity disappear under the guise of a simple web interface to deliver value, innovation and computing wonderment the likes of which are really unmatched.

There’s an awful lot riding on Amazon’s success.  They set the pace by which an evolving industry is now measured in terms of features, functionality, service levels, security, cost and the way in which they interact with customers and the community of ecosystem partners.

An interesting set of observations and explanations have come out of recent events related to degraded performance, availability and how these events have been handled.

When something bad happens, there’s really two ways to play things:

  1. Be as open as possible, as quickly as possible and with as much detail as possible, or
  2. Release information only as needed, when pressured and keep root causes and their resolutions as guarded as possible

This, of course, is an over-simplification of the options, complicated by the need for privacy, protection of intellectual property, legal issues, compliance or security requirements.  That’s not really any different than any other sort of service provider or IT department, but then again, Amazon’s Web Services aren’t like any other sort of service provider or IT department.

So when something bad happens, it’s been my experience as a customer (and one that admittedly does not pay for their “extra service”) that sometimes notifications take longer than I’d like, status updates are not as detailed as I might like and root causes sometimes cloaked in the air of the mysterious “network connectivity problem” — a replacement for the old corporate stand-by of “blame the firewall.”  There’s an entire industry cropping up to help you with these sorts of things.

Something like the BitBucket DDoS issue however, is not a simple “network connectivity problem.”  It is, however, a problem which highlights an oft-played pantomime of problem resolution involving any “managed” service being provided by a third party to which you as the customer have limited access at various critical points in the stack.

This outage represents a disconnect in experience versus expectation with how customers perceive the operational underpinnings of AWS’ operations and architecture and forces customers to reconsider how all that abstracted infrastructure actually functions in order to deliver what — regardless of what the ToS say — they want to believe it delivers.  This is that perception versus reality gap I mentioned earlier.  It’s not the redonkulous “end-of-cloud” scenarios parroted by the masses of the great un(cloud)washed, but it’s serious nonetheless.

As an example, BitBucket’s woes of over 20+ hours of downtime due to UDP (and later TCP) DDoS floods led to the well-documented realization that support was inadequate, monitoring insufficient and security defenses lacking — from the perspective of both the customer and AWS*.  The reality is that based on what we *thought* we knew about how AWS functioned to protect against these sorts of things, these attacks should never have wrought the damage they did.  It seems AWS was equally as surprised.

It’s important to note that these were revelations made in near real-time by the customer, not AWS.

Now, this wasn’t a widespread problem, so it’s understandable to a point as to why we didn’t hear a lot from AWS with regards to this issue, but after this all played out, when we look at what has been disclosed publicly by AWS, it appears the issue is still not remedied and despite the promise to do better, a follow-on study seems to suggest that the problem may not yet be well understood or solved by AWS (See: Amazon EC2 vulnerable to UDP flood attacks) (Ed: After I wrote this, I got a notification that this particular issue has been fixed which is indeed, good news.)

Now, releasing details about any vulnerability like this could put many many customers at risk from similar attack, but the lack of transparency  of service and architecture means that we’re left with more questions than answers. How can a customer (like me) today defend themselves against an attack like this in the lurch of not knowing what causes it or how to defend against it? What happens when the next one surfaces?

Can AWS even reliably detect this sort of thing given the “socialist security” implementation of good enough security spread across its constituent customers?

Security by obscurity in cloud cannot last as the gold standard.

This is the interesting part about the black-box abstraction that is Cloud, not just for Amazon, but any massively-scaled service provider; the more abstracted the service, the more dependent upon the provider or third parties we will become to troubleshoot issues and protect our assets.  In many cases, however, it will simply take much more time to resolve issues since visibility and transparency are limited to what the provider chooses or is able to provide.

We’re in the early days still of what customers know to ask about how security is managed in these massively scaled multi-tenant environments and since in some cases we are contractually prevented from exercising tests designed to understand the limits, we’re back to trusting that the provider has it handled…until we determine they don’t.

Put that in your risk management pipe and smoke it.

The network and systems that make up our cloud providers offerings must do a better job in stopping bad things from occurring before they reach our instances and workloads or customers should simply expect that they get what they pay for.  If the provider capabilities do not improve, combined with less visibility and an inability to deploy compensating controls, we’re potentially in a much worse spot than having no protection at all.

This is another opportunity to quietly remind folks about the Audit, Assertion, Assessment and Assurance API (A6) API that is being brought to life; there will hopefully be some exciting news here shortly about this project, but I see A6 as playing a very important role in providing a solution to some of the issues I mention here.  Ready when you are, Amazon.

If only it were so simple and transparent:

Inigo Montoya: You are using Bonetti’s Defense against me, ah?
Man in Black: I thought it fitting considering the rocky terrain.
Inigo Montoya: Naturally, you must suspect me to attack with Capa Ferro?
Man in Black: Naturally… but I find that Thibault cancels out Capa Ferro. Don’t you?
Inigo Montoya: Unless the enemy has studied his Agrippa… which I have.

/Hoff

*It’s only fair to mention that depending upon a single provider for service, no matter how good they may be and not taking advantage of monitoring services (at an extra cost,) is a risk decision that comes with consequences, one of them being longer time to resolution.