Even M(o)ore on Purpose-built UTM Hardware
Alan Shimel made some interesting points today regarding what he described as the impending collision between off-the-shelf, high-powered, general-purpose compute platforms and supplemental "content security hardware acceleration" technologies such as those made by Sensory Networks — and the ultimate lack of a sustainable value proposition for these offload systems:
I can foresee a time in the not too distant future where a quad-core, quad-processor box with PCI Express buses and globs of RAM delivers some eye-popping performance. When it does, the Sensory Networks of the world are in trouble. Yes, there will always be room at the top of the market for the Ferrari types who demand a specialized HW box for their best-of-breed applications.
Like Alan, I think these multi-processor, multi-core systems with fast buses and large RAM banks will deliver an amazing price/performance point for applications such as security — and more specifically, for multi-function security applications such as those used within UTM offerings. For systems that architecturally rely on multi-packet cracking capability to inspect traffic and execute a set of functional security dispositions, the faster you can effect this, the better. Point taken.
One interesting point, however, is that boards like Sensory’s are really deployed as "benign traffic accelerators," not as catch-all filters. As traffic enters a box equipped with one of these cards, the system’s high-throughput potential enables a policy-based decision either to send the traffic in question to the Sensory card for inspection or to pass it through uninspected (accelerate it as benign — sort of like a cut-through or fast path). That "routing" function is done in software, so the faster you can get that decision made, the better your "goodput" will be.
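To make that dispatch idea concrete, here is a minimal sketch (my own illustration, not Sensory's actual API or data structures) of the software decision point: a simple policy table decides per flow whether to cut the traffic through or queue it for deep inspection.

```python
# A minimal sketch of the "benign traffic accelerator" idea (my illustration,
# not Sensory's API): a software policy decides per flow whether to fast-path
# the traffic or hand it to a hypothetical offload card for deep inspection.
FASTPATH_PORTS = {123, 514}          # assumed policy: treat these flows as benign

def classify(flow):
    """Return 'fastpath' or 'inspect' based on a simple policy table."""
    if flow["dst_port"] in FASTPATH_PORTS and flow["bytes"] < 1500:
        return "fastpath"            # accelerate as benign: no payload inspection
    return "inspect"                 # queue for the offload/inspection engine

def dispatch(flow, offload_queue, wire):
    if classify(flow) == "fastpath":
        wire.append(flow)            # straight back out: goodput preserved
    else:
        offload_queue.append(flow)   # deep content inspection happens here

wire, offload = [], []
dispatch({"dst_port": 123, "bytes": 96,   "payload": b"ntp"}, offload, wire)
dispatch({"dst_port": 80,  "bytes": 4096, "payload": b"GET"}, offload, wire)
print(len(wire), "fast-pathed,", len(offload), "queued for inspection")
```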
Will this differential in the ability to make this decision and offload to a card like Sensory’s be eclipsed by the uptick in system CPU speed, multiple cores, and lots of RAM? That depends on one very critical element and its timing: the uptick in network connectivity speeds and feeds. Feed the box with one or more GigE interfaces, and the probability of the answer being "yes" is fairly high.
Feed it with a couple of 10GigE interfaces, and the answer may not be so obvious, even with big, fat buses. The timing and nature of the pattern/expression matching is very important here. Doing line-rate inspection focused on content (not just headers) is a difficult proposition to accomplish without adding latency. Doing it within context, so you don’t dump good traffic based on a false positive or negative, is even harder.
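The arithmetic behind that hunch is easy to sketch. Assuming an illustrative 3 GHz core (my number, not a benchmark), the per-byte cycle budget collapses by an order of magnitude when the pipe goes from GigE to 10GigE:

```python
# Back-of-envelope: per-byte CPU budget at line rate. The 3 GHz clock is an
# illustrative assumption, not a measurement.
CLOCK_HZ = 3.0e9

for label, bits_per_sec in [("1 GigE", 1e9), ("10 GigE", 10e9)]:
    bytes_per_sec = bits_per_sec / 8
    cycles_per_byte = CLOCK_HZ / bytes_per_sec
    print(f"{label}: ~{cycles_per_byte:.1f} cycles per byte on a single core")

# 1 GigE:  ~24 cycles/byte  -- room for header checks plus some content matching
# 10 GigE: ~2.4 cycles/byte -- content inspection in software alone gets very tight
```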
So, along these lines, the one departure point for consideration is that the FPGAs in cards like Sensory’s are amazingly well tuned to provide massively parallel expression/pattern matching with the flexibility of software and the performance benefits of an ASIC. Furthermore, the ability to parallelize these operations and feed them into a large hamster wheel designed to perform these activities not only at high speed but with high accuracy *is* attractive.
The algorithms used in these subsystems are optimized to deliver a combination of scale and accuracy that is not easy to duplicate by just throwing cycles or memory at the problem, because the "performance" of effective pattern matching is as much about accuracy as it is about throughput. Being faster doesn’t equate to being better.
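As a rough, software-only illustration of that point (the signatures are made up, and Python's re alternation stands in for the parallel matchers on cards like Sensory's), a combined automaton walks the payload once no matter how many patterns are loaded, while a naive engine re-walks it once per signature:

```python
# Software-only illustration: a naive engine re-walks the payload once per
# signature, while a combined automaton (Python's re alternation standing in
# for a hardware matcher) walks it once regardless of how many signatures are
# loaded. Signatures and payload are made up.
import re

signatures = [b"EICAR-TEST", b"cmd.exe /c", b"<script>evil"]
payload = b"GET /index.html ... <script>evil() ... cmd.exe /c del ..."

# Naive: one full pass over the payload per signature.
naive_hits = {s for s in signatures if s in payload}

# Combined: all signatures compiled into one pattern, matched in a single pass.
engine = re.compile(b"|".join(re.escape(s) for s in signatures))
combined_hits = set(engine.findall(payload))

assert naive_hits == combined_hits
print(sorted(combined_hits))
```

Python's re is a backtracking engine, so this is only a conceptual stand-in; the hardware does the single-pass, many-patterns trick massively in parallel and at line rate.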
These decisions rely on associative exposures to expressions that are not necessarily orthogonal in nature (an orthogonal classification is one in which no item is a member of more than one group; that is, the classifications are mutually exclusive — thanks, Wikipedia!) Depending upon what you’re looking for and where you find it, you could have multiple classifications and matches — you need to decide (and quickly) whether it’s "bad" or "good" and how the results relate to one another.
What I mean is that, within context, you could have multiple matches that seem unrelated, so flows may require iterative inspection (of the entire byte-stream or an offset) based upon what you’re looking for and what you find when you do — and then be re-subjected to inspection somewhere else in the byte-stream.
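A toy sketch of what that non-orthogonal, iterative matching might look like (the categories, patterns, and trigger logic below are illustrative assumptions, not anyone's real taxonomy):

```python
# Toy sketch of non-orthogonal classification and iterative inspection. The
# categories, patterns, and trigger logic are illustrative assumptions only.
PATTERNS = {
    b"%PDF":    {"document"},
    b"MZ":      {"executable", "archive-member"},   # one match, two classifications
    b"<script": {"active-content"},
}

def scan(stream, start=0):
    """Return (offset, pattern, categories) for every pattern found from `start`."""
    hits = []
    for pat, cats in PATTERNS.items():
        off = stream.find(pat, start)
        if off != -1:
            hits.append((off, pat, cats))
    return hits

def inspect(stream):
    verdicts = set()
    for off, pat, cats in sorted(scan(stream), key=lambda h: h[0]):
        verdicts |= cats
        # Context-dependent re-inspection: an executable header inside something
        # that already classified as a document sends us back into the stream.
        if "executable" in cats and "document" in verdicts:
            for _, _, more in scan(stream, off + len(pat)):
                verdicts |= more
    return verdicts

print(sorted(inspect(b"%PDF-1.4 ...... MZ\x90\x00 ... <script>alert(1)")))
```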
Depending upon how well you have architected the software to distribute, dedicate, and virtualize these sorts of functions across multiple processors and cores in a general-purpose hardware solution driven by your security software, you might decide that having purpose-built hardware as an assist is a good thing: it provides context and accuracy and lets the main CPU(s) do what they do best.
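For instance, a minimal sketch of that distribution problem, under assumed names and with a hypothetical offload hook, might hash each flow's 5-tuple to a worker so per-flow state stays on one core, while heavy pattern matching gets handed off where it makes sense:

```python
# A minimal sketch, under assumed names, of spreading per-flow inspection across
# general-purpose cores: each flow is hashed to one worker so its state stays
# local, and inspect_flow() marks where a purpose-built offload engine could
# be slotted in as an assist.
import zlib
from concurrent.futures import ProcessPoolExecutor

WORKERS = 4

def worker_for(five_tuple):
    """Pin a flow to a worker with a stable hash of its 5-tuple."""
    return zlib.crc32(repr(five_tuple).encode()) % WORKERS

def inspect_flow(job):
    five_tuple, payload = job
    # Hypothetical decision point: hand heavy pattern matching to an offload
    # card here, or keep it on this core if the payload is small.
    return five_tuple, ("offload" if len(payload) > 1024 else "cpu")

if __name__ == "__main__":
    flows = [
        (("10.0.0.1", "10.0.0.9", 40000, 80, "tcp"), b"GET / HTTP/1.1\r\n"),
        (("10.0.0.2", "10.0.0.9", 40001, 443, "tcp"), b"\x16\x03\x01" + b"\x00" * 4096),
    ]
    buckets = {i: [] for i in range(WORKERS)}
    for job in flows:
        buckets[worker_for(job[0])].append(job)          # group jobs by worker
    jobs = [job for bucket in buckets.values() for job in bucket]
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        for five_tuple, path in pool.map(inspect_flow, jobs):
            print(five_tuple, "->", path)
```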
Switching gears…
All that being said, signature-only based inspection is dead. If, in the near future, you don’t have behavioral analysis/behavioral anomaly capabilities to help provide context in addition to (and in parallel with) signature matching, all the cycles in the world aren’t going to help…and looking at headers and NetFlow data alone ain’t going to cut it. We’re going to see some very intensive packet-cracking/payload and protocol BA functions rise to the surface shortly. The algorithms and hardware required to take multi-dimensional problem spaces and reduce them to two dimensions (anomaly/not an anomaly) will pose an additional challenge for general-purpose platforms. Just look at all the IPS vendors who traditionally provide signature matching scurrying to add NBA/NBAD. It will happen in the UTM world, too.
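A toy sketch of what that pairing looks like (signatures, features, baseline, and threshold are all made up for illustration): signature matching and a simple behavioral z-score run side by side, and the multi-dimensional result collapses into a single anomaly/no-anomaly disposition.

```python
# Toy sketch: signature matching and a behavioral check run side by side, then
# the multi-dimensional result collapses to one bit (anomaly / not an anomaly).
# Signatures, features, baseline, and threshold are illustrative assumptions.
import statistics

SIGNATURES = [b"cmd.exe /c", b"<script>evil"]

def signature_hit(payload):
    return any(sig in payload for sig in SIGNATURES)

def behavioral_score(flow, baseline):
    """Simple z-score of bytes-per-second against a learned baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline) or 1.0
    return abs(flow["bytes_per_sec"] - mean) / stdev

def verdict(flow, payload, baseline, z_threshold=3.0):
    # Two dimensions (signature, behavior) reduced to a single disposition.
    return signature_hit(payload) or behavioral_score(flow, baseline) > z_threshold

baseline = [1200, 1500, 1100, 1300, 1250]                             # assumed history
print(verdict({"bytes_per_sec": 90000}, b"GET /", baseline))          # True: behavioral
print(verdict({"bytes_per_sec": 1250}, b"cmd.exe /c del", baseline))  # True: signature
print(verdict({"bytes_per_sec": 1250}, b"GET /", baseline))           # False
```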
This isn’t just a high-end problem, either. I am sure someone’s going to say "the SMB doesn’t need or can’t afford BA or massively parallel pattern matching" and that "good enough is good enough" in terms of security for them — but from a pure security perspective, I disagree. Need and afford are two different issues.
Using the summary argument regarding Moore’s law: as the performance of systems rises and cost asymptotically approaches zero, accuracy and context become the criteria for purchase. But as I pointed out, speed does not necessarily equal accuracy.
I think you’ll continue to see excellent high-performance, low-cost general-purpose platforms delivering innovative software-driven solutions, assisted by flexible, scalable, high-performance subsystems designed to provide functional superiority via offload in one or more areas.
/Chris
Chris, excellent points. Taking what you said and going the next step: what can the network do to provide some of the behavior-based analysis (NetFlow, sFlow, CLEAR-Flow), and to siphon off types of traffic to lighten the load on packet inspection? For instance, Extreme Networks has an interesting prototype that works with ISS, allowing Proventia to work on Extreme's 10 Gbps chassis. They do it by sending only the 1 Gbps of traffic that matters to the Proventia box. Also, some of the new packet inspection products seem pretty hot, though I don't know enough about them.
Chris, great article. I commented on Alan's blog:
"I think you are a little confused by the advances in computer architecture. PCI Express will provide a high performance point to point bus inside a standard pentium architecture – which incidentally will become fantastic for interfacing the Pentium with a whole range of coprocessor chips on cards at high bandwidth (rather than having to drop them on the motherboard). Quad-core will mean that applications are parallelized, meaning they will be able to better exploit hardware parallelism on those coprocessors (like Sensory's and others). However bulk, commodity DRAM will always suffer at a large latency penalty vs SRAM, RLDRAM, Network DRAM etc, meaning that coprocessors that work in the dataplane on network traffic interfacing to those memories (like Sensory) will be a great solution vs. custom network processor architectures (or the Pentium by itself). "
Fundamentally, there are several problems occurring: (1) networking speeds are getting faster (providing ever-decreasing hard bounds on latency); (2) computational complexity per byte is skyrocketing, as you need to do more work to decode protocols, unpack zip files, and scan content as more and more layers are added at the application level; (3) the size of these databases is ever increasing, as every day more and more viruses, worms, etc. are deployed; and finally (4) convergence is driving multiple applications to be deployed on each node, again pushing up complexity per byte massively.
Regarding your comment:
"All that being said, signature-only based inspection is dead." A couple of points: what technologies like Sensory's do fundamentally is content decoding and feature extraction. While polymorphic viruses, etc., are increasing, at some fundamental level some features need to be detected and some decoding needs to be done. Sensory's technology does the heavy lifting here, even for polymorphic viruses. Even if it can't for some reason, intelligent decisions can be made on which traffic is likely or unlikely to have a polymorphic virus. The same goes for anomaly detection: at some point, features (whatever they are) need to be extracted from the network in order to pass into a classifier. SN's hardware also provides support for classifier algorithms, BTW.
You are spot on when you say price/performance is not just the domain of the high end and that the low end needs it as well. We've actually found that in many circumstances it's needed more at the low end, because the bang for buck of a Pentium decreases rapidly (and disproportionately so) as you spend less: the way the silicon gets cheaper is to remove or downscale features like the L2 cache (which accounts for a large part of the cost of the silicon wafer).