There has been a lot of buzz recently around offload engines, specifically DPUs, and it’s not surprising. Multiple companies are jumping on the bandwagon, including VMware with its latest Project Monterey announcement, and the buzz also validates that the industry is shifting toward a less restrictive, server-based approach to storage and networking. But whether you are considering an offload card, a SmartNIC, a GPU, a DPU, or an SPU, one thing is certain: not all are created equal.
The Good: SmartNICs & DPUs Move Workloads Off of General-Purpose CPUs
A DPU, or Data Processing Unit, is an optional device (usually a PCIe card) that offloads networking or storage services that have traditionally run on your server CPU onto the card itself. Why is this beneficial? Consider traditional server-based solutions like Hyper-Converged Infrastructure: running your storage or networking services on the server CPU taxes the server’s CPU, network, and memory resources. The storage services alone can consume up to 30% of the server resources from the moment you power on your system (depending on which data services you enable), and these are resources that could otherwise be used by applications. Another benefit is that some services perform and scale better on purpose-built hardware that is tailored to a specific task, much as is the case with a GPU.
While network cards have been offloading networking services (often referred to as TCP offload) for some time, the offload functionality is now moving up the stack to higher-level and more complex network functions, including security; Pensando is one example. Traditionally, most higher-level network functions run on either a firewall device or the main server CPU. The first approach does not scale, and the second inevitably consumes server resources, as mentioned above, and introduces software dependencies. There is a good argument for moving higher-level networking functionality onto a device that sits in every server and scales with it (vs. a single device) and that doesn’t tax the server CPU (vs. software). In this architecture, the server’s operating system consumes the networking services of the card without being involved in the technical details of managing the network traffic, and the services scale with every node in the environment. The card handles the networking and the execution of network services; the host handles only ingress and egress.
The Bad: Is Your SmartNIC or DPU a True Network or Storage Engine or Just a Programmable Accelerator?
The reality is that most DPUs feature a programmable accelerator chip that can be used to offload certain services or functions from the server CPU. Which services or functions these are depends on the software programmed onto the chip. For storage services, the ecosystem is quite narrow, and few functions are natively available on DPUs today. This means that with DPUs you mostly benefit from individually accelerated functions, but you still run into the same restrictions you face with Hyper-Converged Infrastructure or Software-Defined Storage, because the storage software that delivers the complete set of enterprise-class data services still runs on the server.
Let’s take a step back and look at why that matters. The design principles you apply when offloading only some data services from the server CPU onto a DPU (the hardware characteristics, what the card provides, and what the software on the server provides) are very different from the principles you apply when running all of the enterprise data services on a card in the server. With the latter, you can ensure that zero CPU, networking, or memory resources are consumed on the server and that there are no software dependencies or conflicts with the applications running on it. Getting this right is important, because if you have to run software on your servers to complete your enterprise data services, you have to:
- add (and maintain) software on every server and ensure it works well with the applications for which the server was actually purchased
- buy more servers and software licenses to compensate for the resource overhead on memory and CPU (e.g., a 20% resource overhead requires buying 25% more servers and software licenses)
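The 25% figure follows from the fact that a server with a 20% overhead leaves only 80% of its capacity to applications, so you need 1 / 0.8 = 1.25 times as many servers. A minimal sketch of that arithmetic is below; the function names are illustrative only, and the model assumes the overhead is a fixed share of every server and that capacity scales linearly with server count.

```python
def extra_server_percent(overhead_percent: int) -> float:
    """Percent of additional servers needed to offset a per-server overhead.

    Assumes each server loses `overhead_percent` of its capacity to
    infrastructure software, leaving (100 - overhead_percent) for applications.
    """
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in the range [0, 100)")
    return 100 * overhead_percent / (100 - overhead_percent)


def servers_needed(app_servers: int, overhead_percent: int) -> int:
    """Servers to buy so that application capacity equals `app_servers`
    overhead-free servers (rounded up, using exact integer arithmetic)."""
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in the range [0, 100)")
    usable = 100 - overhead_percent
    # Integer ceiling division: ceil(app_servers * 100 / usable)
    return -(-app_servers * 100 // usable)


print(extra_server_percent(20))  # 25.0
print(servers_needed(100, 20))   # 125
```

Under the same assumptions, the 30% overhead quoted earlier for storage services would work out to roughly 43% more servers.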
When we set out to build Cloud-Defined Storage and architect the Nebulon SPU, we looked at other examples, one being AWS Nitro. AWS Nitro is essentially an offload engine, but it mostly runs everything on the host and then accelerates it on a card. Because we had seen how this still taxes the server, we decided not to build another programmable storage offload engine, but simply a storage engine. Instead of leaving it up to the consumer of a DPU to program a chip to offload some storage services, a true storage engine has all enterprise data services on board, including snapshots, replication, encryption, compression, deduplication, erasure coding, etc. A true enterprise software stack needs a rich tapestry of functions. Developing it takes a sizeable amount of software work, and the card needs a fair amount of hardware resources to support it. But if done correctly, it can work beautifully.
There is also the question of persistence. In an offload engine, be it for networking or storage, the DPU is not responsible for handling persistence in the face of faults such as power loss in the data center. This leaves some of the most challenging pieces of the implementation to the software on the host (often requiring additional specialty hardware, for instance write-intensive drives). The Nebulon SPU approach handles the persistence model transparently, delivering the enterprise-class robustness that is needed.
The Ugly: Project Monterey, Pensando, etc. are great…but no one is mentioning the management layer
As mentioned earlier, using a card in the server helps reduce resource consumption on that server, but it also helps with scaling. Instead of having a few firewall devices or storage arrays, every server now contributes to the overall data services in a scale-out architecture. What you can’t forget is that the number of devices you have to manage increases drastically. Simply put, running networking and enterprise data services fully on a DPU means nothing if there is no simple management solution for all of these devices.
To put this into perspective, most DPU vendors offer only an out-of-band management solution that sits at the enterprise layer, which means your ability to manage your solution is limited to that single layer, whether it be rack-wide or data-center-wide. But what if you have thousands of servers? Or multiple data centers? Or data centers at different locations worldwide? With the data plane or DPU in every server in your data centers, management becomes a lot more complicated. Shouldn’t you be able to expect cloud-scale management across your entire worldwide deployment?
The only way for a customer to get simple cloud-scale management with an on-prem data plane or DPU is to put the control plane in the cloud. But this is hard to do. With the control plane in the cloud, there needs to be a level of resiliency so that, even if you are disconnected from the cloud, your DPU can still handle failures and keep applications online. There is also a need for a new level of security so that intruders cannot tamper with your on-premises infrastructure or data. But there is a great benefit as well. By moving the control plane to the cloud, you can run an administrative model that is enterprise-wide rather than rack-wide or data-center-room-wide, which is what you will often see with traditional management models.
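The resiliency requirement above can be illustrated with a toy model. The sketch below is purely hypothetical and does not describe any vendor’s implementation: `DataPlaneNode`, `handle_failure`, and `flush` are made-up names, and the "cloud" is just a list. It shows only the general idea that the on-prem device decides on recovery locally and queues its reports until the cloud control plane is reachable again; it does not model the security requirement.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DataPlaneNode:
    """Hypothetical on-prem data-plane device managed by a cloud control plane."""
    name: str
    cloud_reachable: bool = True
    pending_events: deque = field(default_factory=deque)   # not yet reported
    reported_events: list = field(default_factory=list)    # stands in for the cloud

    def handle_failure(self, component: str) -> str:
        # Recovery is decided and carried out locally, so it does not
        # depend on the cloud connection being up.
        event = f"{self.name}: recovered from {component} failure"
        self.pending_events.append(event)
        self.flush()
        return event

    def flush(self) -> int:
        """Report queued events if the cloud is reachable; return how many were sent."""
        if not self.cloud_reachable:
            return 0
        sent = 0
        while self.pending_events:
            self.reported_events.append(self.pending_events.popleft())
            sent += 1
        return sent


# Usage: a failure is handled while disconnected, then reported on reconnect.
node = DataPlaneNode("node-1", cloud_reachable=False)
node.handle_failure("drive")
node.cloud_reachable = True
node.flush()
```

The design choice this illustrates is that the cloud holds management and reporting, while anything needed to keep applications online stays on the device.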
Our Cloud-Defined Storage solution, on the other hand, is built around a cloud control plane and provides automated, API-driven, cloud-scale management that customers can use to administer and gain insights into their entire enterprise IT infrastructure. This level of flexibility is a must for enterprises that are looking to deploy a DPU model.
Ultimately, if you’re part of a team that is looking for a server-based solution, we recommend one that runs all services on the card instead of taxing server resources and that enables management at the cloud layer for true at-scale management. This is available today, and we suggest you take a closer look at Cloud-Defined Storage.
Cloud-Defined Storage is a cloud-managed SaaS that automates storage operations, provides self-service compute and storage provisioning, and turns application-server SSDs into enterprise shared or local storage. The data plane consists of Nebulon Services Processing Units (SPUs) in the customer’s application servers and is managed fully in the cloud by a control plane called nebulon ON.