VENDOR NEWS
BUILDING BLOCKS OF A CLOUD-NATIVE AI DATA CENTER
Opinion piece by Werner Coetzee, Business Development Executive at Data Sciences Corporation. Credits: ITWeb, which published the original article.
The Next Era of Enterprise AI Technologies
The cloud revolution has changed the face of IT operations. Business leaders are looking for platforms that serve a broad user base and deliver non-stop services dynamically while providing high performance at the lowest cost. With HPC and supercomputers becoming more prevalent in commercial use cases and forming part of primary compute environments, new supercomputers must be architected to deliver as close to bare-metal performance as possible, but do so in a multi-tenant fashion. Historically, supercomputers were designed to run single applications (as discussed in our previous article, Super Compute in Enterprise AI). Naturally, it stands to reason that a Cloud-Native Supercomputer is needed to deliver on these demands, which are fast becoming table stakes for AI in the enterprise. Cloud-Native Supercomputer architecture aims to maintain the fastest possible compute, storage, and network performance while meeting cloud services requirements such as least-privilege security policies, workload isolation, data protection, and instant, on-demand AI and HPC service. It seems so simple!
Cloud-Native Supercomputing Key Ingredients (CPU, GPU, DPU)
Cloud-Native computing needs vital ingredients to deliver on those requirements. At the core are CPUs and GPU accelerators. Without getting too technical, CPUs are the brains of any computing platform and handle all the standard management and precision tasks needed to run a server for AI. They have one weakness: they process all their instructions in serial. That's where GPUs come to the rescue. GPUs process tasks in parallel, dealing with instructions that don't have to be as precise but executing them far faster. To illustrate, imagine your favourite fast food that you just ordered on your favourite delivery app. Now let's say that 20 houses in your area also ordered from the same fast-food place. If the delivery service only had one scooter, it would have to deliver the food in serial, like a CPU, going from one house to another. At 5 minutes per delivery, it would take 100 minutes to complete all the deliveries. In contrast, if the delivery service behaved like a GPU, all 20 orders would be delivered simultaneously by 20 scooters, cutting the delivery time by 95 minutes. (This is a hypothetical scenario, please bear with me!) Now that you're wondering what fast food to get, let's get back to the digital world. A CPU with GPU acceleration is key to any AI & HPC success, but these building blocks alone don't meet the Cloud-Native requirements set out earlier. So, to move from traditional supercomputing to Cloud-Native supercomputing, we need the next fundamental building block: the Data Processing Unit (DPU).
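To make the serial-versus-parallel contrast concrete, here is a minimal Python sketch of the delivery analogy. The order count, the scaled-down timings, and the use of a thread pool are illustrative assumptions for the analogy only; real GPU parallelism works very differently under the hood.

    import time
    from concurrent.futures import ThreadPoolExecutor

    ORDERS = 20            # 20 houses ordered at once
    DELIVERY_TIME = 0.05   # stand-in for the 5 minutes per delivery

    def deliver(order_id):
        """Simulate one scooter trip (one unit of work)."""
        time.sleep(DELIVERY_TIME)
        return order_id

    # CPU-style: one scooter delivers every order in serial.
    start = time.perf_counter()
    for order in range(ORDERS):
        deliver(order)
    print(f"serial:   {time.perf_counter() - start:.2f}s")

    # GPU-style: 20 scooters leave at the same time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=ORDERS) as pool:
        list(pool.map(deliver, range(ORDERS)))
    print(f"parallel: {time.perf_counter() - start:.2f}s")

Run as-is, the serial loop takes roughly 20 times longer than the pooled version, mirroring the 100-minute versus 5-minute delivery window in the analogy.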
Dealing with Cloud-Native Tasks
As brilliant as CPUs and GPUs are at delivering the raw performance that AI workloads need, in a traditional supercomputer architecture, Cloud-Native tasks would have to take some of that processing power: shared platform tasks, security and isolation tasks, data protection tasks, and the many other infrastructure management and communication tasks that make multi-tenant supercomputing work. The trade-off in AI processing power is potentially significant. This is where the DPU becomes critical: all the tasks required to enable Cloud-Native supercomputing are offloaded from the CPU and GPU to the DPU for processing. Think about our fast-food example. Imagine the scooter drivers also had to package the orders and decide which scooter each order should go on to reach the correct house. Not only would there be chaos with mixed-up orders, but even if they got it right, it would take time and delay delivery to the customer. The answer is to let the drivers focus on driving, navigating, and getting to each house as quickly as possible, while the tasks of isolating the orders, packaging them correctly, and getting them to the right scooter are offloaded to an order controller at the fast-food place. On the same principle, by migrating the Cloud-Native-specific tasks from the CPU and GPU to the DPU, the architecture ensures maximum performance in executing AI & HPC tasks, together with the security and isolation required for multi-tenant environments, optimising performance and return on investment.
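As a rough mental model only, the short Python sketch below routes infrastructure chores to a "DPU" lane and compute kernels to the CPU/GPU lane. The task names and the two-way split are hypothetical illustrations, not NVIDIA's actual BlueField or DOCA API.

    # Toy model of DPU offload: infrastructure chores leave the
    # compute engines so the CPU/GPU keep full cycles for the workload.
    # Task names below are hypothetical examples, not a real API.
    COMPUTE_TASKS = {"train_step", "matmul", "all_reduce"}
    INFRA_TASKS = {"tenant_isolation", "encryption", "telemetry", "storage_io"}

    def dispatch(task):
        """Route each task to the engine best suited to run it."""
        if task in INFRA_TASKS:
            return "DPU"      # offloaded: security, networking, management
        if task in COMPUTE_TASKS:
            return "CPU/GPU"  # undisturbed AI & HPC processing
        raise ValueError(f"unknown task: {task}")

    for task in ["train_step", "encryption", "all_reduce", "tenant_isolation"]:
        print(f"{task:>16} -> {dispatch(task)}")

The split plays the same role as the order controller in the analogy: the dispatcher, not the drivers, decides who handles the packaging, so the drivers only drive.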
An NVIDIA BlueField DPU supports offloading security, communications, and management tasks to create an efficient cloud-native supercomputer.