Optimizing GPU-Enabled Data Centers with High Performance Networking

Introduction
The world's data centers are becoming more complex. Organizations need to address increasing demands for scale, cost and flexibility in their infrastructure. This is especially true for hyperscale cloud providers who must support a wide variety of workloads on a single platform.

GPUs are highly flexible compute engines that are rapidly gaining adoption for accelerating a wide variety of applications. They are designed to perform large numbers of computations in parallel, using the CUDA® programming model, which focuses on explicit parallelism and locality. As such, GPUs offer very high performance per watt when used for compute-intensive applications.

GPUs also provide significant benefits over CPUs in terms of memory bandwidth and throughput. Because a GPU carries its own dedicated on-board memory, it can be used effectively even when RAM on the host machine is limited. In this way, GPUs help keep data close to the application that needs it, within the memory hierarchy of your server stack.

GPU-enabled hyperscale cloud data centers, which support a widely varying mix of workloads, must be architected to meet the diverse needs of all workloads efficiently and cost-effectively. To maximize the value of GPUs as part of the data center infrastructure, new high performance networking solutions must be designed with careful consideration for how they will support various application requirements. The following discussion outlines some key considerations when designing systems for optimal performance in GPU-accelerated environments:

SmartNICs can be used to extend offload capabilities from the CPU and reclaim capacity
SmartNICs extend offload capabilities from the CPU to the NIC. This enables data centers to increase their efficiency and performance without adding compute power. With network processing handled on the NIC, the host CPU cores that would otherwise be busy moving traffic are freed to run application workloads.

The benefits are clear: with high performance networking (HPN) or software-defined networking (SDN), workloads can be placed dynamically on different hosts based on current load conditions and user requirements. Each host handles its own workloads while sharing some tasks across the other hosts in the environment, so that nothing becomes a bottleneck during peak periods when many users need access at once.

Offloading to SmartNICs and DPUs can deliver measurable performance improvements for tasks such as packet processing, encryption and decryption, load balancing, and compression over the network, up to 40% in some instances.
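As a rough illustration of what "reclaiming capacity" can mean, the sketch below estimates how many host CPU cores are returned to applications when part of the infrastructure work moves to a SmartNIC or DPU. The function name and every input value are hypothetical and chosen only for illustration; they are not measurements from this article.

```python
def reclaimed_cores(total_cores: int, infra_fraction: float, offload_fraction: float) -> float:
    """Estimate host cores freed by offloading infrastructure work to a SmartNIC/DPU.

    total_cores:      CPU cores in the host
    infra_fraction:   assumed share of CPU time spent on packet processing,
                      encryption, load balancing, and similar tasks
    offload_fraction: assumed share of that work the NIC can take over
    """
    if not (0.0 <= infra_fraction <= 1.0 and 0.0 <= offload_fraction <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    return total_cores * infra_fraction * offload_fraction


if __name__ == "__main__":
    # Hypothetical host: 64 cores, 25% of CPU time spent on networking tasks,
    # 80% of which is assumed to be offloadable.
    freed = reclaimed_cores(64, 0.25, 0.80)
    print(f"Cores returned to applications: {freed:.1f}")
```

The real figures depend heavily on the workload mix and on which functions the NIC actually supports, so the inputs should come from profiling rather than from assumptions like these.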

Leveraging NVIDIA NetQ telemetry
Spectrum switches and ConnectX® SmartNICs also provide other important functions such as network telemetry and flow control (for example, Adaptive Rate Limiting), which give better control over traffic dynamics in data centers.

As networks become increasingly operated by software rather than hardware, active flows can be monitored in real time with telemetry such as NetFlow or sFlow. This allows organizations to gain visibility into what traffic is being sent over their infrastructure and where it is going.

With this visibility, organizations can adjust their policies to optimize performance while maintaining security. For example, if an organization sees a large number of users accessing sensitive data from outside the country, it may want to limit those connections until they fall within its legal compliance guidelines.
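The kind of analysis described above can be sketched in a few lines. The example below aggregates flow records by source address to find the heaviest senders and to flag sources that exceed a byte budget. The FlowRecord type, its field names, the sample addresses, and the limit are all hypothetical; real NetFlow or sFlow exports carry many more fields and would normally be read from a collector rather than from a hard-coded list.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class FlowRecord(NamedTuple):
    """Hypothetical, simplified flow record (not an actual NetFlow/sFlow schema)."""
    src_ip: str
    dst_ip: str
    bytes_sent: int


def bytes_by_source(flows: Iterable[FlowRecord]) -> Counter:
    """Sum the bytes sent by each source address."""
    totals = Counter()
    for flow in flows:
        totals[flow.src_ip] += flow.bytes_sent
    return totals


def top_talkers(flows: Iterable[FlowRecord], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n sources that sent the most bytes, largest first."""
    return bytes_by_source(flows).most_common(n)


def sources_over_limit(flows: Iterable[FlowRecord], limit_bytes: int) -> List[str]:
    """Return, in sorted order, the sources whose total bytes exceed limit_bytes."""
    totals = bytes_by_source(flows)
    return sorted(src for src, total in totals.items() if total > limit_bytes)


# Sample data using documentation-only address ranges.
SAMPLE_FLOWS = [
    FlowRecord("192.0.2.10", "198.51.100.5", 1200),
    FlowRecord("192.0.2.11", "198.51.100.5", 300),
    FlowRecord("192.0.2.10", "198.51.100.7", 800),
    FlowRecord("192.0.2.12", "198.51.100.5", 500),
]

if __name__ == "__main__":
    print(top_talkers(SAMPLE_FLOWS, n=2))
    print(sources_over_limit(SAMPLE_FLOWS, limit_bytes=1000))
```

A policy engine could take the output of sources_over_limit as the list of connections to rate-limit or review, though how that is enforced depends on the switch and NIC features in use.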

The benefits of NVMe over Fabrics
Consulting engineers who work with organizations such as Microsoft Azure have been exploring NVMe over Fabrics (NVMe-oF) as a potential solution for data center storage challenges. NVMe-oF is a protocol that sends commands to NVMe devices over a network, commonly using RDMA. Its RDMA transports include RDMA over Converged Ethernet (RoCE). This allows the data center to use existing Ethernet networks and switches, which helps reduce costs.

Benefits of using NVMe-oF in your data center environment:
Scalability - You can scale from 1GbE server connections all the way up to 100GbE by increasing the number of ports on each switch or host bus adapter (HBA), without changing anything else. This means less management overhead when adding new hosts to your cluster, since the systems involved in data transfers between storage elements, such as disk or flash drives within each server rack, need no reconfiguration or software changes.
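To put the link speeds above in perspective, the following sketch converts Ethernet line rates to theoretical throughput in gigabytes per second and estimates how many drives it would take to fill a link. The per-drive throughput of 3 GB/s is an assumed value for illustration only, the function names are hypothetical, and the calculation ignores protocol and encoding overhead, so real figures will be lower.

```python
import math


def line_rate_gbytes_per_s(link_gbps: float) -> float:
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return link_gbps / 8.0


def drives_to_saturate(link_gbps: float, drive_gbytes_per_s: float) -> int:
    """Smallest number of drives whose combined throughput meets or exceeds the line rate."""
    if drive_gbytes_per_s <= 0:
        raise ValueError("drive throughput must be positive")
    return math.ceil(line_rate_gbytes_per_s(link_gbps) / drive_gbytes_per_s)


if __name__ == "__main__":
    ASSUMED_DRIVE_GBYTES_PER_S = 3.0  # assumption, not a figure from the article
    for speed in (1, 10, 25, 100):
        rate = line_rate_gbytes_per_s(speed)
        drives = drives_to_saturate(speed, ASSUMED_DRIVE_GBYTES_PER_S)
        print(f"{speed}GbE: {rate:.3f} GB/s, about {drives} drive(s) to fill the link")
```

Under these assumptions a single fast drive already exceeds what a 1GbE or 10GbE link can carry, which is one reason higher-speed fabrics matter for networked flash storage.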

Conclusion
The NVIDIA Spectrum Ethernet platform, with the Cumulus Linux ecosystem, ConnectX SmartNICs and BlueField-2 DPUs, is available to help address modern networking challenges. Cumulus Linux, a leading open Linux-based network operating system, also provides a rich set of features and capabilities that help you address many of the challenges associated with data center networking.

The Cumulus integrated network operating system (NOS) is designed to provide an open platform with low-cost extensibility to support today’s needs while keeping pace with future growth requirements. Combined with Spectrum high performance switches, it gives users a substantial increase in bandwidth and throughput with less hardware deployed, optimizing CAPEX and ROI.

The future of data centers is high performance and cost effectiveness. This can be achieved through the use of GPU-accelerated applications, whose benefits include lower power consumption and improved performance.

For more information, attend PNY’s upcoming session at GTC 2022, September 19-22, 2022.
Session ID: A41431: Optimizing GPU Accelerated Data Centers with High Performance Networking
