NVIDIA A100 for PCIe – Accelerating the Most Important Work of Our Time

Posted by PNY Pro on Fri, Sep 10, 2021 @ 05:55 PM


The NVIDIA® A100 Tensor Core GPU for PCIe (40GB or 80GB version) delivers unprecedented acceleration at every scale to power the world’s highest performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere architecture, NVIDIA A100 is the engine of the NVIDIA data center platform, providing up to 20x higher performance over the prior generation, and can be uniquely partitioned into seven GPU instances to dynamically adjust to shifting demands. The NVIDIA A100 (40GB or 80GB) delivers the world’s fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets.

NVIDIA A100 PCIe Anchors the Most Powerful End-to-End AI and HPC Data Center Platform

NVIDIA A100 is a standout part of the complete NVIDIA data center solution that incorporates building blocks across hardware, networking, software, libraries, and optimized AI models and applications from NGC. Representing the most powerful end-to-end AI and HPC platform for data centers, it allows researchers to rapidly deliver real-world results and deploy solutions into production at scale. Let’s look at some specific examples of how the NVIDIA A100 PCIe is transforming deep learning training and inference, changing the face of high-performance computing (HPC), and remaking high-performance data analytics – all while delivering unprecedented enterprise-ready utilization.

Deep Learning Training and the NVIDIA A100 PCIe

AI models are exploding in complexity as they take on next-level challenges such as conversational AI. Training them requires massive compute power and scalability.

NVIDIA A100 Tensor Cores with TensorFloat-32 (TF32) provide up to 20x higher performance over NVIDIA Volta offerings with zero code changes, plus an additional 2x boost with automatic mixed precision and FP16. When combined with NVIDIA NVLink, PCIe Gen 4, NVIDIA InfiniBand, and the NVIDIA Magnum IO SDK, it’s possible to scale to thousands of NVIDIA A100 PCIe GPUs.
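To give a feel for what TF32 trades away for that speed, here is a minimal sketch in plain Python. It rests on background that is not stated in this post: TF32 keeps the 8-bit exponent of FP32 but only 10 of its 23 mantissa bits. The function name truncate_to_tf32 is hypothetical, and simple truncation is a simplification; real Tensor Core hardware may round differently.

```python
import struct


def truncate_to_tf32(value: float) -> float:
    """Emulate TF32 precision by clearing the 13 lowest mantissa bits of a float32.

    Assumption (not from the post): TF32 = 1 sign bit, 8 exponent bits,
    10 mantissa bits. Truncation stands in for the hardware's rounding.
    """
    # Reinterpret the value as the 32 raw bits of an IEEE 754 single.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Keep sign (1 bit), exponent (8 bits), and the top 10 of 23 mantissa bits.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for sample in (1.0, 1.0 + 2.0 ** -10, 1.0 + 2.0 ** -11, 3.14159265):
        print(f"{sample!r} -> {truncate_to_tf32(sample)!r}")
```

The range of representable magnitudes is unchanged because the exponent is untouched; only fine-grained precision is reduced, which is why many training workloads can tolerate it.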

A training workload like BERT can be solved at scale in under a minute by 2,048 NVIDIA A100 GPUs, a world record for time to solution. For the largest models with massive data tables like deep learning recommendation models (DLRM), NVIDIA A100 80GB reaches up to 1.3 TB of unified memory per node and delivers up to a 3x throughput increase over NVIDIA A100 40GB.

Up to 3x Higher AI Training on Largest Models | DLRM Training

DLRM on HugeCTR framework, precision = FP16 | NVIDIA A100 80 GB batch size = 48 | NVIDIA A100 40 GB batch size = 32 | NVIDIA V100 32 GB batch size = 32

NVIDIA A100 PCIe Sets a New Standard for Deep Learning Inference

NVIDIA A100 PCIe introduces groundbreaking features to optimize inference workloads. It accelerates a full range of precisions, from FP32 to INT4. Multi-Instance GPU (MIG) technology lets multiple networks operate simultaneously on a single NVIDIA A100 for optimal utilization of compute resources. New structural sparsity support delivers up to 2x more performance on top of the NVIDIA A100 PCIe’s other inference performance gains. On state-of-the-art conversational AI models like BERT, NVIDIA A100 accelerates inference throughput up to 249x over CPUs. On the most complex models that are batch-size constrained, like RNN-T for automatic speech recognition, NVIDIA A100 80 GB’s increased memory capacity doubles the size of each MIG instance and delivers up to 1.25x higher throughput than the NVIDIA A100 PCIe 40 GB.
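The post does not spell out what the sparsity pattern looks like. NVIDIA's public Ampere material describes it as 2:4 fine-grained structured sparsity: in every group of four weights, at most two are nonzero. The following is a minimal sketch of that pruning rule in plain Python, assuming magnitude-based selection; the function name prune_2_of_4 is hypothetical, and this is an illustration of the pattern, not NVIDIA's tooling.

```python
from typing import List


def prune_2_of_4(weights: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in each consecutive group of four.

    Assumption (not from the post): the A100 sparsity pattern is 2:4, and
    the weights to keep are chosen by absolute magnitude.
    """
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    dense = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
    print(prune_2_of_4(dense))
```

Because exactly half of each group is zero at known positions, hardware can skip those multiplications, which is where the "up to 2x" figure comes from in principle. Real gains depend on the model and on retraining after pruning.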

NVIDIA’s market-leading performance was demonstrated in MLPerf Inference. A100 brings 20x more performance to further extend that leadership.

Up to 249x Higher AI Inference Performance Than CPUs | BERT Large inference

BERT-Large Inference | CPU only: dual Xeon Gold 6240 at 2.60 GHz, precision = FP32, batch size = 128 | V100: NVIDIA TensorRT (TRT) 7.2, precision = INT8, batch size = 256 | A100 40 GB and 80 GB: batch size = 256, precision = INT8 with sparsity

Up to 1.25x Higher AI Inference Performance Over NVIDIA A100 40 GB | RNN-T Inference: Single Stream

MLPerf 0.7 RNN-T measured with (1/7) MIG slices | Framework: TensorRT 7.2, dataset = LibriSpeech, precision = FP16

NVIDIA A100 PCIe for High-Performance Computing

To unlock next-generation discoveries, scientists look to simulations to better understand the world around us. The NVIDIA A100 introduces double-precision Tensor Cores to deliver the biggest leap in HPC performance since the introduction of GPUs. Combined with 80 GB of the fastest GPU memory, researchers can reduce a 10-hour, double-precision simulation to under four hours on the NVIDIA A100. HPC applications can also leverage TF32 to achieve up to 11x higher throughput for single-precision, dense matrix-multiply operations.
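As a quick check on what the 10-hour example implies, the arithmetic can be sketched as below. The helper name speedup is hypothetical; the 10-hour and 4-hour figures come from the paragraph above.

```python
def speedup(baseline_hours: float, accelerated_hours: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if baseline_hours <= 0 or accelerated_hours <= 0:
        raise ValueError("durations must be positive")
    return baseline_hours / accelerated_hours


if __name__ == "__main__":
    # 10 hours cut to exactly 4 hours is a 2.5x speedup, so "under four
    # hours" corresponds to a speedup of more than 2.5x.
    print(speedup(10.0, 4.0))
```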

For the HPC applications with the largest datasets, NVIDIA A100 80 GB’s additional memory delivers up to a 2x throughput increase with Quantum ESPRESSO, a materials simulation tool. This massive memory and unprecedented memory bandwidth make the A100 80 GB the ideal platform for next-generation workloads.

11x More HPC Performance in Four Years

Geometric mean of application speedups vs. P100. Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT-Large Fine Tuner], Quantum ESPRESSO [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU node with dual-socket CPUs with 4x NVIDIA P100, V100, or A100 GPUs
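The footnote above summarizes many per-application speedups with a geometric mean. A minimal sketch of that calculation follows, using Python's standard statistics module. The application names and speedup values are hypothetical placeholders, not measured results from the benchmark.

```python
from statistics import geometric_mean, mean

# Hypothetical per-application speedups relative to a baseline GPU.
# These are illustrative placeholders, not figures from the benchmark.
hypothetical_speedups = {"app_a": 2.0, "app_b": 8.0, "app_c": 4.0}


def summarize(speedups: dict) -> float:
    """Return the geometric mean of the per-application speedups."""
    return geometric_mean(list(speedups.values()))


if __name__ == "__main__":
    values = list(hypothetical_speedups.values())
    print("geometric mean:", summarize(hypothetical_speedups))
    print("arithmetic mean:", mean(values))
```

The geometric mean is the usual choice for averaging ratios because a single outlier application pulls it up less than it would pull up an arithmetic mean.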

Up to 1.8x Higher Performance for HPC Applications | Quantum ESPRESSO

Quantum ESPRESSO measured using CNT10POR8 dataset, precision = FP64

NVIDIA A100 PCIe High-Performance Data Analytics

Data scientists need to be able to analyze, visualize, and turn massive datasets into insights. But scale-out solutions are often bogged down by datasets scattered across multiple servers.

Accelerated servers with the NVIDIA A100 PCIe provide the needed compute power – along with massive memory, over 2 TB/sec of memory bandwidth, and scalability with NVIDIA NVLink – to handle these demanding workloads. Combined with NVIDIA InfiniBand, NVIDIA Magnum IO, and the RAPIDS suite of open-source libraries, including the RAPIDS Accelerator for Apache Spark for GPU-accelerated data analytics, the NVIDIA data center platform accelerates these huge workloads at unprecedented levels of performance and efficiency.

On a big data analytics benchmark, NVIDIA A100 80 GB PCIe delivered insights 2x faster than the NVIDIA A100 40 GB, making it ideally suited for emerging workloads with exploding dataset sizes.

2x Faster than NVIDIA A100 40GB on Big Data Analytics Benchmark

Big data analytics benchmark | 30 analytical retail queries, ETL, ML, NLP on 10 TB dataset | V100 32 GB, RAPIDS/Dask | A100 40 GB and A100 80 GB, RAPIDS/Dask/BlazingSQL

NVIDIA A100 PCIe Enterprise-Ready Utilization

The NVIDIA A100 with MIG maximizes utilization of GPU-accelerated infrastructure. With MIG, an NVIDIA A100 GPU can be partitioned into as many as seven independent instances, giving multiple users access to GPU acceleration. With NVIDIA A100 40 GB, each MIG instance can be allocated up to 5 GB, and with NVIDIA A100 80 GB’s increased memory capacity, that size is doubled to 10 GB.

MIG works with Kubernetes, containers, and hypervisor-based server virtualization. MIG lets infrastructure managers offer a right-sized GPU with guaranteed quality of service (QoS) for every job, extending the reach of accelerated computing resources to every user.
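For capacity planning, the per-instance figures above (up to seven instances per GPU, 5 GB each on A100 40 GB, 10 GB each on A100 80 GB) can be turned into a rough estimate of how many cards a batch of small jobs needs. This is a sketch under simplifying assumptions: every job gets one smallest-size MIG instance, larger MIG instance sizes are not modeled, and the names SLICE_MEMORY_GB and gpus_needed are hypothetical.

```python
import math

# Figures from the text above: up to seven MIG instances per A100,
# with up to 5 GB each on the 40 GB card and 10 GB each on the 80 GB card.
MAX_INSTANCES_PER_GPU = 7
SLICE_MEMORY_GB = {"A100-40GB": 5, "A100-80GB": 10}


def gpus_needed(card: str, job_memory_gb: float, job_count: int) -> int:
    """Estimate GPUs required when each job runs in one smallest MIG instance.

    Raises ValueError if a job does not fit in the smallest instance, since
    larger instance sizes are outside the scope of this sketch.
    """
    slice_gb = SLICE_MEMORY_GB[card]
    if job_memory_gb > slice_gb:
        raise ValueError(
            f"a {job_memory_gb} GB job does not fit in a {slice_gb} GB instance"
        )
    return math.ceil(job_count / MAX_INSTANCES_PER_GPU)


if __name__ == "__main__":
    print(gpus_needed("A100-40GB", job_memory_gb=4, job_count=20))
    print(gpus_needed("A100-80GB", job_memory_gb=8, job_count=7))
```

A job that needs 8 GB fits in the smallest instance of the 80 GB card but not the 40 GB card, which is the practical meaning of the doubled instance size.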

7x Higher Inference Throughput with Multi-Instance GPU (MIG)

BERT Large Inference | NVIDIA TensorRT (TRT) 7.1 | NVIDIA T4 Tensor Core GPU: TRT 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 1 or 7 MIG instances of 1g.5gb: batch size = 94, precision = INT8 with sparsity

NVIDIA A100 PCIe 40 GB and 80 GB – Redefining the Boundaries of the Possible

NVIDIA A100 PCIe GPUs shatter past barriers and performance expectations to enable the research and services of the future. Learn more about these best-in-class products: the NVIDIA A100 40 GB and the NVIDIA A100 80 GB.


Topics: PNY, NVIDIA, Deep Learning, AI, NVIDIA GPU, PNYPRO, hpc, data center, high performance computing, Multi-Instance GPU, NVIDIA A100, NVIDIA Data Center GPUs, NVIDIA Ampere, NVIDIA A100 80GB, NVIDIA A100 40GB, Quantum Expresso
