Slide Overview
Session 13 (July 11): HPC + AI
General-purpose computation on GPUs (GPU computing), which began in HPC, is now also used in the AI domain, and AI technology in turn contributes to an interplay that accelerates HPC. Lecture 1 explains the GPU architecture underlying GPU computing through its programming environments, CUDA and OpenACC. Lecture 2 explains, from the viewpoint of GPU computing, how HPC and AI are currently being used to advance each other.
R-CCS 計算科学研究推進室
HPC + AI AKIRA NARUSE, DEVELOPER TECHNOLOGY, NVIDIA
AGENDA: What is HPC + AI? / HPC for AI: distributed training of large-scale models / AI for HPC: applying AI to scientific computing
What is HPC + AI?
HPC + AI = "HPC for AI" + "AI for HPC": two directions
• HPC for AI: apply technologies long used in HPC to the AI field, e.g., techniques for distributed training as deep learning models grow larger, and fast matrix-multiplication methods
• AI for HPC: use AI to make scientific computing faster and more accurate, e.g., using AI as a fast simulator, or building AI-based approximate models for problems that have been difficult to model
HPC for AI: GTC talks on accelerating the training of large-scale DL models
• DeepSpeed: DL training and inference optimization library towards speed and scale (2021)
• Training and Profiling Large Scale Models with PyTorch (2022)
• Scaling Large Models with PAX on GPUs (2023)
• Training DL Models at Scale: How NCCL Enables Best Performance on AI Data Center Networks (2024)
AI for HPC: AI that accelerates simulation, and AI as an approximate model
HPC FOR AI: Distributed Training of Large-Scale Models
HPC for AI: distributed training
• Training huge language models on supercomputers
• Related GTC talks: DeepSpeed: DL training and inference optimization library towards speed and scale (2021); Training and Profiling Large Scale Models with PyTorch (2022); Scaling Large Models with PAX on GPUs (2023); Training DL Models at Scale: How NCCL Enables Best Performance on AI Data Center Networks (2024)
• Data parallelism alone is not enough; model parallelism is required: pipeline parallel, tensor parallel, FSDP (ZeRO-3)
Trends in Language Models
• Training large transformer-based language models has been one of the best ways to achieve the SOTA in NLP applications
• Model size has increased by almost an order of magnitude every year from 2018 to 2022: Megatron-Turing NLG (530B), GPT-4 (>1T)
• The memory required for parameters is roughly model size * 16 bytes (not including activation memory); a back-of-envelope check follows
  • More than 2.8 TB for GPT-3
  • Simple data parallelism will not work; model parallelism is a MUST
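As a quick check of the 16-bytes-per-parameter figure, a minimal sketch. The per-parameter breakdown in the comment is a common estimate for mixed-precision training with Adam, not something stated on the slide:

```python
# Back-of-envelope memory estimate for training (parameters + optimizer
# state only; activation memory excluded). A common mixed-precision
# Adam breakdown: 2 B FP16 weights + 2 B FP16 gradients
# + 4 B FP32 master weights + 4 B momentum + 4 B variance = 16 B/param.
BYTES_PER_PARAM = 16

for name, n_params in [("GPT-3", 175e9), ("Megatron-Turing NLG", 530e9)]:
    tib = n_params * BYTES_PER_PARAM / 2**40
    print(f"{name}: {tib:.1f} TiB for parameters and optimizer state")
# GPT-3: ~2.5 TiB (~2.8 TB decimal) -- far beyond a single 80 GB GPU,
# hence model parallelism is a must.
```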
Model Parallel Approaches
• Pipeline Parallel (Inter-Layer): split contiguous sets of layers across multiple devices, e.g., layers 0-2 and layers 3-5 on different devices
• Tensor Parallel (Intra-Layer): split individual layers across multiple devices; both devices compute different parts of layers 0-5
[Figure: layers L0-L5 partitioned across devices in each scheme]
Tradeoffs of Pipeline/Tensor Parallel
• Pipeline Parallel (Inter-Layer): less communication intensive; generalizable to almost all DNNs; can be hard to load-balance computation across workers; can require large batch sizes for high throughput
• Tensor Parallel (Intra-Layer): works great for large matrices; simple to implement; no restriction on batch size; communication intensive
• These two approaches can be used together
Approach for Transformer Models
• Tensor parallel inside a node: fast data exchange between GPUs over high-bandwidth NVLink
• Pipeline parallel between nodes: direct data exchange between GPUs without host memory, even for inter-node communication, using NCCL
• Data parallel to further scale training and reduce training time
Pipeline Parallel: Micro-batches
• Naive pipelining leaves lots of pipeline bubbles: no performance improvement
• Split the per-instance batch into smaller micro-batches (e.g., 8 micro-batches) to reduce pipeline bubbles, as the sketch below illustrates
[Figure: forward (F) / backward (B) timeline across Devices 1-4, naive vs. micro-batched]
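A minimal sketch of the standard bubble-fraction estimate for a GPipe-style schedule with p stages and m micro-batches; the formula (p-1)/(m+p-1) is the usual textbook estimate, not taken from the slide itself:

```python
# Pipeline bubble fraction for a GPipe-style schedule: with p stages
# and m micro-batches, each stage idles for (p - 1) micro-batch slots
# out of (m + p - 1) total slots.
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    return (p_stages - 1) / (m_microbatches + p_stages - 1)

for m in [1, 4, 8, 32]:
    print(f"4 stages, {m:2d} micro-batches -> "
          f"{bubble_fraction(4, m):.0%} of time in bubbles")
# m = 1 is the naive schedule (75% idle with 4 stages); more
# micro-batches shrink the bubbles, as on the slide.
```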
Pipeline Parallel: Memory Usage Reduction
• Minimize the number of active micro-batches to reduce memory usage
• Activation re-computation can also be used to reduce memory usage
Pipeline Parallel: Interleaving
• Pipeline bubbles can be further reduced
• Note: inter-node communication increases
Tensor Parallel: Transformer Layer
• MLP block: Y = GeLU(XA), Z = Dropout(YB)
• Attention block: Y = Self-Attention(X) (Q, K, V projections XQ, XK, XV with Softmax), Z = Dropout(YB)
[Figure: dataflow diagrams of the MLP and attention blocks]
Tensor Parallel: Column and Row Partitioning
• Base case: Y = XW (input X, weight/parameter W, output Y)
• Column partitioning: W = [W1, W2]; Y1 = XW1, Y2 = XW2; Y = [Y1, Y2] (all-gather)
• Row partitioning: X = [X1, X2] and W split by rows into W1, W2; Y1 = X1W1, Y2 = X2W2; Y = Y1 + Y2 (all-reduce)
• Both identities are verified numerically in the sketch below
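A minimal NumPy sketch checking the two partitionings of Y = XW (shapes chosen arbitrarily; the concatenation and sum stand in for the all-gather and all-reduce):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # input
W = rng.standard_normal((6, 8))   # weight (parameter)
Y = X @ W                         # reference: Y = XW

# Column partitioning: W = [W1, W2]; each GPU computes a slice of Y,
# and the slices are concatenated (the all-gather step).
W1, W2 = W[:, :4], W[:, 4:]
Y_col = np.concatenate([X @ W1, X @ W2], axis=1)

# Row partitioning: X = [X1, X2] and W split by rows; each GPU computes
# a partial product, and the partials are summed (the all-reduce step).
X1, X2 = X[:, :3], X[:, 3:]
Wr1, Wr2 = W[:3, :], W[3:, :]
Y_row = X1 @ Wr1 + X2 @ Wr2

assert np.allclose(Y, Y_col) and np.allclose(Y, Y_row)
```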
Tensor Parallel: Column and Row Partitioning (per-GPU view)
[Figure: GPU-1 and GPU-2 each hold one partition; column partitioning ends with an all-gather of Y1 and Y2, row partitioning with an all-reduce computing Y1 + Y2]
Communication Needed for All Linear Layers?
• MLP block: Y = GeLU(XA), Z = Dropout(YB): naively, a communication step after each linear layer
• Attention block: Y = Self-Attention(X), Z = Dropout(YB): likewise
[Figure: dataflow with a communication step inserted after every linear layer]
Tensor Parallel: What If There Are Two Consecutive Linear Layers?
• Y = ReLU(XA), Z = YB
• Column-partition A and row-partition B: Y1 = ReLU(XA1), Y2 = ReLU(XA2); Z1 = Y1B1, Z2 = Y2B2; Z = Z1 + Z2 (all-reduce)
• No communication is needed between the two layers; only one all-reduce at the end, as the sketch below verifies
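Why a single all-reduce suffices: ReLU acts elementwise, so column-partitioning A keeps Y1 and Y2 fully independent, and row-partitioning B turns the second layer into partial sums. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))
A = rng.standard_normal((6, 8))
B = rng.standard_normal((8, 5))
relu = lambda t: np.maximum(t, 0.0)
Z = relu(X @ A) @ B               # reference: Z = ReLU(XA) B

# GPU-i holds column slice A_i and row slice B_i; no communication
# between the two layers, one all-reduce at the very end.
A1, A2 = A[:, :4], A[:, 4:]
B1, B2 = B[:4, :], B[4:, :]
Z1 = relu(X @ A1) @ B1            # computed entirely on GPU-1
Z2 = relu(X @ A2) @ B2            # computed entirely on GPU-2
assert np.allclose(Z, Z1 + Z2)    # the all-reduce: Z = Z1 + Z2
```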
Partitioning the Transformer Layer
• MLP block (Y = GeLU(XA), Z = Dropout(YB)): column partitioning for A, row partitioning for B, one communication step after the block
• Attention block (Y = Self-Attention(X), Z = Dropout(YB)): same pattern, one communication step after the block
[Figure: partitioned dataflow of both blocks]
3D Parallelism: Pipeline, Tensor, and Data
[Figure: 8 GPUs (GPU-0 through GPU-7) arranged into four model-parallel groups and two data-parallel groups; one possible group construction is sketched below]
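A minimal sketch of how ranks might be arranged into tensor-, pipeline-, and data-parallel groups. The exact layout on the slide's 8-GPU figure is illustrative; this enumeration follows one conventional rank ordering (frameworks such as Megatron-LM make the layout configurable), not the author's code:

```python
# Enumerate process groups for 8 GPUs with tensor-parallel size 2,
# pipeline-parallel size 2, and data-parallel size 2 (2 x 2 x 2 = 8).
TP, PP, DP = 2, 2, 2

tensor_groups   = [[dp * TP * PP + pp * TP + tp for tp in range(TP)]
                   for dp in range(DP) for pp in range(PP)]
pipeline_groups = [[dp * TP * PP + pp * TP + tp for pp in range(PP)]
                   for dp in range(DP) for tp in range(TP)]
data_groups     = [[dp * TP * PP + pp * TP + tp for dp in range(DP)]
                   for pp in range(PP) for tp in range(TP)]

print("tensor  :", tensor_groups)    # [[0, 1], [2, 3], [4, 5], [6, 7]]
print("pipeline:", pipeline_groups)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print("data    :", data_groups)      # [[0, 4], [1, 5], [2, 6], [3, 7]]
```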
FSDP (Fully Sharded Data Parallel), ZeRO-3
• FSDP is a mixture of model and data parallelism
• For N GPUs, the model parameters are split N ways, just as the input data is split N ways without duplication in data parallelism
[Figure: data parallel vs. ZeRO-3 memory layout; see https://www.deepspeed.ai/2021/03/07/zero3-offload.html]
FSDP: Linear Layer Processing (Y = XW)
• Data parallel: every GPU holds the full parameter W; GPU-1 computes Y1 = X1W, GPU-2 computes Y2 = X2W
• FSDP: each GPU holds only a shard of W (W1 on GPU-1, W2 on GPU-2), so Y cannot be computed as-is
• Communicate among the GPUs (all-gather) so that every GPU temporarily holds the full W, then compute Y
• Once Y has been computed, each GPU discards everything except its own shard of W
• An all-gather like this is required for each linear layer; a sketch of the flow follows
[Figure: data parallel vs. FSDP for one linear layer, built up over four slides]
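A minimal single-process sketch of the FSDP flow for one linear layer, with NumPy concatenation standing in for the all-gather. Real implementations (e.g., PyTorch FSDP, DeepSpeed ZeRO-3) overlap this communication with compute:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((6, 8))            # full parameter, for reference
shards = np.split(W, 2, axis=0)            # W1 on GPU-1, W2 on GPU-2
X = [rng.standard_normal((4, 6)) for _ in range(2)]  # per-GPU data batches

def forward_on_gpu(rank: int) -> np.ndarray:
    # 1) all-gather: every GPU temporarily reconstructs the full W
    W_full = np.concatenate(shards, axis=0)
    # 2) compute with this GPU's slice of the data batch
    Y = X[rank] @ W_full
    # 3) W_full is discarded here; only shards[rank] stays resident
    return Y

Y1, Y2 = forward_on_gpu(0), forward_on_gpu(1)
assert np.allclose(Y1, X[0] @ W) and np.allclose(Y2, X[1] @ W)
# One all-gather like this is needed for every linear layer.
```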
3D Parallelism vs. FSDP: Which Is Better?
• Advantages of FSDP: easier to use; no need to make the number of layers a multiple of the number of partitions, as in pipeline parallelism
• Disadvantages of FSDP: requires all-gather communication for all parameters; inter-node communication volume tends to be high; increasing the batch size can reduce relative communication costs, but doing so raises convergence concerns
• What is communicated: 3D parallelism communicates activations; FSDP communicates parameters
NVIDIA TENSOR CORES: Matrix-Multiply Accelerator
NVIDIA H100: 528 Tensor Cores across the chip; 132 Streaming Multiprocessors (SMs) per chip
Matrix Multiplication with Tensor Cores
• D = A*B + C: a matrix product can be decomposed into sub-matrix products, i.e., a collection of A'*B' + C'
• Each sub-matrix product is assigned to a Tensor Core
[Figure: tiles A', B', C' within A, B, and C/D]
H100 Tensor Cores
• 4 Tensor Cores per SM
• The full matrix product is decomposed into basic blocks of (m, n, k) = (8, 8, 16) (for FP16/BF16), and each sub-matrix product is assigned to a Tensor Core; see the sketch below
• Supports FP8, aimed at language models: Transformer Engine
• Tensor Memory Accelerator (TMA)
[Figure: basic block with 8x16 A', 16x8 B', and 8x8 C'/D']
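A minimal NumPy sketch of the block decomposition using the slide's FP16/BF16 basic block (m, n, k) = (8, 8, 16). On hardware each block would be handed to a Tensor Core; here they simply run sequentially:

```python
import numpy as np

M, N, K = 32, 24, 64              # full problem size (multiples of the block)
m, n, k = 8, 8, 16                # H100 basic block for FP16/BF16
rng = np.random.default_rng(3)
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
C = rng.standard_normal((M, N))

# D = A*B + C, decomposed into (m, n, k) sub-matrix products. Each
# (i, j, p) block below is an independent A'*B' + C' update of the
# kind the hardware assigns to a Tensor Core.
D = C.copy()
for i in range(0, M, m):
    for j in range(0, N, n):
        for p in range(0, K, k):
            D[i:i+m, j:j+n] += A[i:i+m, p:p+k] @ B[p:p+k, j:j+n]

assert np.allclose(D, A @ B + C)
```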
FP8 and the Transformer Engine: Adaptive Scaling for FP8
• Two FP8 formats: E5M2 emphasizes range, E4M3 emphasizes precision
• Goal: train Transformer models with no loss of accuracy
• Chooses the appropriate output data type for the next layer
• Monitors Tensor Core results and, based on them, scales outputs so that the FP8 range is fully utilized; a minimal sketch of such amax-based scaling follows
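A minimal sketch of amax-based scaling into the FP8 range (448 and 57344 are the standard E4M3/E5M2 maximum finite values). This shows only simple symmetric scaling simulated in FP32, since NumPy has no FP8 type; the Transformer Engine's actual policy, which tracks an amax history per tensor, is more elaborate, and the helper name here is hypothetical:

```python
import numpy as np

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # max finite values per format

def fp8_scale(x: np.ndarray, fmt: str = "e4m3") -> tuple[np.ndarray, float]:
    """Scale a tensor so its largest magnitude maps onto the FP8 range.

    Returns the scaled tensor (actual FP8 quantization is only
    simulated here) and the scale needed to undo it later.
    """
    amax = np.abs(x).max()               # monitored from compute results
    scale = FP8_MAX[fmt] / max(amax, 1e-12)
    return x * scale, scale

acts = np.random.default_rng(4).standard_normal((4, 4)) * 0.01
scaled, s = fp8_scale(acts, "e4m3")      # small values spread over FP8 range
recovered = scaled / s                   # descale after the Tensor Core op
assert np.allclose(recovered, acts)
```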
Tensor Memory Accelerator (TMA): Asynchronous Memory Copy
• Removes the problem of CUDA threads being consumed by data copies between SMEM and L2
• A100 (LDGSTS): threads compute addresses and spin while loads from L2 or global memory fill SMEM
• H100 (TMA): the TMA unit computes addresses; threads sleep until the data (with a transaction count) has been loaded into SMEM
[Figure: SM block diagrams comparing the A100 and H100 copy paths]
Peak Performance of Tensor Cores (matrix-multiply accelerator), in TFLOPS:

            Gen 1 (2017)   Gen 3 (2020)      Gen 4 (2022)
            V100           A100              H100
            dense          dense / sparse    dense / sparse
FP32 FMA    15.7           19.5 / NA         67   / NA
FP64        NA             19.5 / NA         67   / NA
TF32        NA             156  / 312        495  / 990
FP16/BF16   125 (*)        312  / 624        990  / 1980
FP8         NA             NA   / NA         1980 / 3960

(*) FP16 only
AI FOR HPC: Applying AI to Scientific Computing
AI for HPC: AI as a fast simulator, and AI as an approximate model
AI for HPC
AI FOR HPC: Another classification axis: what is being modeled (fully data driven / inductive bias / physics constrained)
• Developing Digital Twins for Weather, Climate, and Energy [S41823]
• Accelerating Simulation Process Using GPUs and Reliable Neural Networks [S42404]
• Case Study on Developing Digital Twins for the Power Industry using Modulus and Omniverse [S41671]
• Fourier Neural Operators and Transformers for Extreme Weather and Climate Prediction [S41936]
• Bringing Rain to the Subseasonal Forecasting Desert with Deep Learning Weather Prediction [S41170]
• Accelerating a 3D Conditional Generative Adversarial Network for Seismic Attenuation Compensation on a Multi-GPU Node [S41095]
• Scalable Data-Driven Global Weather Predictions at High Spatial and Temporal Resolutions [S41019]
• Accelerating End-to-end Deep Learning for Particle Reconstruction using CMS Open Data at CERN [S41394]
• OpenFold: Democratizing Access to Predicting and Modeling Protein Structures [S41633]
[Figure: each talk placed along the fully-data-driven / inductive-bias / physics-constrained axis]
AI FOR HPC: From data-driven approaches to simulation-related topics
• Scalable Data-Driven Global Weather Predictions at High Spatial and Temporal Resolutions [S41019] (fully data driven): U-Net-based prediction of precipitation and sea-surface temperature
• Accelerating Simulation Process Using GPUs and Reliable Neural Networks [S42404] (inductive bias): a rotation- and translation-independent simulator using a Graph Neural Network (GNN)
• Accelerating a 3D Conditional Generative Adversarial Network for Seismic Attenuation Compensation on a Multi-GPU Node [S41095] (fully data driven): attempts attenuation compensation of seismic survey images using Pix2Pix
• Fourier Neural Operators and Transformers for Extreme Weather and Climate Prediction [S41936] (inductive bias): discusses open issues in current weather prediction and work toward resolution-independent learning built around the Fourier Neural Operator
AI FOR HPC: From data-driven approaches to simulation-related topics. Next: Scalable Data-Driven Global Weather Predictions at High Spatial and Temporal Resolutions [S41019] (fully data driven)
SCALABLE DATA-DRIVEN GLOBAL WEATHER PREDICTIONS AT HIGH SPATIAL AND TEMPORAL RESOLUTIONS [S41019]: The current state of weather prediction and how AI can be used
SCALABLE DATA-DRIVEN GLOBAL WEATHER PREDICTIONS AT HIGH SPATIAL AND TEMPORAL RESOLUTIONS [S41019]: The method used for precipitation prediction (U-Net) and related details
SCALABLE DATA-DRIVEN GLOBAL WEATHER PREDICTIONS AT HIGH SPATIAL AND TEMPORAL RESOLUTIONS [S41019]: Evaluation results and a study of combining with an LSTM
SCALABLE DATA-DRIVEN GLOBAL WEATHER PREDICTIONS AT HIGH SPATIAL AND TEMPORAL RESOLUTIONS [S41019]: Initial results on applying the approach to sea-surface temperature prediction
SCALABLE DATA-DRIVEN GLOBAL WEATHER PREDICTIONS AT HIGH SPATIAL AND TEMPORAL RESOLUTIONS [S41019]: Paper and implementation
AI FOR HPC: From data-driven approaches to simulation-related topics. Next: Accelerating Simulation Process Using GPUs and Reliable Neural Networks [S42404] (inductive bias)
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: Problems with current simulations and a proposed solution
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: monolish, a numerical computing library that runs on both CPUs and GPUs
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: How to simulate objects of diverse shapes and situations involving rotation and translation
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: A solution based on Graph Neural Networks
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: Scope of applicability and evaluation results
ACCELERATING SIMULATION PROCESS USING GPUS AND RELIABLE NEURAL NETWORKS [S42404]: Method details and more
AI FOR HPC: From data-driven approaches to simulation-related topics. Next: Accelerating a 3D Conditional Generative Adversarial Network for Seismic Attenuation Compensation on a Multi-GPU Node [S41095] (fully data driven)
ACCELERATING A 3D CONDITIONAL GENERATIVE ADVERSARIAL NETWORK FOR SEISMIC ATTENUATION COMPENSATION ON A MULTI-GPU NODE [S41095]: Problem setting: attenuation compensation in seismic surveying
ACCELERATING A 3D CONDITIONAL GENERATIVE ADVERSARIAL NETWORK FOR SEISMIC ATTENUATION COMPENSATION ON A MULTI-GPU NODE [S41095]: Using Pix2Pix (a conditional GAN) for image restoration
ACCELERATING A 3D CONDITIONAL GENERATIVE ADVERSARIAL NETWORK FOR SEISMIC ATTENUATION COMPENSATION ON A MULTI-GPU NODE [S41095]: Model training flow, compute configuration choices, and other details
ACCELERATING A 3D CONDITIONAL GENERATIVE ADVERSARIAL NETWORK FOR SEISMIC ATTENUATION COMPENSATION ON A MULTI-GPU NODE [S41095]: Model training flow, compute configuration choices, and other details (continued)
ACCELERATING A 3D CONDITIONAL GENERATIVE ADVERSARIAL NETWORK FOR SEISMIC ATTENUATION COMPENSATION ON A MULTI-GPU NODE [S41095]: Example outputs and more
AI FOR HPC: From data-driven approaches to simulation-related topics. Next: Fourier Neural Operators and Transformers for Extreme Weather and Climate Prediction [S41936] (inductive bias)
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: The current state of climate science and the need for further speedups
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: The DestinE project and its significance
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: Applications of Physics-ML, and Modulus as a framework
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: Demo of Modulus + Omniverse
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: FourCastNet, a Physics-ML model for weather prediction
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: Example prediction results
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: Example prediction results (continued)
FOURIER NEURAL OPERATORS AND TRANSFORMERS FOR EXTREME WEATHER AND CLIMATE PREDICTION [S41936]: Fourier Neural Operator, an overview of the resolution-independent model (a minimal sketch of the idea follows)
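A minimal 1-D sketch of the Fourier-layer idea behind FNO: transform to mode space, apply learned complex weights to a fixed number of low-frequency modes, transform back. Because the weights live in mode space, the same layer applies at any input resolution; the weights below are random stand-ins for trained ones, not the talk's model:

```python
import numpy as np

N_MODES = 8                                   # retained low-frequency modes
rng = np.random.default_rng(5)
# "Learned" complex weights, one per retained mode (random stand-ins).
W_modes = rng.standard_normal(N_MODES) + 1j * rng.standard_normal(N_MODES)

def fourier_layer(u: np.ndarray) -> np.ndarray:
    """Spectral convolution: FFT, weight the low modes, inverse FFT."""
    u_hat = np.fft.rfft(u)                    # to mode space
    out_hat = np.zeros_like(u_hat)
    out_hat[:N_MODES] = u_hat[:N_MODES] * W_modes
    return np.fft.irfft(out_hat, n=len(u))   # back to physical space

# The same layer works at different grid resolutions -- the key to
# resolution-independent training and inference.
for n in (64, 256):
    x = np.linspace(0, 2 * np.pi, n, endpoint=False)
    print(n, fourier_layer(np.sin(3 * x)).shape)
```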
SUMMARY
Summary
• HPC for AI and AI for HPC
  • HPC for AI: training huge models on supercomputers
  • AI for HPC: classifications by application target and by how data is handled
• For faster and larger-scale training, frameworks and tooling continue to be developed; approaches that make large-scale training easier to achieve are being introduced one after another
• AI use in scientific computing is moving beyond direct approaches: leveraging GNNs, embedding PDEs into models, and so on