Attention_Is_All_You_Need_Summary

September 12, 2025

#Transformer #AttentionMechanism #NaturalLanguageProcessing #MachineTranslation #DeepLearning

Slide overview

For prompt testing

kokuren

@kokuren333

Medical student. Interested in AI and in making work and study more efficient.

Text of each page

Attention Is All You Need
A walkthrough of the paper by A. Vaswani et al. | 2025.09.12
Google Brain, Google Research, University of Toronto

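Agenda
01 Overview and background of the paper
02 Architecture of the Transformer model
03 The attention mechanism
04 Advantages of self-attention
05 Experimental results and discussion
06 Conclusion and future outlook
Google Brain, Google Research, University of Toronto | Attention Is All You Need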

1. Overview and background of the paper

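Prior challenge: the limits of sequence transduction models
The dominant models (RNN, LSTM, GRU) have a recurrent structure that processes data sequentially.
This sequential nature prevents parallelization within a training example, which makes training on long sequences especially hard.
The longer the sequence, the harder it is to learn dependencies between distant positions (the long-range dependency problem).
Attention had already been introduced, but it was mostly used together with an RNN.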

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Contribution of this paper: the Transformer
Eliminates recurrence and convolution entirely.
Proposes a simple yet powerful architecture built only from attention mechanisms.
Allows far more parallelization and sharply reduces training time.
Achieves state-of-the-art (SOTA) results on machine translation, outperforming even existing ensemble models.

2. Architecture of the Transformer model

Overall picture: an encoder-decoder model
Follows the encoder-decoder structure that is standard for sequence transduction models.
Encoder (left): maps the input sequence to a sequence of continuous representations.
Decoder (right): generates the output sequence one token at a time, based on the encoder output.
Both replace recurrence with stacked self-attention and position-wise feed-forward networks.

Main components
Encoder stack: N=6 identical layers. Each layer has two sub-layers, Multi-Head Self-Attention and a Position-wise Feed-Forward Network.
Decoder stack: N=6 identical layers. In addition to the two encoder sub-layers, each layer has an Encoder-Decoder Attention sub-layer that attends over the encoder output, for three sub-layers in total.
Connections and normalization: a residual connection followed by layer normalization wraps every sub-layer, which helps prevent vanishing gradients and stabilizes training.
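
The residual-plus-normalization wiring around each sub-layer can be sketched as below. This is a minimal NumPy sketch, assuming the form LayerNorm(x + Sublayer(x)) used in the paper. The names `layer_norm` and `residual_block`, the epsilon value, the toy sizes, and the random matrix standing in for a sub-layer are illustrative assumptions, and the learnable gain and bias of layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Add the sub-layer output to its input, then apply layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, d_model = 8 (toy size)
w = rng.normal(size=(8, 8)) * 0.1    # hypothetical stand-in for a sub-layer
out = residual_block(x, lambda h: h @ w)
```

Because the sub-layer output is added to its input, the input and output of every sub-layer must share the same dimension, which is why the paper keeps d_model fixed across the stack.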

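Injecting position information: Positional Encoding
The Transformer has no recurrence, so the model must be given the order of the tokens in the sequence.
A vector carrying each token's position (the positional encoding) is added to the input embedding.
The paper uses fixed values from sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here pos is the position and i is the dimension index. The authors hypothesize that this makes it easy for the model to attend by relative position.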
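
The two formulas can be evaluated directly. The following is a minimal NumPy sketch, assuming an even d_model. The function name `positional_encoding` and the toy sizes are illustrative and do not come from the slides.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]             # positions, shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
```

In use, the first `seq_len` rows would be added element-wise to the token embeddings, for example `embeddings + pe[:seq_len]`, where `embeddings` is a hypothetical (seq_len, d_model) array.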

3. The attention mechanism

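Basic concept of attention
Attention can be described as a function that maps a query (Q) and a set of key-value (K-V) pairs to an output, where all of these are vectors.
The output is a weighted sum of the value vectors.
The weight on each value is determined by the similarity between the query and the corresponding key.
Intuitively, the mechanism directs "attention" to the information (keys) relevant to a given piece of information (the query) and retrieves mainly the content (values) of that information.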

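Scaled Dot-Product & Multi-Head Attention
Scaled Dot-Product Attention: the basic unit used in the Transformer. Similarity is computed as the dot product of query and key, then divided by √d_k.
Attention(Q, K, V) = softmax( (Q * K^T) / √d_k ) * V
Multi-Head Attention: runs attention h times in parallel. Q, K, and V are transformed by different learned linear projections for each head, so each head captures relationships in a different representation subspace. The head outputs are concatenated and projected once more, which gives the model a richer representation.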
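
Both operations fit in a few lines of NumPy. This is a minimal sketch, not the authors' implementation. It assumes d_model is divisible by h and implements the per-head projections as one (d_model, d_model) matrix that is split into h slices, which is equivalent to h separate projections. All function names, the random weights, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V; return (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # suppress disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o, h):
    """Project, split into h heads, attend per head, concatenate, project."""
    def split(t):  # (n, d_model) -> (h, n, d_model // h)
        n, d_model = t.shape
        return t.reshape(n, h, d_model // h).transpose(1, 0, 2)
    q, k, v = split(x_q @ w_q), split(x_kv @ w_k), split(x_kv @ w_v)
    heads, _ = scaled_dot_product_attention(q, k, v)          # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return concat @ w_o

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 8                     # toy sizes
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, x, w_q, w_k, w_v, w_o, h)   # self-attention: Q, K, V all from x
_, attn = scaled_dot_product_attention(x, x, x)           # each row of attn sums to 1
```

The division by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.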

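Three kinds of attention in the Transformer
Encoder Self-Attention: inside the encoder, computes the relationships among all words of the input sequence. Captures the internal structure of the sentence.
Masked Decoder Self-Attention: inside the decoder, lets each position of the output sequence attend only to positions up to and including itself (future information is masked). This preserves the auto-regressive property.
Encoder-Decoder Attention: lets the decoder decide which input words from the encoder to focus on. This is the most conventional form of attention and handles, for example, alignment in translation.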
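
The masking used in decoder self-attention can be illustrated with a lower-triangular mask. This is a minimal single-head NumPy sketch that, as a simplification, uses the input itself as queries, keys, and values with no learned projections. The names `causal_mask` and `masked_self_attention` are illustrative assumptions.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_self_attention(x):
    """Single-head self-attention over x with future positions masked out."""
    n, d_k = x.shape
    scores = x @ x.T / np.sqrt(d_k)
    scores = np.where(causal_mask(n), scores, -np.inf)    # hide future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 output positions, toy dimension 8
out, weights = masked_self_attention(x)
```

Every weight above the diagonal is exactly zero, so the first position can only attend to itself and its output equals its own input vector.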

4. Advantages of self-attention

Computational comparison of layer types (Table 1)
Layer type                    Complexity per layer   Minimum sequential operations   Maximum path length
Self-Attention                O(n^2 * d)             O(1)                            O(1)
Recurrent (RNN)               O(n * d^2)             O(n)                            O(n)
Convolutional (CNN)           O(k * n * d^2)         O(1)                            O(log_k(n))
Self-Attention (Restricted)   O(r * n * d)           O(1)                            O(n/r)
n: sequence length, d: representation dimension, k: convolution kernel size, r: neighborhood size in restricted self-attention.

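Why is self-attention powerful?
Parallel computation: the number of sequential operations is O(1), so computation within a layer can be fully parallelized, whereas an RNN needs O(n) sequential steps.
Learning long-range dependencies: the path length between any two tokens is O(1), the shortest possible. Dependencies between distant words are therefore captured directly and are easier to learn.
Interpretability: visualizing the attention weights shows which parts of the sentence the model attends to, which can make the model more interpretable.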
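
The per-layer complexities in Table 1 also show when self-attention is cheaper than recurrence: the ratio of O(n^2 * d) to O(n * d^2) is n/d, so self-attention does less work per layer whenever the sequence length n is smaller than the dimension d. The sketch below plugs in numbers. It drops constant factors, the function names are illustrative, and the sequence lengths are hypothetical; d = 512 is the d_model of the paper's base model.

```python
def self_attention_cost(n, d):
    """Per-layer operation count up to constant factors: O(n^2 * d)."""
    return n * n * d

def recurrent_cost(n, d):
    """Per-layer operation count up to constant factors: O(n * d^2)."""
    return n * d * d

d = 512                      # d_model of the base model in the paper
for n in (64, 512, 4096):    # hypothetical sequence lengths
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(f"n={n:5d}  self-attention / recurrent = {ratio:.3f}")
```

For very long sequences the quadratic term dominates, which is the motivation for the restricted variant in the last row of the table.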

5. Experimental results and discussion

Results on machine translation (WMT 2014)
English-to-German (EN-DE): BLEU 28.4. This improves on the previous best models, ensembles included, by more than 2.0 BLEU and sets a new state of the art (SOTA).
English-to-French (EN-FR): BLEU 41.8, a new single-model SOTA at the time.
Notably, these results came at a small fraction of the training cost of earlier models (3.5 days on 8 P100 GPUs).

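Analysis of model variations
Number of attention heads (h): quality degrades with too few heads and also with too many. The base setting h=8 was among the best.
Key dimension (d_k): reducing it hurts quality. This suggests that determining compatibility is not easy and that a compatibility function more sophisticated than a plain dot product may help.
Model size: larger models tended to perform better.
Regularization: dropout was very effective at preventing over-fitting.
Positional encoding: learned positional embeddings and the sin/cos functions proposed in the paper gave nearly identical results. The sin/cos version was chosen because it may extrapolate to sequences longer than those seen in training.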

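Generalization to other tasks: English constituency parsing
The model was applied to English constituency parsing with almost no task-specific tuning.
Trained only on a small dataset (WSJ, about 40K sentences), it still outperformed the BerkeleyParser, a strong baseline at the time.
With large semi-supervised data, it reached an F1 score of 92.7, better than all previously reported models except the Recurrent Neural Network Grammar.
This indicates that the Transformer is not an architecture specialized for one task but works well as a general-purpose sequence transduction model.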

6. Conclusion and future outlook

We are excited about the future of attention-based models and plan to apply them to other tasks.

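Summary and conclusion
This work proposed the Transformer, the first sequence transduction model based entirely on attention.
Removing recurrence enabled much more parallel and much faster training.
It set a new SOTA on machine translation and went on to trigger a paradigm shift in natural language processing research.
It showed self-attention to be a strong alternative to RNNs and CNNs in computational efficiency, learning of long-range dependencies, and generalization.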

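Future outlook
Applying the model to modalities other than text (images, audio, video).
Investigating local, restricted attention mechanisms to handle large inputs and outputs such as images efficiently.
Making generation less sequential.
(After the paper was published, many successor models such as BERT and the GPT series built on the Transformer.)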