[DL Paper Reading Group] Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

1.

DEEP LEARNING JP [DL Papers]
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions
高城頌太 (Matsuo Lab, Graduate School of Engineering, The University of Tokyo; first-year PhD student, D1)
http://deeplearning.jp/

2.

Bibliographic information
• Title: Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions
  – https://arxiv.org/abs/2310.03016
  – ICLR 2024 Oral
• Authors: Satwik Bhattamishra, Arkil Patel, Phil Blunsom, Varun Kanade
  – University of Oxford, Mila and McGill University, Cohere
• Summary: understand how Transformers acquire in-context learning ability, by studying how they learn discrete functions
• Reason for selection: I want to know the limits of the tasks a Transformer can solve in context, and to apply the findings to language

3.

Outline
1. Introduction & Related Work
2. Set up for in-context learning
3. In-context learning Boolean Functions
4. In-context learning with Teaching Sequences
5. Investigations with Pretrained Models
6. Conclusion

4.

Outline (section 1: Introduction & Related Work)

5.

What is in-context learning?
• The ability of a language model to learn a task without any parameter updates when it is given a few examples in the form of demonstrations

6.

Related work: why does in-context learning work?
• Perspective of the pretraining data distribution
  – Data Distributional Properties Drive Emergent In-Context Learning in Transformers (NeurIPS 2022)
• Perspective of the properties of in-context learning
  – What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (NeurIPS 2022)
• Theoretical approaches (meta-learning, similarities to gradient descent, etc.)
  – Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers (ACL 2023)
  – What learning algorithm is in-context learning? Investigations with linear models (ICLR 2023)
  – Transformers learn in-context by gradient descent (ICML 2023)
  – Transformers as Algorithms: Generalization and Stability in In-context Learning (ICML 2023)

7.

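Research questions of this work
1. What are the limits of the in-context learning ability of Transformers?
2. Is attention necessary for in-context learning?
3. Can Transformers make efficient use of examples?
4. Can LLMs that were not trained specifically for the task implement learning algorithms in context?
Prompt: $P_k = (x_1, y_1, \dots, x_{k-1}, y_{k-1}, x_k) \to f(x_k)$, with $x \sim D_X$ and $f \sim D_{\mathcal{F}}$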

8.

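Contributions of this work
• In-context learning Boolean functions
  – Investigates in-context learning of Boolean functions
• Teaching Sequences
  – A set of examples that uniquely identifies a function
  – Investigates how in-context learning ability changes depending on whether the input contains a teaching sequence
• Investigation with LLMs
  – Repeats the same kind of investigation with LLMs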

9.

Outline (section 2: Set up for in-context learning)

10.

Problem setup
• Baseline
  – What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (NeurIPS 2022)
• Training sequences
  – $(\mathbf{x}_1, f(\mathbf{x}_1), \dots, \mathbf{x}_m, f(\mathbf{x}_m))$
• Prompt
  – $P_k = (\mathbf{x}_1, y_1, \dots, \mathbf{x}_{k-1}, y_{k-1}, \mathbf{x}_k)$
• Loss
  – M: model, N: number of sequences, m: number of examples
• $f: \{0,1\}^n \to \{0,1\}$, $\mathbf{x} \sim D_X$, $f \sim D_{\mathcal{F}}$

https://arxiv.org/abs/2208.01066
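The loss formula on this slide did not survive as text; only its legend (M, N, m) did. Given that legend and the prompt definition $P_k$, a plausible form is the per-prefix loss averaged over all prefixes and all sequences, sketched below. The per-example loss $\ell$ (cross-entropy, per the experimental setup), the sequence index $i$, and the exact normalisation are assumptions, not a quotation from the paper.

```latex
\mathcal{L}(M) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{m} \sum_{k=1}^{m}
  \ell\!\left( M\big(P_k^{(i)}\big),\; f_i\big(\mathbf{x}_k^{(i)}\big) \right),
\qquad
P_k^{(i)} = \big(\mathbf{x}_1^{(i)}, y_1^{(i)}, \dots, \mathbf{x}_{k-1}^{(i)}, y_{k-1}^{(i)}, \mathbf{x}_k^{(i)}\big)
```

Under this reading, the model is trained to predict every label in a sequence from the examples that precede it, so one sequence of m examples yields m supervised predictions.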

11.

Experimental setup
Model M
• Transformer
• LSTM
• DSS (state-space model)
• Hyena (long convolutional model)
• RetNet
Note: all models are trained from scratch. The loss is cross-entropy, and models are tuned over various depths and widths.

12.

Outline (section 3: In-context learning Boolean Functions)

13.

Question
1. Which kinds of Boolean functions can Transformers learn in context, and what are the limits?
2. Is attention necessary for in-context learning? How do attention-based and attention-free models differ?
Tasks examined
• Conjunctions and Disjunctions
• DNFs and CNFs
• Parities

14.

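Conjunctions and Disjunctions
• With $X_n = \{0,1\}^n$ and $\mathbf{x} \in X_n$, the output is determined by the logical AND (conjunction) or logical OR (disjunction) of input literals
• In the conjunction example below, n = 10 and the output is 1 when x2 = 1, x6 = 0, x7 = 1
  – ex: 0101101000 → 1
• There are $2^{2n}$ such functions
• Conjunction: $x_2 \wedge \bar{x}_6 \wedge x_7$
• Disjunction: $x_2 \vee \bar{x}_6 \vee x_7$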
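The conjunction example on this slide can be checked with a few lines of Python. This is a minimal sketch: the dictionary encoding of literals and the function names are illustrative choices, not the paper's code.

```python
from typing import Dict


def eval_conjunction(literals: Dict[int, int], x: str) -> int:
    """Return 1 iff every listed variable takes its required bit.

    `literals` maps a 1-based variable index to the bit it must equal
    (1 for a positive literal, 0 for a negated literal).
    """
    return int(all(int(x[i - 1]) == bit for i, bit in literals.items()))


def eval_disjunction(literals: Dict[int, int], x: str) -> int:
    """Return 1 iff at least one listed variable takes its required bit."""
    return int(any(int(x[i - 1]) == bit for i, bit in literals.items()))


# Slide example: n = 10, output is 1 when x2 = 1, x6 = 0, x7 = 1.
target = {2: 1, 6: 0, 7: 1}
print(eval_conjunction(target, "0101101000"))  # 1
print(eval_conjunction(target, "0001101000"))  # 0, because x2 = 0
print(eval_disjunction(target, "0001101000"))  # 1, because x6 = 0
```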

15.

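DNFs and CNFs
• DNFs: disjunctive normal form
  – 3-DNF: a formula that can be written as a disjunction of three conjunctive terms
  – ex: $(x_a \wedge x_b \wedge x_c) \vee (x_1 \wedge x_d \wedge x_e) \vee (x_f \wedge x_b \wedge x_c \wedge x_g)$
• CNFs: conjunctive normal form
  – 3-CNF: a formula that can be written as a conjunction of three disjunctive clauses
  – ex: $(x_a \vee x_b \vee x_c) \wedge (x_1 \vee x_d \vee x_e) \wedge (x_f \vee x_b \vee x_c \vee x_g)$
• Here a, …, g stand for variable indices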
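To make the two normal forms concrete, here is a small evaluator. It is only a sketch: the `(index, negated)` literal encoding and the example formula are hypothetical, chosen for illustration rather than taken from the slide.

```python
from typing import List, Sequence, Tuple

Literal = Tuple[int, bool]  # (1-based variable index, True if the literal is negated)


def literal_value(lit: Literal, x: Sequence[int]) -> bool:
    """Truth value of one literal under assignment x."""
    index, negated = lit
    return (x[index - 1] == 1) != negated


def eval_dnf(terms: List[List[Literal]], x: Sequence[int]) -> int:
    """OR over terms, each term being an AND of literals."""
    return int(any(all(literal_value(l, x) for l in term) for term in terms))


def eval_cnf(clauses: List[List[Literal]], x: Sequence[int]) -> int:
    """AND over clauses, each clause being an OR of literals."""
    return int(all(any(literal_value(l, x) for l in clause) for clause in clauses))


# Hypothetical groups of literals over 8 variables:
# (x1, not x2, x3), (x4, x5, not x6), (x2, x7, x8)
groups = [
    [(1, False), (2, True), (3, False)],
    [(4, False), (5, False), (6, True)],
    [(2, False), (7, False), (8, False)],
]

print(eval_dnf(groups, [1, 0, 1, 0, 0, 0, 0, 0]))  # 1: the first term is satisfied
print(eval_dnf(groups, [0, 0, 0, 0, 0, 0, 0, 0]))  # 0: no term is satisfied
print(eval_cnf(groups, [1, 0, 1, 0, 0, 0, 0, 1]))  # 1: every clause is satisfied
print(eval_cnf(groups, [0, 0, 0, 0, 0, 0, 0, 0]))  # 0: the third clause fails
```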

16.

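Parities
• Parity function
  – The output is the XOR of a subset of the input bits
  – Outputs 1 if the selected bits contain an odd number of 1s, and 0 if the number is even
  – ex: 11000 → 0, 11111000 → 1
• PARITY-n
  – Contains all $2^n$ parity functions over $\{0,1\}^n$
  – ex: PARITY-10 → $(x_1 \oplus x_a)$, $(x_b \oplus x_c \oplus x_d)$, … (a, …, d stand for variable indices)
• PARITY-(n,k)
  – Contains the parity functions over $\{0,1\}^n$ that have exactly k relevant variables
  – ex: PARITY-(10,3) → $(x_a \oplus x_b \oplus x_c)$, …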
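The parity examples on this slide, and the sizes of the two function classes, can be reproduced with the sketch below. The helper name and the particular subset used for PARITY-(10,3) are illustrative assumptions.

```python
from itertools import combinations
from math import comb
from typing import Iterable


def parity(subset: Iterable[int], x: str) -> int:
    """XOR of the bits of x at the given 1-based positions."""
    return sum(int(x[i - 1]) for i in subset) % 2


# Slide examples, taking the parity over all input bits.
print(parity(range(1, 6), "11000"))     # 0: two 1s
print(parity(range(1, 9), "11111000"))  # 1: five 1s

# PARITY-(10, 3): only k = 3 relevant variables; this subset is hypothetical.
print(parity((1, 4, 9), "1001000010"))  # 1: bits x1, x4, x9 are 1, 1, 1

# Class sizes: PARITY-n has one function per subset of the n variables,
# PARITY-(n, k) has one function per k-element subset.
n, k = 10, 3
num_parity_n = sum(1 for r in range(n + 1) for _ in combinations(range(n), r))
print(num_parity_n == 2 ** n)  # True
print(comb(n, k))              # 120
```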

17.

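Results
• Accuracy drops as the task becomes more complex, and on Parity the models fail to reach useful accuracy
• Attention-free architectures achieve comparable performance, but the Transformer performs best
• The gap is pronounced on Nearest Neighbor (perhaps attention is well suited to this task?)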

18.

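Results
• On problems such as conjunctions and disjunctions, performance is on par with FFNs trained by gradient descent and with existing PAC learning algorithms, and the number of samples needed for learning is also comparable
• The models are also robust to OOD inputs and to erroneous (mislabeled) examples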
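For reference, one classic PAC learner for conjunctions is the elimination algorithm: start with all 2n literals and delete every literal that some positive example falsifies. The sketch below is a minimal version; whether this is exactly the baseline used in the paper is an assumption, and the example data is constructed by hand for illustration.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[int, ...]
Literal = Tuple[int, int]  # (0-based variable index, required bit)


def bits(s: str) -> Point:
    """Convert a bit string such as '0101' to a tuple of ints."""
    return tuple(int(ch) for ch in s)


def learn_conjunction(examples: Iterable[Tuple[Point, int]], n: int) -> Set[Literal]:
    """Elimination algorithm: keep only literals consistent with all positives."""
    hypothesis = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in examples:
        if y == 1:
            # A positive example with x[i] = v rules out the literal requiring 1 - v.
            hypothesis -= {(i, 1 - x[i]) for i in range(n)}
    return hypothesis


def predict(hypothesis: Set[Literal], x: Point) -> int:
    """Evaluate the learned conjunction on x."""
    return int(all(x[i] == b for i, b in hypothesis))


# Target: x2 = 1 AND x6 = 0 AND x7 = 1 (1-based), i.e. indices 1, 5, 6 (0-based).
examples = [
    (bits("0101101000"), 1),
    (bits("1110001111"), 1),  # agrees on x2, x6, x7; differs on every other bit
    (bits("0001101000"), 0),  # negative examples are ignored by this learner
]
learned = learn_conjunction(examples, n=10)
print(sorted(learned))                       # [(1, 1), (5, 0), (6, 1)]
print(predict(learned, bits("0100001000")))  # 1
```

Negative examples play no role here, which is one reason the algorithm needs enough varied positive examples before irrelevant literals disappear.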

19.

Results
• Accuracy as the number of examples is varied

20.

Results

21.

Outline (section 4: In-context learning with Teaching Sequences)

22.

Question
• Can Transformers make efficient use of examples?
Teaching Sequences (Goldman & Kearns, 1995)
• A set of examples $(x_1, f(x_1), \dots, x_t, f(x_t))$ that uniquely identifies a function
• See Appendix F for how they are constructed
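To illustrate what "uniquely identifies a function" means, the sketch below builds a teaching sequence for a conjunction using a standard textbook-style construction (two positive examples that disagree on every irrelevant variable, plus one negative example per relevant literal) and verifies by brute force that exactly one conjunction is consistent with it. This construction is an assumption for illustration; the paper's Appendix F may use a different one.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

Conj = Dict[int, int]  # 0-based variable index -> required bit
Example = Tuple[Tuple[int, ...], int]


def evaluate(conj: Conj, x: Tuple[int, ...]) -> int:
    """Value of the conjunction on x (the empty conjunction is always 1)."""
    return int(all(x[i] == b for i, b in conj.items()))


def teaching_sequence(conj: Conj, n: int) -> List[Example]:
    """k + 2 labeled examples for a conjunction with k literals."""
    pos_zero = tuple(conj.get(i, 0) for i in range(n))  # irrelevant bits all 0
    pos_one = tuple(conj.get(i, 1) for i in range(n))   # irrelevant bits all 1
    seq: List[Example] = [(pos_zero, 1), (pos_one, 1)]
    for i in conj:
        flipped = list(pos_zero)
        flipped[i] = 1 - flipped[i]  # violate exactly one relevant literal
        seq.append((tuple(flipped), 0))
    return seq


def all_conjunctions(n: int) -> Iterator[Conj]:
    """Every variable is absent, required to be 0, or required to be 1."""
    for choice in product((None, 0, 1), repeat=n):
        yield {i: b for i, b in enumerate(choice) if b is not None}


def consistent(seq: List[Example], n: int) -> List[Conj]:
    """All conjunctions that agree with every labeled example in seq."""
    return [c for c in all_conjunctions(n)
            if all(evaluate(c, x) == y for x, y in seq)]


target = {0: 1, 2: 0}  # x1 = 1 AND x3 = 0 (1-based), over n = 4 variables
seq = teaching_sequence(target, n=4)
print(len(seq))                          # 4 examples for k = 2 literals
print(consistent(seq, n=4) == [target])  # True: only the target survives
```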

23.

Experimental setup
• Model: Transformer
• Data: the first t examples form the teaching sequence, and the remaining m − t examples are sampled from $D_X$
• Task: Conjunctions, Disjunctions, CNFs and DNFs, and sparse parities
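The data layout described here (teaching sequence first, then randomly sampled examples) can be sketched as follows. The uniform sampling used for $D_X$, the function name `build_sequence`, and the toy target are assumptions for illustration; the paper's actual input distribution may differ by task.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[int, ...]
Example = Tuple[Point, int]


def build_sequence(teaching: List[Example], f: Callable[[Point], int],
                   n: int, m: int, rng: random.Random) -> List[Example]:
    """Return m labeled examples: the t teaching examples, then m - t random ones."""
    t = len(teaching)
    if t > m:
        raise ValueError("teaching sequence is longer than the sequence length m")
    rest: List[Example] = []
    for _ in range(m - t):
        x = tuple(rng.randint(0, 1) for _ in range(n))  # assumed uniform D_X
        rest.append((x, f(x)))
    return list(teaching) + rest


def target(x: Point) -> int:
    """Toy target over n = 4 bits: x1 = 1 AND x3 = 0 (1-based)."""
    return int(x[0] == 1 and x[2] == 0)


# Hand-built teaching examples for the toy target: two positives that
# disagree on the irrelevant bits, and one negative per relevant literal.
teaching = [((1, 0, 0, 0), 1), ((1, 1, 0, 1), 1),
            ((0, 0, 0, 0), 0), ((1, 0, 1, 0), 0)]

sequence = build_sequence(teaching, target, n=4, m=10, rng=random.Random(0))
print(len(sequence))              # 10
print(sequence[:4] == teaching)   # True
```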

24.

Results
• The Transformer reaches close to 100% accuracy on all tasks
• When a teaching sequence is provided, it performs well even on Parity
  – In the vanilla setting it could not learn Parity

25.

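Mixing teaching sequences into training
• Teaching sequences used during training, but not at test time
• Teaching sequences not used during training, but used at test time
• Teaching sequences used with probability 1/2 during training, and not used at test time
• Teaching sequences used with probability 1/2 during training, and used at test time
In the mixed setting, is the model learning two distinct algorithms? (The authors say so, but it is unclear.)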

26.

Aside: two distinct algorithms
• The reviewers also asked about this point

27.

Outline (section 5: Investigations with Pretrained Models)

28.

Question
• What kinds of tasks can a pretrained LLM solve when it is given a prompt of the form $P_k = (x_1, y_1, \dots, x_{k-1}, y_{k-1}, x_k)$?
• Transformers trained from scratch perform well. Do LLMs perform well too?

29.

Experimental setup
1. Frozen GPT
  – model: pretrained GPT-2 (only the input and output layers are trained; everything else is frozen)
  – task: conjunctions, nearest neighbours
2. Direct Evaluation
  – model: gpt-4, gpt-3.5-turbo, llama2-70b
  – task: conjunctions, majorities

30.

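Results: Frozen GPT
• Performs well compared with the other baselines, but does not reach the performance of full training
• GPT-2 can implement Nearest Neighbours
  – The first 20 examples are given labels, and the following 80 points are predicted by nearest neighbour
  – Achieves close to 100% accuracy
  – Inspecting the model's attention heads reveals heads that attend to the label y of the nearest x in the preceding prompt (induction heads)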

https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
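The nearest-neighbour task described above can be sketched as a reference algorithm: label the first 20 points, then label each of the following 80 points with the label of its closest labeled point. Hamming distance, random labels for the first 20 points, and tie-breaking by earliest point are assumptions made for this sketch; the paper's exact setup may differ.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]
Example = Tuple[Point, int]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where a and b differ."""
    return sum(u != v for u, v in zip(a, b))


def nn_label(labeled: List[Example], x: Point) -> int:
    """Label of the closest labeled point (earliest point wins ties)."""
    nearest = min(labeled, key=lambda pair: hamming(pair[0], x))
    return nearest[1]


def make_nn_sequence(n: int, n_labeled: int, n_query: int,
                     rng: random.Random) -> Tuple[List[Example], List[Example]]:
    """Randomly labeled prefix, followed by queries labeled by nearest neighbour."""
    labeled = [(tuple(rng.randint(0, 1) for _ in range(n)), rng.randint(0, 1))
               for _ in range(n_labeled)]
    queries = []
    for _ in range(n_query):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        queries.append((x, nn_label(labeled, x)))
    return labeled, queries


toy = [((0, 0, 0), 0), ((1, 1, 1), 1)]
print(nn_label(toy, (1, 1, 0)))  # 1: closer to (1, 1, 1)
print(nn_label(toy, (0, 0, 1)))  # 0: closer to (0, 0, 0)

labeled, queries = make_nn_sequence(n=10, n_labeled=20, n_query=80,
                                    rng=random.Random(0))
print(len(labeled), len(queries))  # 20 80
```

An attention head that looks up the most similar earlier x and copies its y would implement the same rule, which is consistent with the induction-head observation on the slide.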

31.

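Results: Direct Evaluation
• For input dimension n up to 7 (inputs in $\{0,1\}^n$), performance is on par with or better than the nearest-neighbour (NN) baseline
• Beyond n = 15 performance becomes unreliable, although gpt-4 still performs well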

32.

Outline (section 6: Conclusion)

33.

Conclusion
Contributions
• In-context learning Boolean functions
• Teaching Sequences
• Investigation with LLMs
Future work
• Can LLMs perform sample-efficient in-context learning on language tasks?
• Why does the ability to learn algorithms in context emerge even though no explicit meta-learning is performed?
• What is the theoretical reason that parity is hard to learn in context?
• What mechanism enables in-context learning across different architectures?

34.

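Thoughts (speculative)
• ICL basically seems to work only when the x, y pairs in the context are linearly related, but if a teaching sequence can be found, might ICL also work for nonlinear relations such as PARITY?
• A teaching sequence being found = the task is fully identified
  – Related: task vectors in ICL, etc.
  – https://arxiv.org/abs/2212.04089
• I want to do this with language, not only Boolean functions
  – What would a teaching sequence be in that case?
• In LLMs, is ICL ultimately just finding nearest neighbours?
  – Probably not (it would not seem to explain CoT, for example)
  – Why does such an ability emerge from pretraining in the first place?
  – Why are other architectures worse at finding nearest neighbours?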

35.

Thank you.
