【DL輪読会】The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

November 21, 2024

Text of each slide
1.

DEEP LEARNING JP [DL Papers] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning 2024.11.21 Kexin Song (Matsuo-Iwasawa Lab, M2) http://deeplearning.jp/

2.

Information
• The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
  – arXiv: https://arxiv.org/abs/2411.07279
  – GitHub: https://github.com/ekinakyurek/marc?tab=readme-ov-file
• Authors: Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas (Massachusetts Institute of Technology)

3.

1. Abstract
Researchers used Test-Time Training (TTT) to temporarily update model parameters during inference, using loss functions derived from the input data. They validated TTT's effectiveness in enhancing LLM reasoning on the ARC (Abstraction and Reasoning Corpus) benchmark, analyzed the key components for applying TTT, and proposed two innovations: TTT data generation and a self-consistency component. Results showed that models with TTT rivaled or outperformed symbolic reasoning models on ARC.

4.

2. Test-Time Training
– First introduced in 2020 for visual models by UC Berkeley and UCSD: https://arxiv.org/abs/1909.13231
– Core mechanism:
  • Dynamic parameter updates during inference using explicit gradient steps, allowing the model to handle unseen or shifted data distributions more effectively.
– Operates under low-data conditions:
  • Unsupervised learning with single inputs.
  • Supervised learning with one or two labeled examples.

5.

The general TTT process
1. Start with pre-trained model parameters θ_0.
2. Generate a small training dataset D_TTT from the test data.
3. Update the parameters to θ_d by minimizing a loss function L(D_TTT; θ).
4. Use the updated parameters for prediction.
5. Reset the parameters to the original state θ_0 after each test input or batch.
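
A minimal sketch of this loop in PyTorch. The helpers `build_ttt_dataset` and `loss_fn` are hypothetical placeholders for steps 2 and 3; the paper's repository implements these steps differently.

```python
import copy
import torch

def test_time_training(model, test_input, build_ttt_dataset, loss_fn,
                       lr=1e-4, steps=2):
    """Generic TTT loop (a sketch, not the paper's implementation): adapt the
    model on a small dataset built from one test input, predict, then restore
    the original weights."""
    theta_0 = copy.deepcopy(model.state_dict())    # 1. pre-trained parameters
    d_ttt = build_ttt_dataset(test_input)          # 2. build D_TTT from the test data
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):                         # 3. minimize L(D_TTT; theta)
        for batch in d_ttt:
            optimizer.zero_grad()
            loss_fn(model, batch).backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        prediction = model(test_input)             # 4. predict with theta_d
    model.load_state_dict(theta_0)                 # 5. reset to theta_0
    return prediction
```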

6.

Key Design Choices in TTT
• Challenges:
  – TTT's design space is vast, but effective strategies for new-task learning remain unclear.
  – Need to understand interactions with pre-training and sampling strategies.
• Contributions:
  – Systematic study of TTT design choices and their effects.
  – Identification of critical components for few-shot learning:
    • Initial fine-tuning on similar synthetic tasks.
    • Enhanced leave-one-out task generation.
    • Training per-instance adapters.
    • Self-consistency under reversible transformations.

7.

ARC Challenge
1. ARC Challenge overview:
  – The Abstraction and Reasoning Corpus (ARC) tests abstract reasoning abilities through visual puzzles.
  – Each task consists of 2D grids of shapes or patterns with up to 10 colors, where input-output pairs are related by a shared transformation rule y = f(x).
  – The goal is to predict y_test for x_test by reasoning about the transformation.
2. Two approaches:
  – Program synthesis methods: discover the transformation function f and apply it to the test examples.
  – Fully neural methods: predict the output y_test directly, without explicitly modeling f (the approach used in this paper).
3. Method in this study:
  – A formatting function (denoted str) converts 2D grids into string representations for input to LMs.
(Figure: example of an ARC task.)
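
A plausible sketch of such a formatting function; the function name `grid_to_str` and the exact string format are illustrative, not taken from the paper.

```python
def grid_to_str(grid: list[list[int]]) -> str:
    """Serialize a 2D ARC grid of color indices (0-9) into a plain-text
    representation an LM can consume: one row per line, cells separated by
    spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Example: a 2x3 grid using three colors.
print(grid_to_str([[0, 1, 1],
                   [2, 0, 2]]))
# 0 1 1
# 2 0 2
```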

8.

3.1 Data Generation for TTT
Two-step process:
– Leave-one-out tasks: exclude one input-output pair as the test example, and use the remaining pairs as training examples.
– Rule-based augmentations: apply reversible transformations such as rotation, flipping, and scaling.
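
A sketch of this two-step process, assuming each task is given as a list of (input grid, output grid) pairs; `rotate90` and `flip_h` stand in for the paper's larger set of reversible transformations.

```python
import random

# Two illustrative reversible grid transformations.
def rotate90(grid):
    return [list(row) for row in zip(*grid[::-1])]

def flip_h(grid):
    return [row[::-1] for row in grid]

AUGMENTATIONS = [rotate90, flip_h]

def leave_one_out_tasks(pairs):
    """Step 1: hold out each demonstration pair in turn as the test example
    and keep the remaining pairs as in-context training examples."""
    tasks = []
    for i, held_out in enumerate(pairs):
        train = pairs[:i] + pairs[i + 1:]
        tasks.append({"train": train, "test": held_out})
    return tasks

def augment_tasks(tasks):
    """Step 2: add a copy of each task with a reversible transformation
    applied to every grid."""
    augmented = []
    for task in tasks:
        t = random.choice(AUGMENTATIONS)
        augmented.append({
            "train": [(t(x), t(y)) for x, y in task["train"]],
            "test": (t(task["test"][0]), t(task["test"][1])),
        })
    return tasks + augmented
```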

9.

Comparison with E2E Learning:
• End-to-end approach:
  – Treats input-output pairs independently as supervised examples.
  – Does not maintain in-context demonstrations.
  – More computationally efficient.
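
To make the difference concrete, a sketch of the two data formats (it reuses the `grid_to_str` helper sketched earlier; the prompt wording is illustrative).

```python
def icl_example(train_pairs, test_input):
    """In-context format: all demonstrations plus the query in one prompt."""
    demos = "\n\n".join(
        f"Input:\n{grid_to_str(x)}\nOutput:\n{grid_to_str(y)}"
        for x, y in train_pairs
    )
    return f"{demos}\n\nInput:\n{grid_to_str(test_input)}\nOutput:\n"

def e2e_examples(train_pairs):
    """End-to-end format: each pair becomes an independent (prompt, target)
    example, with no demonstrations kept in context."""
    return [
        (f"Input:\n{grid_to_str(x)}\nOutput:\n", grid_to_str(y))
        for x, y in train_pairs
    ]
```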

10.

3.2 Optimization Objectives in TTT
• LoRA optimization:
  – Task-specific parameter updates while most base-model weights stay frozen.
  – Efficient computation with retained general model capabilities.
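
A minimal sketch of setting up a per-task LoRA adapter with Hugging Face PEFT; the model name, rank, and target modules below are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)     # base weights stay frozen
model.print_trainable_parameters()         # only the adapter weights are trainable
```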

11.

Results and Impact of TTT
• TTT accuracy boost:
  – Accuracy increased roughly sixfold (from 5 to 29).
• In-context learning (ICL) format:
  – Outperformed the end-to-end format.
  – The E2E approach showed a 38% performance drop under identical conditions.

12.

4.1 Inference Enhancements
• Challenge: ARC provides no chain-of-thought (CoT) traces, so standard majority voting is not directly applicable.
• Augmented inference strategy:
  – Use invertible transformations to generate diverse predictions.
  – Apply hierarchical voting to select the best predictions.
  – Advantage: reduces bias from the ordering of demonstration sequences.
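
A sketch of transformation-augmented inference; `predict` stands for any function mapping (demonstrations, test grid) to a predicted grid, and the transformation list is deliberately abbreviated.

```python
def identity(grid):
    return grid

def rotate90(grid):
    return [list(row) for row in zip(*grid[::-1])]

def rotate270(grid):
    return rotate90(rotate90(rotate90(grid)))

# (name, transform, inverse) triples: the inverse maps predictions back.
TRANSFORMS = [
    ("identity", identity, identity),
    ("rot90", rotate90, rotate270),
]

def augmented_predictions(predict, train_pairs, test_input):
    """Run the model on reversibly transformed versions of the task and map
    each prediction back to the original orientation before voting."""
    candidates = []
    for name, t, t_inv in TRANSFORMS:
        demos = [(t(x), t(y)) for x, y in train_pairs]
        pred = predict(demos, t(test_input))
        candidates.append((name, t_inv(pred)))
    return candidates
```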

13.

4.2 Integrated Prediction: Voting Strategy
• Two-stage voting process:
  1. Intra-transformation voting:
     – Group predictions by transformation type t.
     – Select the 3 most frequent predictions in each group.
     – Supplement with row-majority and column-majority predictions if necessary.
  2. Global voting:
     – Combine the candidates from intra-transformation voting.
     – Select the top 2 overall predictions.
     – Prioritize identity-transformation predictions in case of ties.
• Results:
  – Self-consistency voting enhances accuracy, in line with prior findings.
  – Hierarchical voting outperforms flattened voting.
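
A simplified sketch of the two-stage voting (omitting the row/column-majority fallback); candidates are (transform_name, grid) pairs such as those produced by the augmented-inference sketch above.

```python
from collections import Counter

def hierarchical_vote(candidates, per_transform=3, final=2):
    """Stage 1: within each transformation, keep the most frequent predictions.
    Stage 2: vote globally over the survivors, breaking ties in favor of the
    identity transformation."""
    # Stage 1: intra-transformation voting.
    by_transform = {}
    for name, pred in candidates:
        by_transform.setdefault(name, []).append(tuple(map(tuple, pred)))
    finalists = []
    for name, preds in by_transform.items():
        for pred, _ in Counter(preds).most_common(per_transform):
            finalists.append((name, pred))
    # Stage 2: global voting; identity predictions win ties.
    counts = Counter(pred for _, pred in finalists)
    identity_preds = {pred for name, pred in finalists if name == "identity"}
    ranked = sorted(counts.items(),
                    key=lambda kv: (kv[1], kv[0] in identity_preds),
                    reverse=True)
    return [list(map(list, pred)) for pred, _ in ranked[:final]]
```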

14.

5.1 Data Preparation for Fine-Tuning
1. Using existing generators: REARC generators.
2. Few-shot prompting with large models:
  – Use few-shot examples to create new generator functions g'.
  – Uniformly sample m examples from the existing dataset, and repeat to produce many tasks.
  – Enhance the generators with task descriptions (categories, summaries, and descriptions).
3. Geometric transformations.
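
A rough sketch of the few-shot prompting step; `existing_generators` is assumed to be a list of generator-function source strings, `llm` stands for any completion API, and the prompt wording is purely illustrative.

```python
import random

def build_generator_prompt(existing_generators, m=3, task_description=None):
    """Sample m existing REARC-style generator functions as demonstrations and
    ask a large LM to write a new generator g' in the same style."""
    demos = random.sample(existing_generators, m)
    prompt = "Here are example ARC task generator functions:\n\n"
    prompt += "\n\n".join(demos)
    if task_description:
        prompt += f"\n\nTask description: {task_description}"
    prompt += "\n\nWrite a new generator function in the same style."
    return prompt

# Usage (hypothetical): new_generator_code = llm(build_generator_prompt(sources))
```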

15.

5.2 Impact of Fine-Tuning Data
• Best-performing data: REARC plus rule-based augmentation yielded the best results.
• LM-generated tasks: caused a 5% drop in performance, indicating a need for better filtering mechanisms.
• Fine-tuning vs. TTT performance: no significant correlation between the two.
• Model size: TTT enabled smaller models (1B, 3B) to achieve accuracy comparable to larger models.

16.

6. Limitations and Future Directions
Limitations:
• High computational cost.
• Limited statistical analysis due to computational constraints.
Future work:
• Optimizing augmentation strategies.
• Exploring broader applications of TTT.

17.

Conclusion
• TTT is a powerful mechanism for improving performance on reasoning tasks.
• It combines test-time compute with task-specific adaptation.
• It bridges the gap between neural and symbolic reasoning approaches.