[DL Reading Group] The Topological Trouble With Transformers

June 04, 26

#Transformer #State Tracking #Recurrent Neural Networks #Machine Learning #Deep Learning

Deep Learning JP

@DeepLearning2023

DL Reading Group materials

Text of each slide

The Topological Trouble With Transformers
Kohsei Matsutani, Matsuo Lab

Paper information
- Title: The Topological Trouble With Transformers
- Authors: Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu
- Institution: Google DeepMind
- arXiv: https://arxiv.org/abs/2604.17121
- This is a position paper; it contains no experiments or theoretical results.

Overview
- The purely feedforward structure of the Transformer imposes a topological limit on "state tracking", i.e. on updating a state step by step over time.
- state tracking := the iterative updating of latent variables reflecting an evolving environment
- The paper gives a taxonomy of recurrent and continuous-thought Transformer architectures.
- Temporally extended cognition requires recurrent architectures, which shift the focus from explicit CoT traces to implicit activation dynamics.
- The paper proposes promising research directions.

State Tracking
- state tracking := the iterative updating of latent variables reflecting an evolving environment
- state ≈ belief state, world state, sufficient summary of the knowledge an agent has about its environment.
- LLM state-tracking failure, example 1: the game of Twenty Questions
  - A game in which the player guesses the number between 1 and 100 that the LLM has in mind (Gemini 3).
  - Even though the CoT trace explicitly states 42, the model's replies contradict it (two replies are marked "← contradiction" on the slide).
  - higher: the answer is larger than your guess
  - lower: the answer is smaller than your guess
  - you got it: correct!
- cf. Laban et al., LLMs Get Lost in Multi-Turn Conversation, ICLR 2026 Outstanding Paper, 2026.
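The contradiction in the number-guessing example can be made concrete with a small consistency check. This is only an illustrative sketch: the function name `consistent_secrets` and both transcripts are hypothetical and do not come from the slide or the paper. It tracks the set of secret numbers still compatible with every reply; an empty set means the replies cannot all come from one fixed hidden number.

```python
def consistent_secrets(transcript, low=1, high=100):
    """Return the secrets in [low, high] compatible with every (guess, reply) pair."""
    candidates = set(range(low, high + 1))
    for guess, reply in transcript:
        if reply == "higher":          # secret is larger than the guess
            candidates = {n for n in candidates if n > guess}
        elif reply == "lower":         # secret is smaller than the guess
            candidates = {n for n in candidates if n < guess}
        elif reply == "you got it":    # secret equals the guess
            candidates = {n for n in candidates if n == guess}
        else:
            raise ValueError(f"unknown reply: {reply!r}")
    return candidates


# Hypothetical transcripts, invented for illustration.
honest = [(50, "lower"), (25, "higher"), (42, "you got it")]
drifting = [(50, "lower"), (25, "higher"), (40, "lower"), (42, "you got it")]

print(consistent_secrets(honest))    # {42}
print(consistent_secrets(drifting))  # set()
```

In the second transcript the reply "lower" to 40 rules out 42, so the final "you got it" leaves no consistent secret.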

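State Tracking
- LLM state-tracking failure, example 2: polysemous words
  - Slide annotations on the example: "← Bank = River Bank", "← Bank = Financial Bank", "← contradiction"
- Maintaining and tracking a complete probability distribution over every possible state of the environment is impossible for both AI and humans, because the dimensionality explodes.
  - Did Fred really go to the river bank?
  - It might be a fishing pond.
  - It might be a river bank near a financial bank.
  - It might be a facility with an ATM.
  - The user might be asking a deliberately ambiguous question.
- Humans apparently (1) sample only a few candidates, (2) collapse a complex distribution into a typical one (fishing pole + bank -> river bank), and (3) build the mental model that best fits the premises (a concrete mental scene in which Fred took a fishing pole to the river bank; the most plausible interpretation; the maximum a posteriori (MAP) estimate).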

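State Tracking
- However, Transformers fail even at deterministic state tracking with finite memory.
- In a Transformer, activations flow from lower layers to upper layers.
- Because the state representation is pushed upward, state tracking is upper-bounded by the number of layers.
- [Figure: State Progression]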

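State Tracking
- State tracking in a Transformer is upper-bounded by the number of layers.
- At every step, a Transformer looks at the entire past through self-attention.
- RNNs and SSMs do not do this.
- Note: not every state-tracking problem requires depth that grows linearly with sequence length.
  - Example: recognizing regular languages up to length n, and graph connectivity on n vertices, need only log n layers.
  - The sequence does not have to be processed left to right; inputs can be paired up as in a tree.
  - For example, with 8 inputs, layer 1 merges neighbors, layer 2 merges two pairs, and layer 3 merges everything, so the problem is solved in 3 layers.
  - However, this is a statement about conductivity (expressivity), not learnability.
  - See Merrill and Sabharwal (2025), among others.
- Belief state cascade
  - With a limited number of layers, a representation obtained in a deep layer (bank = river bank) is not available to the shallow layers at the next step.
- Merrill, W. and Sabharwal, A. (2025). A little depth goes a long way: The expressive power of log-depth transformers. https://arxiv.org/abs/2503.03961.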
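The tree-style pairing described above can be sketched with bit parity, which a later slide names as an example task. The sketch below is illustrative only and is not a Transformer: `parity_sequential` and `parity_tree` are hypothetical helper names, and each "round" of pairwise merging stands in for one layer. It assumes a non-empty input.

```python
def parity_sequential(bits):
    """Left-to-right state update: one dependent step per input bit."""
    state, steps = 0, 0
    for b in bits:
        state ^= b
        steps += 1
    return state, steps


def parity_tree(bits):
    """Pairwise merging: each round plays the role of one layer."""
    level, rounds = list(bits), 0
    while len(level) > 1:
        # Merge neighbors (0,1), (2,3), ...; an unpaired last element is carried over.
        merged = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])
        level, rounds = merged, rounds + 1
    return level[0], rounds


bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(parity_sequential(bits))  # (0, 8)
print(parity_tree(bits))        # (0, 3)
```

Both functions compute the same answer; the difference is the length of the dependency chain, 8 sequential steps versus 3 rounds for 8 inputs. As the slide notes, this shows only that a shallow solution exists, not that training finds it.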

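State Tracking
- Even so, Transformers (large parallel models such as LLMs) work reasonably well in practice thanks to GPUs, especially with CoT.
- Why? -> They have recast the state-tracking problem as a working-memory problem.
- Transformers construct shortcut solutions.
  - In particular: lookback, associative scans, formal language understanding
  - Example: solving inherently sequential problems in parallel
    - e.g. the tasks on the previous slide (bit parity)
- Transformers can support state compositionality.
  - The state representation can be distributed across embeddings and updated asynchronously.
  - Information about Alice, information about Bob, event information, and so on can be separated into multiple token/embedding representations.
  - With an RNN, information about different states has to be packed into a single vector.
- Transformers need a recurrent architecture.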

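Recurrence Taxonomy
- Recurrence in Transformers can be classified along three axes:
  - layer/depth: recurrence along the layer direction
  - autoregressive steps: recurrence that updates the state at each step and passes it to the next step
  - input steps: the position/step of the input token
- Transformers, latent-thought models, and SSMs are classified from this viewpoint.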

Recurrence in Transformer
- (b) looped transformer, universal transformer
- (c) block-recurrent transformer
- (d) none yet

Recurrence in Latent Thought Model
- COCONUT

Recurrence in SSM

Promising Directions
1. Extend SSMs
   a. Linear SSMs do not exceed the expressivity of Transformers, but models such as DeltaNet (linear attention + delta rule) may reach higher expressivity when hybridized with Transformers (Merrill et al., 2026).
   [Figure: circuit complexity classes]
   Merrill et al., Olmo Hybrid: From Theory to Practice and Back, arXiv:2604.03444, 2026.
2. Have a feedforward Transformer approximate state tracking
   a. BST and NextLat are good, but compositional state representations should be taken into account.
   b. JEPA: a regularization that makes the trajectory of the hidden states linear.
   Teoh et al., Next-Latent Prediction Transformers Learn Compact World Models, arXiv:2511.05963, 2025.
   Huang et al., Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA, arXiv:2602.22617, 2026.
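The slide describes DeltaNet only as "linear attention + delta rule" and gives no formula. As a hedged illustration, the sketch below uses the commonly cited delta-rule form of the state update, S <- S + beta * (v - S k) k^T, written in plain Python for tiny dimensions. The names `delta_rule_step` and `read`, the toy keys and values, and the choice beta = 1 are assumptions made for this example, not details from the slide or the paper.

```python
def read(S, q):
    """Read from the associative memory: returns S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


def delta_rule_step(S, k, v, beta):
    """One delta-rule write: S <- S + beta * (v - S k) k^T.

    S is a d_v x d_k matrix stored as a list of rows, k is a key of
    length d_k, v is a value of length d_v, beta is the write strength.
    """
    err = [v_i - p_i for v_i, p_i in zip(v, read(S, k))]  # v - S k
    return [[s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
            for row, e_i in zip(S, err)]


S = [[0.0, 0.0], [0.0, 0.0]]
key_a, key_b = [1.0, 0.0], [0.0, 1.0]   # toy orthonormal keys (assumption)

S = delta_rule_step(S, key_a, [3.0, 4.0], beta=1.0)
S = delta_rule_step(S, key_a, [5.0, 6.0], beta=1.0)  # overwrite the value under key_a
S = delta_rule_step(S, key_b, [7.0, 8.0], beta=1.0)

print(read(S, key_a))  # [5.0, 6.0]
print(read(S, key_b))  # [7.0, 8.0]
```

With a purely additive linear-attention update (S <- S + v k^T), the two writes under `key_a` would sum to [8.0, 10.0]. The error term (v - S k) is what lets the second write replace the stored value, which is the kind of state update that state tracking needs.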

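Promising Directions
3. Coarse recurrence
   a. Coarsening the unit of recurrence reduces the computation cost of the recurrence bottleneck.
      i. Example: Thought Gestalt Model, an architecture that compresses each sentence into a latent thought vector and predicts the next token while referring to those vectors as working memory, so that language is modeled as a sequence of thoughts rather than a sequence of tokens.
   Borazjanizadeh and McClelland, Modeling Language as a Sequence of Thoughts, arXiv:2512.25026, 2025.
4. Exploit representation alignment
   a. Thanks to residual connections, the representations at the different layers of a Transformer are aligned to some extent.
   b. Can we use layer reuse, layer skipping, iterative inference, adaptive computation (varying the amount of compute per input), and methods that reuse intermediate-layer representations elsewhere?
      i. For example, canon layers
         1. A local, horizontal residual connection that lightly mixes the representations of nearby past tokens into each token's representation.
   Allen-Zhu, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, arXiv:2512.17351, 2025.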
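The one-line description of canon layers above (a local, horizontal residual connection mixing in nearby past tokens) can be illustrated with a toy causal mixing step. This is a sketch of the general idea only, not the layer from the cited paper: the function `local_causal_mix`, the fixed scalar weights, and the window size are assumptions chosen for the example.

```python
def local_causal_mix(hidden, weights):
    """out[t] = hidden[t] + sum_j weights[j] * hidden[t - 1 - j].

    hidden is a list of per-token vectors; weights[j] scales the token
    j + 1 positions back. Positions before the start are skipped, so
    the mixing is causal: out[t] never depends on tokens after t.
    """
    out = []
    for t, h_t in enumerate(hidden):
        mixed = list(h_t)                      # residual path: start from the token itself
        for j, w in enumerate(weights):
            src = t - 1 - j
            if src < 0:
                break
            mixed = [m + w * s for m, s in zip(mixed, hidden[src])]
        out.append(mixed)
    return out


hidden = [[1.0], [2.0], [4.0]]                 # three tokens, 1-dimensional for readability
print(local_causal_mix(hidden, [0.5, 0.25]))   # [[1.0], [2.5], [5.25]]
```

Each output keeps its own representation and adds a small weighted contribution from the one or two tokens before it, which passes information sideways between positions without going through attention.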

Promising Directions
5. Efficient training of recurrence
   a. Unlike a feedforward Transformer, a model with recurrence cannot be trained in parallel.
      i. Train the model as a feedforward Transformer, then introduce recurrence through post-training in a later stage.
      ii. Use truncated gradient methods
         1. Backpropagate only the gradient of the last step and detach the rest.
      iii. Recurrent backpropagation for attractor dynamics
      iv. Implementations that raise arithmetic intensity to improve GPU utilization
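The truncated-gradient item above (backpropagate only through the last step, detach the rest) can be illustrated on a toy scalar recurrence. This is a minimal sketch under stated assumptions: the recurrence h_t = tanh(w * h_{t-1} + x_t), the loss h_T, and the function names are invented for this example and are not taken from the paper. Gradients are written out by hand so that the example needs only the standard library.

```python
import math


def forward(w, xs, h0=0.0):
    """Run h_t = tanh(w * h_{t-1} + x_t) and return [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs


def full_grad(w, xs):
    """d h_T / d w by backpropagation through every time step."""
    hs = forward(w, xs)
    grad, upstream = 0.0, 1.0                 # upstream = d h_T / d h_t
    for t in range(len(xs), 0, -1):
        local = 1.0 - hs[t] ** 2              # derivative of tanh at step t
        grad += upstream * local * hs[t - 1]  # direct effect of w at step t
        upstream *= local * w                 # propagate to h_{t-1}
    return grad


def last_step_grad(w, xs):
    """Truncated variant: h_{T-1} is treated as a constant (detached)."""
    hs = forward(w, xs)
    return (1.0 - hs[-1] ** 2) * hs[-2]


w, xs = 0.7, [0.5, -0.3, 0.8]
print(full_grad(w, xs), last_step_grad(w, xs))
```

The truncated gradient keeps only the last term of the full sum, so its cost does not grow with the number of recurrent steps, at the price of ignoring how `w` influenced earlier states. For a single step the two gradients coincide.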