[Paper Reading Group] Latent Dirichlet Allocation


June 28, 25

#topic model #LDA #latent Dirichlet allocation #topic modeling #variational Bayesian inference #pLSI

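Slide overview

Kyoto University Artificial Intelligence Research Society KaiRA

@kyoto-kaira

A self-organized study circle at Kyoto University for students who want to study AI and machine learning. If you are interested in our circle, please see our X (Twitter) account!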


Text of each page

Latent Dirichlet Allocation (LDA)
Blei et al., 2003
Presenter: 栗林雷旗 (KaiRA working-professional member)

Background

Drawbacks of pLSI (probabilistic Latent Semantic Indexing):
• The number of parameters grows linearly with the size of the corpus, which leads to overfitting.
• It is not clear how to assign a probability to a document outside the training set.
(Reference: Blei et al., 2003)

Dimensionality reduction methods such as pLSI rest on the bag-of-words assumption (exchangeability).
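The linear growth mentioned above can be made concrete with the parameter counts given in Blei et al. (2003): kV + kM for pLSI versus k + kV for LDA, where k is the number of topics, V the vocabulary size, and M the number of training documents. The following is a minimal sketch; the function names and the example sizes are illustrative assumptions, not taken from the slides.

```python
def plsi_param_count(k: int, vocab_size: int, num_docs: int) -> int:
    """pLSI: k topic-word multinomials (k*V) plus one k-dimensional
    topic mixture per training document (k*M)."""
    return k * vocab_size + k * num_docs


def lda_param_count(k: int, vocab_size: int) -> int:
    """LDA: the k-dimensional Dirichlet parameter alpha plus the k x V matrix beta."""
    return k + k * vocab_size


# Hypothetical sizes: 10 topics, a 5,000-word vocabulary, growing corpus.
for num_docs in (1_000, 10_000, 100_000):
    print(num_docs, plsi_param_count(10, 5_000, num_docs), lda_param_count(10, 5_000))
```

Only the pLSI count depends on the number of documents, which is the source of the overfitting concern.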

LDA (source: https://www.youtube.com/@SerranoAcademy)


LDA's Role in the Machine Learning Workflow

LDA belongs to the training phase of the machine learning workflow.

Workflow: Data collection → Data cleaning and preparation → Model selection → Training → Evaluation → Tuning → Deployment → Monitoring and updating

LDA Architecture

LDA (Latent Dirichlet Allocation) is a generative probabilistic model of a corpus.

[Figure: plate diagram of LDA. Nodes: $\alpha$ (Dirichlet parameter), $\theta_d$ (document-topic distribution, the frequency of topics per document), $z_{dn}$ (word-topic assignment), $w_{dn}$ (observed word), $\beta$ (creation of topics). Plates: $N$ (number of words in a given document), $M$ (number of documents). Illustration labels: Dataset, topics 1 to 4, documents D1 to D4.]

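LDA Generative Process

Given the parameters $\alpha$ and $\beta$, the joint distribution of $\theta$, $\mathbf{z}$, and $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

where
• $\theta$: a topic mixture
• $\mathbf{z}$: a set of $N$ topics
• $\mathbf{w}$: a set of $N$ words

Example of $\beta$ ($k \times V$):

        w_1   w_2   w_3   w_4
z_1     0.3   0.1   0.4   0.1
z_2     0.1   0.6   0.1   0.1
z_3     0.2   0.2   0.2   0.2

[Figure: graphical model $\alpha \to \theta \to z_n \to w_n \leftarrow \beta$]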
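The generative process above can be sketched for a single document using only the Python standard library. This is an illustrative sketch, not the presenter's code: the helper names, the seed, and the values of `ALPHA` and `BETA` are assumptions (the `BETA` rows here are hypothetical and chosen to sum to 1). The last function evaluates the log of the joint distribution term by term.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, z, w): theta ~ Dir(alpha), z_n ~ Cat(theta), w_n ~ Cat(beta[z_n])."""
    theta = sample_dirichlet(alpha, rng)
    topics = range(len(alpha))
    vocab = range(len(beta[0]))
    z, w = [], []
    for _ in range(n_words):
        z_n = rng.choices(topics, weights=theta)[0]
        w_n = rng.choices(vocab, weights=beta[z_n])[0]
        z.append(z_n)
        w.append(w_n)
    return theta, z, w


def log_joint(theta, z, w, alpha, beta):
    """log p(theta, z, w | alpha, beta) = log p(theta | alpha)
    + sum_n [log p(z_n | theta) + log p(w_n | z_n, beta)]."""
    log_dirichlet = (
        math.lgamma(sum(alpha))
        - sum(math.lgamma(a) for a in alpha)
        + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    )
    log_words = sum(
        math.log(theta[z_n]) + math.log(beta[z_n][w_n]) for z_n, w_n in zip(z, w)
    )
    return log_dirichlet + log_words


# Hypothetical parameters: k = 3 topics, V = 4 words; each row of BETA sums to 1.
ALPHA = [1.0, 1.0, 1.0]
BETA = [
    [0.40, 0.10, 0.40, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

rng = random.Random(0)
theta, z, w = generate_document(ALPHA, BETA, n_words=8, rng=rng)
print(theta, z, w, log_joint(theta, z, w, ALPHA, BETA))
```

Each word gets its own topic draw, so a single document can mix several topics; that is the role of the per-document mixture $\theta$.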

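Clustering Flow

Step 1: Apply variational Bayesian inference to the corpus to estimate the parameters of the Dirichlet distributions.
Step 2: Using the estimated parameter values, run collapsed Gibbs sampling to assign a topic to each word in each document.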
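Step 2 can be illustrated with a minimal collapsed Gibbs sampler. This sketch assumes symmetric scalar hyperparameters `alpha` and `beta` that are simply passed in as fixed values, standing in for the output of Step 1 (the variational estimation itself is not shown). It uses the standard collapsed conditional for LDA, p(z_i = k | rest) ∝ (n_dk + alpha) (n_kw + beta) / (n_k + V beta); the function name, arguments, and toy corpus are assumptions for illustration, not the presenter's implementation.

```python
import random


def collapsed_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """Assign a topic to every word token. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_kw
    topic_total = [0] * n_topics                              # n_k
    assignments = []

    # Random initialization of topic assignments and count tables.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its new topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return assignments, doc_topic, topic_word


# Hypothetical toy corpus over a 4-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, n_dk, n_kw = collapsed_gibbs(toy_docs, n_topics=2, vocab_size=4,
                                alpha=0.5, beta=0.1, n_iters=50)
print(z)
print(n_dk)
```

The per-document counts `n_dk` give a rough picture of which topics dominate each document, which is what the clustering step relies on.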

Parameter Inference (Topic Distribution in Each Document)

$\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_k)$

Parameter Inference (Word Distribution in Each Topic)

$\vec{\beta} = (\beta, \beta, \ldots, \beta)$
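The two parameter vectors differ in shape: the document-side vector has one entry per topic, while the word-side vector repeats a single value. A Dirichlet with parameters $(a_1, \ldots, a_n)$ has mean $a_i / \sum_j a_j$, so a symmetric vector always has a uniform mean. The sketch below checks this numerically; the concrete values and function names are hypothetical, chosen only for illustration.

```python
import random


def dirichlet_mean(params):
    """Mean of Dirichlet(params): each component is a_i / sum_j a_j."""
    total = sum(params)
    return [p / total for p in params]


def sample_dirichlet(params, rng):
    """Draw one sample by normalizing independent Gamma(a_i, 1) draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_mean(params, n_samples, rng):
    """Average of n_samples Dirichlet draws, component by component."""
    samples = [sample_dirichlet(params, rng) for _ in range(n_samples)]
    return [sum(s[i] for s in samples) / n_samples for i in range(len(params))]


alpha_vec = [2.0, 1.0, 0.5]  # hypothetical asymmetric vector (alpha_1, ..., alpha_k)
beta_vec = [0.1] * 4         # hypothetical symmetric vector (beta, ..., beta)

rng = random.Random(0)
print(dirichlet_mean(alpha_vec), empirical_mean(alpha_vec, 5_000, rng))
print(dirichlet_mean(beta_vec), empirical_mean(beta_vec, 5_000, rng))
```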