【DL輪読会】Large Language Models With Contrastive Decoding

October 02, 2025

Text of each slide
1.

DEEP LEARNING JP [DL Papers]
Large Language Models With Contrastive Decoding Algorithm for Hallucination Mitigation in Low-Resource Languages
Animesh Harsh, Matsuo-Iwasawa Lab
http://deeplearning.jp/

2.

Bibliographic Information
• Title: Large Language Models With Contrastive Decoding Algorithm for Hallucination Mitigation in Low-Resource Languages
• Venue: CAAI Transactions on Intelligence Technology, April 2025
• Authors: Zan Hongying, Arifa Javed, Muhammad Abdullah, Javed Rashid, Muhammad Faheem
• TL;DR: Refines contrastive decoding by dynamically weighting expert and amateur LLM outputs to reduce hallucinations in low-resource neural machine translation, achieving high translation accuracy and robustness with advanced models such as ChatGLM and LLaMA.

3.

Background
Key Challenge: Neural machine translation for low-resource languages (e.g., Chinese-Urdu) suffers from hallucinations due to data scarcity.
Main Contributions:
• Dynamically adjusts weights between expert and amateur models
• Reduces hallucinations significantly
• Achieves an improvement of roughly 30 BLEU points over baselines

4.

Background
Current Problems
• LLMs require 5-6 orders of magnitude more data than humans
• Low-resource languages lack sufficient training data
• Hallucinations: translations diverge from the source text
Why It Matters
• Massive computational costs
• Poor performance for underserved languages
• Trust and reliability issues in NMT systems
Research Question: How can we reduce hallucinations in low-resource NMT while improving translation quality?

5.

Background
What are Hallucinations?
Definition: Generated content that appears plausible but is factually incorrect or unrelated to the source text.
Main Causes:
• Insufficient context understanding
• Training data limitations
• Model overfitting
Impact: Undermines trust in NMT systems, which is especially critical for low-resource language pairs.

6.

Method
Methodology Overview
Three-Pronged Approach:
1. Large Language Models: ChatGLM 2-6B, LLaMA 65B, LLaMA 2-7B
2. Refined Contrastive Decoding: dynamic weight adjustment between expert/amateur models
3. Multi-metric Evaluation: BLEU score, semantic similarity, NER accuracy
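As a minimal sketch (not the authors' code), the expert/amateur pair used by the contrastive decoder could be set up with Hugging Face Transformers. The checkpoint names below are illustrative assumptions; in particular, the fine-tuned expert path is hypothetical.

```python
# Sketch only: pair a pre-trained "amateur" LLM with a fine-tuned "expert" copy,
# as described in the methodology overview. Checkpoint names are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

AMATEUR_ID = "meta-llama/Llama-2-7b-hf"   # pre-trained base model (amateur)
EXPERT_ID = "./llama2-7b-zh-ur-expert"    # hypothetical fine-tuned checkpoint (expert)

tokenizer = AutoTokenizer.from_pretrained(AMATEUR_ID)
amateur = AutoModelForCausalLM.from_pretrained(AMATEUR_ID).eval()
expert = AutoModelForCausalLM.from_pretrained(EXPERT_ID).eval()
```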

7.

Method
Experimental Setup
Models
• Amateur: pre-trained LLMs
• Expert: fine-tuned versions
Dataset
Combined Chinese-Urdu corpus: OPUS, WMT, WiLi, custom data
1.16M sentences | 70% train | 15% test | 15% validation
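A minimal sketch, assuming the combined corpus is available as a list of (Chinese, Urdu) sentence pairs, of the 70/15/15 train/test/validation split quoted above; the function name and seed are my own.

```python
import random

def split_corpus(pairs, seed=0):
    """Shuffle the parallel corpus and split it 70% train / 15% test / 15% validation."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.70 * len(pairs))
    n_test = int(0.15 * len(pairs))
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    valid = pairs[n_train + n_test:]  # remaining ~15%
    return train, test, valid

# With the ~1.16M sentence pairs reported on the slide, this yields roughly
# 812k training, 174k test, and 174k validation examples.
```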

8.

Method
Refined CD Algorithm
Key Innovation: Dynamic weight adjustment

$\gamma = 1 - \bigl(\max P_{\text{expert}}(x_i)\bigr)^{\beta}$
$\mathrm{CD}_i = \log P_{\text{expert}}(y_i \mid x) - \gamma \log P_{\text{amateur}}(y_i \mid x)$

• Adjusts amateur model influence based on expert confidence
• Normalizes scores for stable beam search
• Prioritizes high-confidence translations
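A minimal sketch of the scoring rule above for a single decoding step, assuming both models expose next-token logits for the same prefix. The placement of the exponent β and the log-softmax renormalization are my reading of the slide, not a verified reimplementation of the paper.

```python
import torch
import torch.nn.functional as F

def refined_cd_scores(expert_logits: torch.Tensor,
                      amateur_logits: torch.Tensor,
                      beta: float = 0.5) -> torch.Tensor:
    """Refined contrastive-decoding scores over the vocabulary for one step."""
    log_p_expert = F.log_softmax(expert_logits, dim=-1)
    log_p_amateur = F.log_softmax(amateur_logits, dim=-1)

    # Dynamic weight: the more confident the expert is, the smaller gamma,
    # i.e. the less the amateur model penalizes candidate tokens.
    gamma = 1.0 - log_p_expert.exp().max() ** beta

    # CD_i = log P_expert(y_i | x) - gamma * log P_amateur(y_i | x)
    scores = log_p_expert - gamma * log_p_amateur

    # Renormalize so scores from different steps are comparable in beam search.
    return F.log_softmax(scores, dim=-1)
```

A greedy decoder would take `scores.argmax()` at each step; the slide instead feeds the normalized scores into beam search.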

9.

Method
Proposed Model Architecture
Key Innovation: Dynamic weight adjustment
[Architecture diagram shown on the slide; not recoverable from the text extraction]

10.

Experiments
Results - Performance Metrics
Benchmarks and metrics used:
• OPUS: Large multilingual parallel corpus for machine translation evaluation.
• WMT: Standard annual benchmark with diverse language pairs for translation competitions.
• WiLi: Wikipedia-based dataset for multilingual short-text translation and identification.
• Custom Chinese-Urdu Corpus: Web-mined and human-curated, focused on low-resource language evaluation.
• BLEU: Measures n-gram overlap with human references; standard for translation quality.
• METEOR, ROUGE-L, chrF: Additional metrics for fluency, recall, and character-level quality.
• Semantic Similarity: Checks preservation of meaning against the source sentence.
• NER Accuracy: Assesses correct handling of named entities in translation.
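As a rough sketch of how a few of these metrics could be computed for a batch of system outputs: sacrebleu, sentence-transformers, and the multilingual encoder name are my library choices rather than the paper's, and NER accuracy is omitted because the slides do not name a tagger.

```python
import sacrebleu
from sentence_transformers import SentenceTransformer, util

def evaluate(sources, hypotheses, references):
    """Corpus-level BLEU/chrF plus mean source-hypothesis semantic similarity."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score

    # Semantic similarity: cosine similarity between multilingual sentence
    # embeddings of each source and its hypothesis (encoder name is an assumption).
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    src_emb = encoder.encode(sources, convert_to_tensor=True)
    hyp_emb = encoder.encode(hypotheses, convert_to_tensor=True)
    sem_sim = util.cos_sim(src_emb, hyp_emb).diagonal().mean().item()

    return {"BLEU": bleu, "chrF": chrf, "semantic_similarity": sem_sim}
```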

11.

Experiments
Results - Performance Metrics
BLEU scores on the combined dataset, baseline → expert model:
• LLaMA 2-7B: 47.0% → 79.1%
• ChatGLM 2-6B: 46.0% → 76.0%
• LLaMA 65B: 44.0% → 74.0%
[Bar chart: BLEU score (%) per model, baseline vs. expert]
33% fewer tokens needed

12.

Experiments
Hallucination Reduction
Hallucination rate from epoch 0 to epoch 10:
• Baseline: 0.31 → 0.15
• Expert: 0.10 → 0.04
[Line chart: hallucination rate over training progress (epochs 0-10), baseline and expert models]

13.

Experiments
Comparative Analysis

Model                         BLEU Score
Previous SOTA (Transformer)   0.52
CD with M2M                   0.79
Our ChatGLM + CD              0.76
Our LLaMA 65B + CD            0.74
Our LLaMA 2-7B + CD           0.79 (79.17%)

Best performance among all Chinese-Urdu NMT studies

14.

Experiments
Ablation Study Results

Configuration          BLEU (%)   Hallucination Rate
Baseline               48.57      0.31
+ Refined CD           62.34      0.18
+ Fine-tuning          68.45      0.15
+ Multi-metric eval    64.78      0.19
Full model             79.17      0.09

Each component contributes significantly to performance

15.

Summary
• Problem Recognition
  1. Neural machine translation for low-resource languages often suffers from hallucinations, where generated translations diverge from the source content due to insufficient training data and context limitations.
• Key Contributions
  1. Introduces a refined contrastive decoding algorithm that dynamically adjusts the weights of expert and amateur model outputs, significantly reducing hallucinations and improving translation accuracy.
  2. Leverages large language models such as ChatGLM and LLaMA to enhance contextual understanding and cross-lingual translation capabilities.
  3. Uses multiple evaluation metrics (BLEU, semantic similarity, and named entity recognition accuracy) to select the best translation candidates, ensuring both accurate meaning and correct handling of named entities.
• Results
  1. Fine-tuned models show substantial gains in BLEU scores and hallucination reduction compared to baseline and previous models.
  2. The LLaMA 2-7B model achieves a BLEU score of 79.17, demonstrating high translation quality even in low-resource settings.
  3. The proposed technique reduces hallucination rates from 0.31 (baseline) to 0.09 and demonstrates robust improvements across multilingual benchmarks and diverse metrics; the approach consistently improves translation reliability and quality in multi-language, multi-metric settings.

16.

Summary
• Significance
  1. Enables lightweight yet robust agent deployment on resource-constrained edge and mobile devices, improving practicality in real-world applications.
  2. Enhanced generalization reduces translation errors even in dynamic environments where the available APIs and tools frequently change.
• Future Directions
  1. Further strengthen multi-turn and multi-step translation capabilities, expanding model adaptability to complex dialog and multi-instruction scenarios.
  2. Explore methods to support APIs and data sources with limited documentation or scarce contextual clues.
  3. Accelerate practical deployment by integrating with large-scale models in hybrid systems, optimizing both performance and resource efficiency.