A universal framework for offline serendipity evaluation in recommender systems via large language models

November 11, 2025

Slide Overview

Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, and Koki Karube: "A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models," The 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), pp. 5294-5298, November 2025, Seoul, Republic of Korea.

Data Science Research Group, The University of Electro-Communications

Slide Text

A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via LLMs
Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, Koki Karube (The University of Electro-Communications)

Summary
- We propose a universal offline serendipity evaluation framework, independent of datasets and models, by leveraging LLM-as-a-Judge.
- We improve the serendipity prediction accuracy of LLMs through prompt engineering.
- In evaluations with the proposed framework, general RSs sometimes demonstrated higher serendipity performance than serendipity-oriented RSs.

Why Is a Universal Serendipity Evaluation Needed?

Background
- Evaluating serendipity is challenging because of its subjective nature and the lack of ground truth; many previous works therefore rely on offline evaluation metrics.

Existing Offline Approaches
- Custom metrics: depend on hand-set thresholds and offer limited comparability across models and datasets.
- Serendipity-labelled datasets: cover only a few domains and are expensive to expand.

Our Goal
  Method                         Flexibility  Scalability  Generalizability
  Custom metrics                 ✗            ◯            ✗
  Serendipity-labelled datasets  ◯            ✗            ✗
  Ours (LLM-as-a-Judge)          ◯            ◯            ◯

Framework Overview
- Input: a user's last interacted items and a candidate item.
- Output: a serendipity score on a five-level scale (3: neutral).
- Base prompt template: 10-shot (2 examples for each of the 5 score levels).
(A minimal sketch of such a judge follows the slide text.)

Experimental Questions
- EQ1: Which prompt strategies are most effective for the LLM-as-a-Judge system to evaluate serendipity?
- EQ2: How does serendipity performance measured with the proposed framework vary across different RSs and datasets?

Prompt Selection Experiment (EQ1)
- Evaluate the judge's capability on the Serendipity-2018 dataset, which is annotated with serendipity ground truth.
- LLMs: GPT-4o-mini (GPT), Llama-3.1-70B Instruct (Llama).
- Prompt strategies:
  - LS: Likert-scale guidance over five serendipity levels.
  - CoT: chain of thought (relevance → unexpectedness → serendipity).
  - LtM: least-to-most; the task is split into separate sub-prompts, and each answer feeds into the next step.
- Findings: stepwise prompting improves performance. With GPT, CoT keeps MAE under 1.0, letting the judge separate serendipitous from non-serendipitous items (an MAE sketch follows the slide text). → Adopt GPT + CoT for EQ2.

RS Performance Evaluation Experiment (EQ2)
- Datasets: MovieLens-1M, Goodreads, Amazon Beauty (without serendipity labels).
- Compared RSs:
  - General RSs: BPRMF, SASRec.
  - Serendipity-oriented RSs: KFN, UAUM, DESR, PURS.
- Evaluation metrics (a sketch of the serendipity-side metrics follows the slide text):
  Type         Metrics                               Based on
  Accuracy     Precision_acc, NDCG_acc               user behavior
  Serendipity  Precision_ser, NDCG_ser, Avg. Score   proposed framework's scores
- Findings:
  - Reasonable recommendation accuracy can coexist with strong serendipity; RSs such as PURS and UAUM balance both without sacrificing accuracy.
  - No single RS consistently delivers the best serendipity performance across all datasets, underscoring domain-specific behavior.
  - General RSs such as BPRMF occasionally outperform serendipity-oriented RSs on serendipity metrics.
  - Simpler serendipity-oriented RSs (e.g., UAUM) sometimes surpass more complex architectures in serendipity performance.

Future Work
- Improve the proposed evaluation framework.
- Evaluate a broader range of RSs and datasets.
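As a concrete illustration of the framework's judge step, below is a minimal Python sketch of an LLM-as-a-Judge serendipity scorer that uses GPT-4o-mini with a chain-of-thought style prompt (relevance → unexpectedness → serendipity). The prompt wording, the `judge_serendipity` helper, and the zero-shot setup are illustrative assumptions; the slide's actual base template is 10-shot, and the authors' exact prompts are given in the paper.

```python
# Minimal sketch (not the authors' implementation) of the LLM-as-a-Judge step:
# score how serendipitous a candidate item is for a user, given their recent
# interactions, using a chain-of-thought style prompt and GPT-4o-mini.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_PROMPT = """You evaluate recommendations for serendipity.
The user recently interacted with these items:
{history}

Candidate recommendation: {candidate}

Step 1: Judge how relevant the candidate is to the user's interests.
Step 2: Judge how unexpected the candidate is given the interaction history.
Step 3: Combining both, rate the serendipity of this recommendation on a
1-5 scale (1: not serendipitous at all, 3: neutral, 5: highly serendipitous).
End your answer with a line of the form "Score: <integer>"."""


def judge_serendipity(history: list[str], candidate: str) -> int:
    """Return a 1-5 serendipity score for one (user history, candidate) pair."""
    prompt = COT_PROMPT.format(
        history="\n".join(f"- {title}" for title in history),
        candidate=candidate,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Parse the final "Score: <n>" line; fall back to the neutral score 3.
    for line in reversed(text.splitlines()):
        if line.strip().lower().startswith("score:"):
            digits = "".join(ch for ch in line if ch.isdigit())
            if digits:
                return int(digits[0])
    return 3
```

For a MovieLens-style run, `judge_serendipity(["Toy Story", "Finding Nemo"], "Spirited Away")` would return a single 1-5 score for that candidate item.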
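For the EQ1 prompt-selection check, the judge's predicted scores are compared against the Serendipity-2018 ground truth with mean absolute error. A minimal sketch follows, assuming the ground-truth labels are already mapped onto the same 1-5 scale; that mapping is an assumption, not something stated on the slide.

```python
# Minimal MAE sketch for the EQ1-style comparison of judge scores against
# ground-truth serendipity labels (both assumed to be on the same 1-5 scale).
def mean_absolute_error(predicted: list[int], truth: list[int]) -> float:
    """Average absolute gap between predicted and ground-truth scores."""
    assert len(predicted) == len(truth) and truth
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Toy example: an MAE of 0.6 would clear the "MAE under 1.0" bar noted on the slide.
print(mean_absolute_error(predicted=[4, 1, 5, 2, 2], truth=[5, 1, 4, 2, 3]))  # 0.6
```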
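For EQ2, the per-item judge scores are aggregated into the serendipity-side metrics listed above. The sketch below is one plausible reading, assuming an item counts as serendipitous when its score is 4 or higher (for Precision_ser) and that the raw 1-5 score serves as graded relevance in NDCG_ser; the exact definitions are in the paper, not on the slide.

```python
# Plausible (assumed) aggregation of 1-5 judge scores into serendipity metrics.
import math


def precision_ser(scores: list[int], k: int, threshold: int = 4) -> float:
    """Fraction of the top-k items judged serendipitous (score >= threshold)."""
    return sum(s >= threshold for s in scores[:k]) / k


def ndcg_ser(scores: list[int], k: int) -> float:
    """NDCG@k using the 1-5 judge score as graded relevance."""
    dcg = sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    ideal = sorted(scores, reverse=True)[:k]
    idcg = sum(s / math.log2(i + 2) for i, s in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def avg_score(scores: list[int]) -> float:
    """Mean judge score over a recommendation list."""
    return sum(scores) / len(scores)


# Judge scores for a ranked top-5 list produced by some RS:
scores = [4, 2, 5, 3, 1]
print(precision_ser(scores, k=5))  # 0.4
print(ndcg_ser(scores, k=5))
print(avg_score(scores))           # 3.0
```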