A universal framework for offline serendipity evaluation in recommender systems via large language models

November 11, 2025

Slide Overview

Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, and Koki Karube: "A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models," The 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), pp. 5294-5298, November 2025, Seoul, Republic of Korea.

Data Science Research Group, The University of Electro-Communications

Slide Text

A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via LLMs
Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, Koki Karube (The University of Electro-Communications)

Summary
- We propose a universal offline serendipity evaluation framework, independent of datasets and models, by leveraging LLM-as-a-Judge.
- We improve the serendipity prediction accuracy of LLMs through prompt engineering.
- In evaluations with the proposed framework, general RSs sometimes demonstrated higher serendipity performance than serendipity-oriented RSs.

Why Is a Universal Serendipity Evaluation Needed?

Background
- Evaluating serendipity is challenging because of its subjective nature and the lack of ground truth; many previous works therefore rely on offline evaluation metrics.

Existing Offline Approaches
- Custom metrics: depend on hand-set thresholds and offer limited comparability across models and datasets.
- Serendipity-labelled datasets: cover only a few domains and are expensive to expand.

Our Goal
  Method                         Flexibility  Scalability  Generalizability
  Custom metrics                 ✗            ◯            ✗
  Serendipity-labelled datasets  ◯            ✗            ✗
  Ours (LLM-as-a-Judge)          ◯            ◯            ◯

Framework Overview
- Input: a user's last interacted items and a candidate item.
- Output: a serendipity score on a five-level scale (3: neutral).
- Base prompt template: 10-shot (2 examples for each of the 5 score levels).
(A minimal sketch of such a judge follows the slide text.)

Experimental Questions
- EQ1: Which prompt strategies are most effective for the LLM-as-a-Judge system to evaluate serendipity?
- EQ2: How does serendipity performance measured with the proposed framework vary across different RSs and datasets?

Prompt Selection Experiment (EQ1)
- Evaluate the judge's capability on the Serendipity-2018 dataset, which is annotated with serendipity ground truth.
- LLMs: GPT-4o-mini (GPT), Llama-3.1-70B Instruct (Llama).
- Prompt strategies:
  - LS: Likert-scale guidance over five serendipity levels.
  - CoT: chain of thought (relevance → unexpectedness → serendipity).
  - LtM: least-to-most; the task is split into separate sub-prompts, and each answer feeds into the next step.
- Findings: stepwise prompting improves performance. With GPT, CoT keeps MAE under 1.0, letting the judge separate serendipitous from non-serendipitous items (an MAE sketch follows the slide text). → Adopt GPT + CoT for EQ2.

RS Performance Evaluation Experiment (EQ2)
- Datasets: MovieLens-1M, Goodreads, Amazon Beauty (without serendipity labels).
- Compared RSs:
  - General RSs: BPRMF, SASRec.
  - Serendipity-oriented RSs: KFN, UAUM, DESR, PURS.
- Evaluation metrics (a sketch of the serendipity-side metrics follows the slide text):
  Type         Metrics                               Based on
  Accuracy     Precision_acc, NDCG_acc               user behavior
  Serendipity  Precision_ser, NDCG_ser, Avg. Score   proposed framework's scores
- Findings:
  - Reasonable recommendation accuracy can coexist with strong serendipity; RSs such as PURS and UAUM balance both without sacrificing accuracy.
  - No single RS consistently delivers the best serendipity performance across all datasets, underscoring domain-specific behavior.
  - General RSs such as BPRMF occasionally outperform serendipity-oriented RSs on serendipity metrics.
  - Simpler serendipity-oriented RSs (e.g., UAUM) sometimes surpass more complex architectures in serendipity performance.

Future Work
- Improve the proposed evaluation framework.
- Evaluate a broader range of RSs and datasets.
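As a concrete illustration of the framework's judge step, below is a minimal Python sketch of an LLM-as-a-Judge serendipity scorer that uses GPT-4o-mini with a chain-of-thought style prompt (relevance → unexpectedness → serendipity). The prompt wording, the `judge_serendipity` helper, and the zero-shot setup are illustrative assumptions; the slide's actual base template is 10-shot, and the authors' exact prompts are given in the paper.

```python
# Minimal sketch (not the authors' implementation) of the LLM-as-a-Judge step:
# score how serendipitous a candidate item is for a user, given their recent
# interactions, using a chain-of-thought style prompt and GPT-4o-mini.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_PROMPT = """You evaluate recommendations for serendipity.
The user recently interacted with these items:
{history}

Candidate recommendation: {candidate}

Step 1: Judge how relevant the candidate is to the user's interests.
Step 2: Judge how unexpected the candidate is given the interaction history.
Step 3: Combining both, rate the serendipity of this recommendation on a
1-5 scale (1: not serendipitous at all, 3: neutral, 5: highly serendipitous).
End your answer with a line of the form "Score: <integer>"."""


def judge_serendipity(history: list[str], candidate: str) -> int:
    """Return a 1-5 serendipity score for one (user history, candidate) pair."""
    prompt = COT_PROMPT.format(
        history="\n".join(f"- {title}" for title in history),
        candidate=candidate,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Parse the final "Score: <n>" line; fall back to the neutral score 3.
    for line in reversed(text.splitlines()):
        if line.strip().lower().startswith("score:"):
            digits = "".join(ch for ch in line if ch.isdigit())
            if digits:
                return int(digits[0])
    return 3
```

For a MovieLens-style run, `judge_serendipity(["Toy Story", "Finding Nemo"], "Spirited Away")` would return a single 1-5 score for that candidate item.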
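For the EQ1 prompt-selection check, the judge's predicted scores are compared against the Serendipity-2018 ground truth with mean absolute error. A minimal sketch follows, assuming the ground-truth labels are already mapped onto the same 1-5 scale; that mapping is an assumption, not something stated on the slide.

```python
# Minimal MAE sketch for the EQ1-style comparison of judge scores against
# ground-truth serendipity labels (both assumed to be on the same 1-5 scale).
def mean_absolute_error(predicted: list[int], truth: list[int]) -> float:
    """Average absolute gap between predicted and ground-truth scores."""
    assert len(predicted) == len(truth) and truth
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Toy example: an MAE of 0.6 would clear the "MAE under 1.0" bar noted on the slide.
print(mean_absolute_error(predicted=[4, 1, 5, 2, 2], truth=[5, 1, 4, 2, 3]))  # 0.6
```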
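For EQ2, the per-item judge scores are aggregated into the serendipity-side metrics listed above. The sketch below is one plausible reading, assuming an item counts as serendipitous when its score is 4 or higher (for Precision_ser) and that the raw 1-5 score serves as graded relevance in NDCG_ser; the exact definitions are in the paper, not on the slide.

```python
# Plausible (assumed) aggregation of 1-5 judge scores into serendipity metrics.
import math


def precision_ser(scores: list[int], k: int, threshold: int = 4) -> float:
    """Fraction of the top-k items judged serendipitous (score >= threshold)."""
    return sum(s >= threshold for s in scores[:k]) / k


def ndcg_ser(scores: list[int], k: int) -> float:
    """NDCG@k using the 1-5 judge score as graded relevance."""
    dcg = sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    ideal = sorted(scores, reverse=True)[:k]
    idcg = sum(s / math.log2(i + 2) for i, s in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def avg_score(scores: list[int]) -> float:
    """Mean judge score over a recommendation list."""
    return sum(scores) / len(scores)


# Judge scores for a ranked top-5 list produced by some RS:
scores = [4, 2, 5, 3, 1]
print(precision_ser(scores, k=5))  # 0.4
print(ndcg_ser(scores, k=5))
print(avg_score(scores))           # 3.0
```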