>100 Views
June 13, 24
スライド概要
Presenting our work "Diversity Maximization in the Presence of Outliers" at AAAI2023.
I'm a DB and AI researcher.
DIVERSITY MAXIMIZATION IN THE PRESENCE OF OUTLIERS Daichi Amagata Osaka University
BACKGROUND: MAKING A SUMMARY OF A DATASET This dataset (𝑋) is too large for my application (or resource)… I wanna reduce the size to 𝑘. 𝑋: a set of 𝑛 points 𝑆 𝑆: a set of 𝑘 (≪ 𝑛) points • Sampling: Quite easy & the best if the data distribution should be preserved (but may be skewed). • Submodular maximization: Users must specify an appropriate function (which may not be trivial). • Diversity maximization: Easy to specify & the summary can avoid skewness. #1
BACKGROUND: DIVERSITY MAXMIZATION ∗ 𝑆 = arg max𝑆⊆𝑋, 𝑆 =𝑘 𝒇 𝑺 ′ MAX-MIN: 𝑓 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ 𝑥,𝑥 ∈𝑆 It’s diverse! MAX-SUM: 𝑓 𝑆 = σ𝑥,𝑥 ′ ∈𝑆 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ It’s skewed… #2
BACKGROUND: DIVERSITY MAXMIZATION + OUTLIERS ∗ 𝑆 = arg max𝑆⊆𝑋, 𝑆 =𝑘 𝒇 𝑺 ′ MAX-MIN: 𝑓 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ 𝑥,𝑥 ∈𝑆 It’s diverse! #3
PRELIMINARY Problem Definition Input: 𝑋 = 𝑋𝑖𝑛 ∪ 𝑋𝑜𝑢𝑡 (a set of 𝑛 points in a metric space) and 𝑘 ≥ 2 (output size) ′ Output: 𝑆 ∗ = arg max𝑆⊆𝑋∖𝑿𝒐𝒖𝒕, 𝑆 =𝑘 min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ▷ NP-hard 𝑥,𝑥 ∈𝑆 Assumptions Same as 𝑘-clustering Outliers are far from the other points. 1) 𝑋𝑜𝑢𝑡 = 𝑧 ′ 2) Define 𝑑𝑖𝑠𝑡 𝑥, 𝑋 = min 𝑑𝑖𝑠𝑡(𝑥, 𝑥 ). For each 𝑥 ∈ 𝑋𝑜𝑢𝑡 , we have ′ 𝑥 ∈𝑋 for every 𝑥 ′ ∈ 𝑋𝑖𝑛 • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝑑𝑖𝑠𝑡 𝑥 ′ , 𝑋 ∖ 𝑥 ′ • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 ′ > 𝛼𝑙 ∗ , where 𝛼 ≥ 1 and 𝑙 ∗ = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ∗ 𝑥,𝑥 ∈𝑆 #4
SOTA ALGO. FOR MAX-MIN DIV. W/O OUTLIERS IS NOT EFFECTIVE GMM [1]: A state-of-the-art 𝑂(𝑛𝑘) time ½-approximation algorithm if 𝑋𝑜𝑢𝑡 = ∅ 1) Add a random point in 𝑋 into 𝑆 ▷ 𝑂 1 time 2 to 𝑘) Add 𝑥 ∗ = arg max𝑥∈𝑋∖𝑆 𝑑𝑖𝑠𝑡 𝑥, 𝑆 into 𝑆 ▷ 𝑂 𝑛(𝑘 − 1) time 4 It’s diverse! 7 3 It’s dirty… 10 1 8 6 9 2 5 [1] Ravi, S. S.; Rosenkrantz, D. J.; and Tayi, G. K. 1994. Heuristic and Special Case Algorithms for Dispersion Problems. Operations Research, 42(2): 299–310. #5
CONTRIBUTION TO THE PROBLEM OF MAX-MIN DIV. w/ OUTLIERS Challenge Does there exist an algorithm that can return 𝑆 containing no outliers in time proportional to 𝑛, 𝑘, and 𝑧? Findings We prove that there exists an • • 𝑂 𝑘 + 𝑧 𝑛 time ½-approximation algorithm that returns no outliers 1 𝑂 𝑘𝑧 time -approximation algorithm that returns no outliers with a constant probability 6 1+𝜖 Experiments We demonstrate that • Existing Max-Min diversification algorithms’ solutions contain most outliers. • Our algorithms are much (e.g., four orders of magnitude) faster than a baseline algorithm. #6
BASELINE ALGORITHM: 𝑶(𝒏𝟐 ) TIME, ½ -APPROX., & NO OUTLIERS Operations 1) Compute the NN (nearest neighbor) of each 𝑥 ∈ 𝑋. ▷ 𝑶 𝒏𝟐 time 2) Remove 𝑧 points with the largest NN distance. ▷ 𝑂 𝑧 time 3) Run GMM on the remaining points. ▷ 𝑂 𝑛 − 𝑧 𝑘 time Assumptions The 𝑧 points with the largest distance to the nearest neighbor are outliers. 1) 𝑋𝑜𝑢𝑡 = 𝑧 ′ 2) Define 𝑑𝑖𝑠𝑡 𝑥, 𝑋 = min 𝑑𝑖𝑠𝑡(𝑥, 𝑥 ). For each 𝑥 ∈ 𝑋𝑜𝑢𝑡 , we have ′ 𝑥 ∈𝑋 for every 𝑥 ′ ∈ 𝑋𝑖𝑛 • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝑑𝑖𝑠𝑡 𝑥 ′ , 𝑋 ∖ 𝑥 ′ • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝛼𝑙 ∗ , where 𝛼 ≥ 1 and 𝑙 ∗ = 𝑑𝑖𝑣 𝑆 ∗ #7
OUTLIER-AWARE GREEDY ALGORITHM Lemma 1 𝑆 ′ ← GMM 𝑋, 𝑘 + 𝑧 contains the 𝑧 outliers. Operations 1) 𝑆 ′ ← GMM 𝑋, 𝑘 + 𝑧 ▷ 𝑂 𝑛(𝑘 + 𝑧) time 2) Compute the NN of each 𝑥 ∈ 𝑆′ ▷ 𝑂 𝑛(𝑘 + 𝑧) time 3) 𝑍 ← 𝑧 points with the largest NN dist. 4) 𝑆 ← GMM(𝑋 ∖ 𝑍, 𝑘) ▷ 𝑂 𝑧 time ▷ 𝑂 𝑛 − 𝑧 𝑘 time Result of GMM 𝑋, 𝑘 + 𝑧 Theorem 1 GREEDY returns a ½-approx. solution with no outliers in 𝑂(𝑛(𝑘 + 𝑧)) time. #8
STREAMING ALGORITHM መ 𝑙/2 𝑂( 𝑋 ′ 𝑘) time 𝑂(1) iterations because 𝐿 = 𝑂 1 Lemma 2 STREAMING(𝑋 , 𝑘) runs in 𝑂( 𝑋 𝑘) time and returns a ′ ′ no outliers with probability 1 − * 𝑙∗ = 𝑓 𝑆 ∗ = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ′ ∗ 𝑥,𝑥 ∈𝑆 𝑧 𝑋′ if 𝛼 ≥ 2. 1 2 1+𝜖 -approx. solution with #9
CORESET FOR MAX-MIN DIVERSIFICATION WITH OUTLIERS Definition of Coreset 𝐶 ⊆ 𝑋 is a 𝛽-coreset, if, for 𝑆 ⊆ 𝐶, we have 𝑑𝑖𝑣 𝑆 ≥ 𝑑𝑖𝑣 𝑆 ∗ 𝛽 . Lemma 3 𝐶 ← GMM 𝑋, 𝑘 + 𝑧 is a 3-coreset (it yields the anti-cover property even in the presence of outliers). Theorem 2 (main result) Built in an offline phase 1 STREAMING(𝐶, 𝑘), where 𝐶 = 𝑂 𝑧 ≥ 𝑘 + 𝑧, runs in 𝑂(𝑘𝑧) time and returns a -approx. 6 1+𝜖 solution with no outliers with a constant probability if 𝛼 ≥ 2 (1 − 𝐶𝑧 = 𝑂 1 ⟺ 𝐶 = 1−𝑂𝑧 1 ). We can make a summary 𝑺 of 𝑿 with online time independent of 𝒏. # 10
EXPERIMENT: COMPARISON WITH COMPETITORS (𝒌 = 𝟏𝟎𝟎, 𝒛 = 𝟐𝟎𝟎) Table 1: Average number of outliers in 𝑆 𝑆 of the existing algorithms (GMM & PODS19) is dominated by outliers… Table 2: Average 𝑑𝑖𝑣 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ and running time [sec] (no outliers are returned) ′ 𝑥,𝑥 ∈𝑆 Our algorithms are faster than BASELINE. Especially, CORESET is super fast. Table 3: Average time to identify 𝑧 outliers [sec] GREEDY identifies the outliers much faster than BASELINE. * FCT: 10-dim & 𝑛 = 580,812, Household: 7-dim & 𝑛 = 2,049,280, KDD99: 16-dim & 𝑛 = 311,029, Mirai: 115-dim & 𝑛 = 764,137 # 11
EXPERIMENT: IMPACTS OF 𝒌 & 𝒛 # 12
CONCLUSION Problem: First study on Max-Min diversification with outliers Technical Contribution: We proposed • GREEDY: 𝑂 𝑘 + 𝑧 𝑛 time ½-approx. algorithm that returns no outliers • CORESET: 𝑂 𝑘𝑧 time 1/6 1 + 𝜖 -approx. algorithm that returns no outliers with a constant probability Experiments: We demonstrated that • Existing Max-Min diversification algorithms’ solutions contain most outliers. • Our algorithms are much (e.g., four orders of magnitude) faster than a baseline algorithm. Source code available at https://github.com/amgt-d1/Max-Min-w-Outliers # 13