>100 Views
June 13, 24
スライド概要
Presenting our work "Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach" at SIGMOD2021.
I'm a DB and AI researcher.
Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach Daichi Amagata1,2, Makoto Onizuka1, Takahiro Hara1 1. Osaka University 2. JST PRESTO
PROPBLEM DEFINITION: DOD (DISTANCE-BASED OUTLIER DETECTION) No. 1 • 𝑃: a set of 𝑛 objects 𝑝𝑖 . • NEIGHBORS OF 𝒑𝒊 : objects 𝑝𝑗 ∈ 𝑃 𝑠. 𝑡. 𝑑𝑖𝑠𝑡 𝑝𝑖 , 𝑝𝑗 ≤ 𝒓, where 𝑑𝑖𝑠𝑡(⋅,⋅) is a metric function. • DISTANCE-BASED OUTLIERS: objects with less than 𝒌 neighrbors. Detect all of them - Exactly (no false positive & no false negative), - Quickly, and - With a linear space to 𝒏. Example) 𝑘 = 5 & Euclidean space: 𝑟 𝑟
MOTIVATION No. 2 Application View Noise Removal • Data cleaning, Machine learning, and etc. Abnormal phenomena detection • Security check, Medical diagnosis, and etc. They deal with big data. 1. DOD should scale to 𝒏. They deal with variety type of data. 2. DOD should deal with metric spaces. 3. DOD should be robust to dimensionality. Technical View • Nested-loop approach [1] incurs 𝑂 𝑛2 time, which does not scale to big data. • Tree-based algorithms are not efficient for high-dimensional data. • Existing algorithms [2,3] do not solve the above issues. [1] Mining Distance-based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule, In KDD, 2003. [2] Mining Distance-based Outliers from Large Datasets in any Metric Space, In KDD, 2006. [3] Dolphin: An Efficient Algorithm for Mining Distance-based Outliers in Very Large Datasets, TKDD, 2009.
CONTRIBUTION No. 3 1. New algorithm for the DOD problem • It can run in any metric spaces with 𝑂 𝑓 + 𝑡 𝑛 time. • 𝑓 and 𝑡 are #false-positives and #outliers, respectively • It is robust to high-dimensional spaces. • It is easy to parallelize (all you need is ”parallel for”). 2. New proximity graph • It can reduce 𝑓. • It needs 𝑂 𝑛 space and can be built in 𝑂 𝑛 time for a fixed degree. 3. Empirical evaluation • 7 real datasets with some distance functions. • Our algorithm is quite faster than any others. Table 1: Running time [sec] (some examples) Dataset SNIF [2] DOLPHIN [3] Ours Glove 1222.43 9277.89 2.63 HEPMASS 20360.80 NA 38.40
MAIN IDEA 1 No. 4 Solving the RANGE COUNTING PROBLEM* quickly *The problem that returns the number of objects 𝑝′ s.t. 𝑑𝑖𝑠𝑡 𝑞, 𝑝′ ≤ 𝑟, where 𝑞 is a query. NEIGHBORS OF 𝑝𝑖 : objects 𝑝𝑗 ∈ 𝑃 𝑠. 𝑡. 𝑑𝑖𝑠𝑡 𝑝𝑖 , 𝑝𝑗 ≤ 𝑟. OUTLIERS: objects with less than 𝑘 neighrbors. We just need to know whether “𝒌 objects in this space?” or not. #outliers is usually small (≈ 𝟎. 𝟎𝟏𝒏) -> It is important to quickly filter non-outliers (inliers).
MAIN IDEA 2 No. 5 Solving the RANGE COUNTING PROBLEM* quickly *The problem that returns the number of objects 𝑝′ s.t. 𝑑𝑖𝑠𝑡 𝑞, 𝑝′ ≤ 𝑟, where 𝑞 is a query. We want to access only neighbors for each 𝑝 ∈ 𝑃. We need a data structure that can prune far objects. A proximity graph is built offline. • Any metric space is available. • Being robust to dimensionality.
OVERVIEW OF OUR DOD ALGORITHM No. 6 Filter-and-Verification framework is employed. Filter inliers with 𝑂 𝑘 time. Verify outliers by linear scan, i.e., with 𝑂 𝑛 time. Filtering phase (greedy algorithm): 1. Run breadth-first-search from 𝑝, and if we have 𝑑𝑖𝑠𝑡 𝑝, 𝑝′ ≤ 𝑟 for an accessing object 𝑝′, 1. Increment count by 1. If count = 𝒌, BFS is terminated. 2. Each object is evaluated independently. -> It is parallelizable. 2. Add 𝑝′ into a queue. Repeat traversing objects until queue becomes empty. Verification phase (linear scan): 1. For each object with count < 𝑘, run a linear scan.
CHALLENGE OF OUR ALGORITHM No. 7 Time complexity of our algorithm: 𝑶 𝒇+𝒕 𝒏 𝑓: #false positives, 𝑡: #outliers Verification incurs the main cost. Filtering phase (greedy algorithm) • Outliers always have count < 𝑘 -> No false negatives. • Inliers may have count < 𝑘 -> False positives. Eliminating false positives theoretically needs Ω 𝑛2 offline time or 𝑂 𝑛2 space. There are neighbors that are not reachable from a given object. We develop a new proximity graph that can reduce 𝑓 in “as much as possible manner” with 𝑂(𝑛) offline time and space.
MRPG: METRIC RANDOMIZED PRPXIMITY GRAPH AKNNG: Approximate KNN Graph (red nodes are pivots.) SC-AKNNG: Stronglyconnected AKNNG SC-AKNNG + Monotonic paths from random objects No. 8 SC-AKNNG + Monotonic paths from random objects + removing unnecessary links MRPG Monotonic path: sequence of nodes 𝑝, 𝑝𝑖 , 𝑝𝑖+1, … 𝑝𝑖+𝑗 such that 𝑑𝑖𝑠𝑡 𝑝, 𝑝𝑖 ≤ 𝑑𝑖𝑠𝑡(𝑝, 𝑝𝑖+𝑗 ) for 𝑗 ∈ 1, 𝑚 MSG: graph with at least a monotonic path between arbitrary two nodes <- 𝑂(𝑛2(𝐾 + log 𝑛)) time to build MRPG: approximation of MSG -> 𝑶 𝒏𝑲𝟐 𝐥𝐨𝐠 𝑲𝟑 = 𝑶(𝒏) time for a fixed 𝐾 (= initial degree)
STEP 1: NNDESCENT+ AKNNG: Approximate KNN Graph (red nodes are pivots.) SC-AKNNG: Stronglyconnected AKNNG No. 9 SC-AKNNG + Monotonic paths from random objects -> Creating links between the most similar objects -> high reachability to neighbors NNDESCENT [WWW’11] SC-AKNNG + Monotonic paths from random objects + removing unnecessary links NNDESCENT+: 𝑂(𝑛𝐾 2 log 𝐾) time AKNNG building 1. Init AKNNs by creating random K links for each 𝑝 ∈ 𝑃. 1. Init AKNNs by a VP-tree to cluster similar objects. 2. Update AKNNs of 𝑝 iff there exist 𝑝′ with smaller 2. Do the same as NNDESCENT while ignoring non-update nodes. distances than the current AKNNs within 2-hop from 𝑝. 3. After convergence, retrieve the exact K’NN (K’ > K) for 3. Repeat the above until no update occurs. nodes with large distances to their AKNNs.
STEP 2: CONNECT-SUBGRAPHS AKNNG: Approximate KNN Graph (red nodes are pivots.) SC-AKNNG: Stronglyconnected AKNNG No. 10 SC-AKNNG + Monotonic paths from random objects SC-AKNNG + Monotonic paths from random objects + removing unnecessary links -> Improving connectivity to achieve high reachability to neighbors 𝑝 1. Run BFS (breadth-first-search) from a random pivot 𝑝. <- 𝑂(𝑛𝐾) time 2. Iff another pivot 𝑝′ is not traversed, do ANN search on accessed objects. <- 𝑂 𝐾 time - Let 𝑝∗ be the ANN search result, and create a link between 𝑝′ and 𝑝∗ . <- 𝑂 1 time 3. Re-start the BFS from 𝑝′ , and repeat the above operations. 𝑂 𝑛𝐾 time
STEP 3: REMOVE-DETOURS AKNNG: Approximate KNN Graph (red nodes are pivots.) SC-AKNNG: Stronglyconnected AKNNG No. 11 SC-AKNNG + Monotonic paths from random objects SC-AKNNG + Monotonic paths from random objects + removing unnecessary links -> Improving high reachability to neighbors 1. Run “3-hop” BFS from a random object 𝑝. <- 𝑂 𝐾 3 log 𝐾 3 × 𝑂(𝑛/𝐾) time 2. Run “2-hop” BFS from random pivots that are close to 𝑝. <- 𝑂 𝐾 2 log 𝐾 2 × 𝑂 𝐾 time 3. Create necessary links if no monotonic paths from 𝑝 to the accessed objects. <- 𝑂 𝑛𝐾 time 4. Do 1 to 3 𝑂 𝑛 𝐾 objects. 𝑂 𝑛𝐾 2 log 𝐾 3 time
STEP 4: REMOVE-LINKS AKNNG: Approximate KNN Graph (red nodes are pivots.) SC-AKNNG: Stronglyconnected AKNNG No. 12 SC-AKNNG + Monotonic paths from random objects SC-AKNNG + Monotonic paths from random objects + removing unnecessary links MRPG 1. If objects 𝑝𝑖 and 𝑝𝑗 s.t. 𝑝𝑖 , 𝑝𝑗 ∈ 𝐸 have links to a common pivot 𝑝, remove 𝑝𝑖 , 𝑝𝑗 . Note: Steps 1 to 4 resp. need 𝑂(𝑛𝐾 2 log 𝐾), 𝑂(𝑛𝐾), 𝑂(𝑛𝐾 2 log 𝐾 3 ), and 𝑂(𝑛𝐾) time. -> For a fixed 𝑲, we need 𝑶(𝒏) time to build a MRPG. ← 𝑂(𝑛𝐾) time
EXPERIMENTAL SETTING No. 13 Evaluated algorithms: • State-of-the-art: Nested-loop [KDD’03], SNIF [KDD’06], DOLPHIN [TKDD’09], and VP-tree [SODA’93]. • Our algorithm with proximity graphs: NSW [IS’14], KGraph [WWW’11], MRPG. • For MPRG, our algorithm is optimized to it. Table 2: Dataset, statistics, and default parameters Dataset Cardinality Dimensionality Distance function 𝑟 𝑘 Outlier ratio [%] Deep 10,000,000 96 L2 norm 0.93 50 0.62 Glove 1,193,514 25 Angular distance 0.25 20 0.55 HEPMASS 7,000,000 27 L1 norm 15 50 0.65 MNIST 3,000,000 784 L4 norm 600 50 0.34 PAMAP2 2,844,868 51 L2 norm 50,000 100 0.61 SIFT 1,000,000 128 L2 norm 320 40 1.04 Words 466,551 1 - 45 Edit distance 5 15 4.16
EXPERIMENTAL RESULT OF OFFLINE TIME & INDEX SIZE Table 3: Offline time [sec] with 48 threads Table 4: Index size [MB] Dataset NSW KGraph MRPG Dataset SNIF DOLPHIN VP-tree NSW KGraph MRPG Deep NA 20079.80 17230.30 Deep NA NA 324.35 NA 5516.58 7350.83 Glove 2333.47 923.83 791.53 Glove 13.26 69.14 54.86 186.62 460.48 438.76 HEPMASS NA 7935.25 5221.86 HEPMASS 61.04 NA 265.39 NA 2188.65 2450.84 MNIST NA 7217.84 6850.54 MNIST NA NA 117.80 NA 589.08 591.27 PAMAP2 4522.14 1325.40 888.61 PAMAP2 18.36 65.12 128.00 819.17 553.87 760.69 SIFT 4910.94 929.48 817.33 SIFT 8.10 47.00 39.99 157.58 433.48 489.14 Words 871.27 455.15 868.62 Words 4.41 26.86 27.79 102.20 191.73 178.74 30000 2500 16000 20000 NSW 15000 10000 5000 Offline time [sec] Offline time [sec] 20000 2000 1500 1000 500 Offline time [sec] 14000 25000 Offline time [sec] No. 14 12000 10000 8000 6000 4000 16000 KGraph MRPG 12000 8000 4000 2000 0 0 0.2 0.4 0.6 0.8 Sampling rate (Deep) 1 0 0.2 0.4 0.6 0.8 Sampling rate (Glove) 1 0 0.2 0.4 0.6 0.8 Sampling rate (HEPMASS) 1 0.2 0.4 0.6 0.8 Sampling rate (MNIST) 1
EXPERIMENTAL RESULT OF ONLINE TIME (OVERVIEW) No. 15 Table 5: Online time [sec]. 48 (resp. 12) threads on Deep and MNIST (resp. the others) Dataset Nested-loop SNIF DOLPHIN VP-tree NSW KGraph MRPG Deep NA NA NA NA NA 8616.10 1966.17 Glove 1045.47 1222.43 9277.89 1398.92 147.00 83.32 2.63 HEPMASS 17925.40 20360.80 NA 8597.23 NA 780.19 38.40 MNIST NA NA NA NA NA 4983.33 3435.29 PAMAP2 422.40 1213.56 1819.90 208.55 22.16 23.77 10.55 SIFT 1427.74 1507.58 8684.08 2005.27 200.89 175.88 11.32 Words 1844.65 2086.50 7061.50 1021.39 498.34 393.95 2.67 Winner is our algorithm + MRPG: • x21~ (Deep), x398 (Glove), x224 (HEPMASS), x13~ (MNIST), x20 (PAMAP2), x126 (SIFT), and x383 (Words) times faster than state-of-the-art. Our approach “filter-and-verification” is faster than the state-of-the-art: • NSW and KGraph are also faster than the state-of-the-art.
EXPERIMENTAL RESULT OF ONLINE TIME (IMPACT OF PARAMETERS) No. 16 Scalability to 𝒏: 6000 4000 2000 Online time [sec] MRPG 120 100 80 60 40 20 0 0.4 0.6 0.8 Sampling rate (Deep) 1 5000 800 700 600 500 400 300 200 4000 3000 2000 1000 100 0 0 0.2 6000 900 140 KGraph 8000 Online time [sec] Online time [sec] 1000 160 NSW Online time [sec] 10000 0.2 0.4 0.6 0.8 Sampling rate (Glove) 1 0 0.2 0.4 0.6 0.8 Sampling rate (HEPMASS) 1 0.2 0.4 0.6 0.8 1 50 55 60 Sampling rate (MNIST) Robustness to outlier rate by varying 𝒌: 250 8000 6000 4000 2000 0 150 100 50 0 40 45 50 k (Deep) 55 60 7000 1000 200 Online time [sec] Online time [sec] Online time [sec] 10000 8000 1200 Online time [sec] 12000 800 600 400 200 15 20 k (Glove) 25 30 5000 4000 3000 2000 1000 0 0 10 6000 40 45 50 k (HEPMASS) 55 60 40 45 k (MNIST)
CONCLUSION New solution for the distance-based outlier detection problem: • A filter-and-verification framework on a proximity graph. • MRPG: Metric Randomized Proximity Graph. Theoretical/Empirical evaluation of offline/online time and space cost: • Offline time: 𝑂(𝑛) for a fixed degree. • Online time: 𝑂 𝑓 + 𝑡 𝑛 ≪ 𝑂 𝑛2 . • Space cost: 𝑂 𝑛 for a fixed degree. • Our solution is much faster than existing state-of-the-art. No. 17