Random Sampling over Spatial Range Joins@ICDE2025

541 Views

May 20, 25

#空間範囲結合 #ランダムサンプリング #アルゴリズム #効率化 #空間データベース

スライド概要

Daichi Amagata

@8828547526

スライド一覧

I'm a DB and AI researcher.

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

ダウンロード

関連スライド

Fair Clustering & Search 2

Daichi Amagata 802

Independent Range Sampling on Interval Data@ICDE2024

Daichi Amagata 533

2022-DBS-amagata

Daichi Amagata 395

AIP加速課題成果報告@WebDB夏のワークショップ2025

Daichi Amagata 274

Diversity Maximization in the Presence of Outliers@AAAI2023

Daichi Amagata 246

Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach@SIGMOD2021

Daichi Amagata 166

各ページのテキスト

Random Sampling over Spatial Range Joins Daichi Amagata The University of Osaka Japan

INTRODUCTION: SPATIAL RANGE JOINS No. 1 𝑱=𝑹⋈𝑺  𝑅 (𝑆) is a set of 𝑛 (𝑚) geo-spatial (2-dimensional) points.  A point 𝑟 ∈ 𝑅 joins points 𝑠 ∈ 𝑆 iff 𝑤 𝑟 ∩ 𝑠 (𝑠 is in the orthogonal range 𝑤 𝑟 , i.e., we have  Many practical applications: ). 𝑤 𝑟 Geographical information systems, location-based services, neuroscience, clustering, visualization, etc…

INTRODUCTION: TOO LARGE RESULT ISSUE No. 2 𝑱=𝑹⋈𝑺  𝑅 (𝑆) is a set of 𝑛 (𝑚) geo-spatial (2-dimensional) points.  A point 𝑟 ∈ 𝑅 joins points 𝑠 ∈ 𝑆 iff 𝑤 𝑟 ∩ 𝑠 (𝑠 is in the orthogonal range 𝑤 𝑟 ).  𝑱 = 𝑶(𝒏𝒎) is usually huge, e.g., “> 1012 ” even when 𝑛 and 𝑚 are million-scales (moderate size nowadays). ◼ Too large for data analysis and computational resources (e.g., RAM).

INTRODUCTION: RANDOM SAMPLING No. 3 Getting 𝒕 uniform random samples of 𝑱 = 𝑹 ⋈ 𝑺  𝑅 (𝑆) is a set of 𝑛 (𝑚) geo-spatial points.  A point 𝑟 ∈ 𝑅 joins points 𝑠 ∈ 𝑆 iff 𝑤 𝑟 ∩ 𝑠 (𝑠 is in the orthogonal range 𝑤 𝑟 ).  𝐽 = 𝑂(𝑛𝑚) is usually huge, e.g., “> 1012 ” even when 𝑛 and 𝑚 are million-scales (moderate size nowadays).  𝒕 is an application-dependent parameter, and random join samples (i) can fit into the main memory and (ii) are sufficient for statistics calculation [1] and ML training data [2]. [1] Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi, “Random sampling over joins revisited,” in SIGMOD, 2018. [2] T. Vu, A. Belussi, S. Migliorini, and A. Eldawy, “A learned query optimizer for spatial join,” in SIGSPATIAL, 2021.

PRELIMINARY No. 4 Definition: Random sampling over spatial range join Input: 𝑅 (a set of 𝑛 geo-spatial points), 𝑆 (a set of 𝑚 geo-spatial points), a range size, and 𝑡 (#samples) Output: 𝑡 samples of 𝐽 = 𝑟, 𝑠 | 𝑟 ∈ 𝑅, 𝑠 ∈ 𝑆, 𝑤 𝑟 ∩ 𝑠 , each of which is picked uniformly at random Full join result is not required! Run a join algorithm then do random sampling…? ⋈ Run sample(𝑅) ⋈ sample 𝑆 …? 𝐬𝐚𝐦𝐩𝐥𝐞 𝑹 𝐬𝐚𝐦𝐩𝐥𝐞 𝑺 𝑱 𝑺 Necessary for stats. & distribution estimations! Intuitive algorithms: Intuitive approaches: 𝑹 𝑤 𝑟 𝒕 samples Super slow response owing to the 𝑂(𝑛𝑚) time complexity ⋈ 𝑱′ 𝒕 samples sample 𝑅 ⋈ sample 𝑆 ≠ sample 𝐽 [3], i.e., not uniform samples [1] Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi, “Random sampling over joins revisited,” in SIGMOD, 2018. [3] S. Chaudhuri, R. Motwani, and V. Narasayya, “On random sampling over joins,” ACM SIGMOD Record, vol. 28, no. 2, pp. 263–274, 1999.

CONTRIBUTION No. 5 Challenge How to obtain 𝑡 uniform random samples without running a join algorithm, i.e., in less than 𝑶(𝒏𝒎) time? Findings There exists an 𝑂෨ 𝑛 + 𝑚 + 𝑡 expected time and 𝑂(𝑛 + 𝑚) space algorithm that outputs 𝑡 join pairs, each of which is picked uniformly at random. Experiments  We demonstrate that our algorithm is much faster Time Space KDS [4] 𝑂 Our implementation is available at KDS-rejection [1 + 4] 𝑂 𝑛+𝑚+ https://github.com/amgt-d1/RS-over-SRJ. BBST (Ours) ෩ 𝒏 + 𝒎 + 𝒕 in expectation 𝑶 than baseline algorithms.  Comparison of time and space complexities. 𝑂෨ ⋅ hides polylog factors. [1] T. Vu, A. Belussi, S. Migliorini, and A. Eldawy, “A learned query optimizer for spatial join,” in SIGSPATIAL, 2021. [4] D. Xie, J. M. Phillips, M. Matheny, and F. Li, “Spatial independent range sampling,” in SIGMOD, 2021. 𝑛+𝑡 𝑚 𝑛𝑚1.5 𝑡 𝐽 𝑂 𝑛+𝑚 in expectation 𝑂 𝑛+𝑚 𝑶 𝒏+𝒎

https://github.com/amgt-d1/RS-over-SRJ

BASELINE 1: KDS [4] No. 6 Range sampling algorithm that randomly picks 𝒔 ∈ 𝑺 𝒘 𝒓  Observation (i.e., how to employ for our problem): 𝐽 = ‫𝑟 𝑅∈𝑟ڂ‬, 𝑠 | 𝑠 ∈ 𝑆 𝑤 𝑟 0. Pre-processing 1. Range counting Build 𝒦, a kd-tree of 𝑆 Compute 𝑆 𝑤 𝑟 𝑟 ∈ 𝑅 on 𝒦 for each , i.e., union of range “search” results 2. Building alias structure 3. Sampling So that each 𝑟 ∈ 𝑅 is picked Sample 𝑟 ∈ 𝑅 with the alias, run KDS on 𝒦, and return 𝑟, 𝑠 . 𝑆 𝑤 𝑟 with probability σ 𝑆 𝑤 𝑟′ 𝑟′ ∈𝑅 𝑆 𝑤 𝑟 𝒘𝒓 𝑂(𝑚 log 𝑚) time & 𝑂(𝑚) space ⊆ 𝑺 overlapping 𝒘 𝒓 𝑶(𝒏 𝒎) time [1] T. Vu, A. Belussi, S. Migliorini, and A. Eldawy, “A learned query optimizer for spatial join,” in SIGSPATIAL, 2021. [4] D. Xie, J. M. Phillips, M. Matheny, and F. Li, “Spatial independent range sampling,” in SIGMOD, 2021. 𝑟1 1 𝑟2 2 𝑟3 3 𝑟4 4 𝑟5 5 . Alias 𝑂(𝑛) time & 𝑂 𝑛 space 𝒘 𝒓𝟑 𝑶(𝒕 𝒎) time Large computational costs for range counting and sampling

BASELINE 2: KDS-REJECTION [1, 4] No. 7 Incorporating rejection sampling − Upper-bounding sampling probability of a pair ∈ 𝐽 −  Let 𝜇 𝑟 ≥ |𝑆 𝑤 𝑟 |, and assume 𝑟, 𝑠 is sampled. It is accepted with probability 𝑆 𝑤 𝑟 /𝜇(𝑟). 0. Pre-processing 1. Approx. range counting 2. Building alias structure 3. Sampling Build 𝒦, a kd-tree of 𝑆 Build a grid of 𝑆 with no empty cells. Compute 𝜇 𝑟 , #points in the overlapping cells, ∀𝑟 ∈ 𝑅. So that each 𝑟 ∈ 𝑅 is picked 𝜇(𝑟) with probability σ . 𝜇 𝑟′ Sample 𝑟 ∈ 𝑅, run KDS, and 𝑤 𝑟 return 𝑟, 𝑠 with prob. 𝑆 𝜇(𝑟) . 𝑟′ ∈𝑅 𝜇 𝑟 𝝁 𝒓 =𝟒 𝑂(𝑚 log 𝑚) time & 𝑂(𝑚) space 𝑂(𝑛) time [1] Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi, “Random sampling over joins revisited,” in SIGMOD, 2018. [4] D. Xie, J. M. Phillips, M. Matheny, and F. Li, “Spatial independent range sampling,” in SIGMOD, 2021. 𝑟1 1 𝑟2 2 𝑟3 3 𝑟4 4 𝑟5 5 Alias 𝒘 𝒓𝟑 𝑂(𝑛) time & 𝑂 𝑛 space 𝑶(𝒕𝒏𝒎𝟏.𝟓/ 𝑱 ) expected time Still large computational cost for sampling

OUR ALGORITHM: OVERVIEW No. 8 Question: Is there an efficient data structure both for counting and sampling? Answer: Yes, and we build a range size-dependent structure online (almost no pre-processing is required). 1. Building a grid & BBSTs 2. Approx. range counting 2. Building alias structure 3. Sampling Build a grid of 𝑆 with no empty cells. For each cell, build our data structures BBSTs. Count 𝜇 𝑟 ∀𝑟 ∈ 𝑅 with the grid and BBST. Each 𝑟 ∈ 𝑅 is picked with 𝜇(𝑟) probability σ . 𝜇 𝑟′ Sample 𝑟 ∈ 𝑅 with the alias, use the grid (and BBST), and return 𝑟, 𝑠 iff 𝑤 𝑟 ∩ 𝑠. 𝑟′ ∈𝑅 𝜇 𝑟 ෨ 𝑂(𝑚) time & 𝑂(𝑚) space ෨ 𝑂(𝑛) time 𝑟1 1 𝑟2 2 𝑟3 3 𝑟4 4 𝑟5 5 Alias 𝑂(𝑛) time & 𝑂 𝑛 space 𝜇 𝑟 Guarantee: 𝑆 𝑤 𝑟 ≤ 𝑂 log 𝑚 ෩ (𝒕) expected time 𝑶

10.

OUR ALGORITHM: IDEA No. 9 Reduction to 0-, 1-, and 2-sided range counting & sampling by making each cell ½-scale of 𝒘 𝒓  0-sided: 𝑐5 <- Fully covered by 𝑤 𝑟  1-sided: 𝑐2 , 𝑐4 , 𝑐6 , 𝑐8 <- Caring only x- or y-dimension  2-sided: 𝑐1 , 𝑐3 , 𝑐7, 𝑐9 <- With 𝑂(𝑚) space, how to (i) obtain a bounded count and (ii) run efficient sampling? z-sided case’s performance. 𝑂෨ ⋅ hides polylog factors. 0-sided 1-sided 2-sided Count & sampling time 𝑂 1 𝑂෨ 1 ? Space 𝑂 1 𝑂 𝑚 ? Count approximation bound Exact Exact ? Random sampling? Yes Yes ? 3 6 2 5 1 4 9 𝑟 8 7

11.

OUR ALGORITHM: BBST (BUCKET-BASED BINARY SEARCH TREE) No. 10 BBST is (i) a balanced BST of a set of buckets and (ii) for 2-sided range counting & sampling  A bucket 𝐵 of 𝑆 𝑐 , a set of points in a cell 𝑐, is a sequence of log 𝑚 points sorted in ascending order of the x-coordinate ◼  min 𝑠𝑥 , max 𝑠𝑥 , min 𝑠𝑦 , and max 𝑠𝑦 are recorded (min 𝑠𝑥 or max 𝑠𝑥 is used as the coordinate of 𝐵) 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 Each node 𝑢𝑖 has: ◼ ◼ ℬ𝑖𝑚𝑖𝑛 (ℬ𝑖𝑚𝑎𝑥 ): list of buckets having the same x-coordinate of 𝑢𝑖 , where the buckets are sorted in ascending order of min 𝑠𝑦 (max 𝑠𝑦 ). 𝐵∈𝑆 𝐵∈𝑆 𝑚𝑖𝑛 𝑚𝑎𝑥 𝐴𝑖 (𝐴𝑖 ): array of buckets existing in the subtree rooted at 𝑢𝑖 , where the buckets are sorted in ascending order of min 𝑠𝑦 (max 𝑠𝑦 ). 𝐵∈𝑆 𝐵∈𝑆 𝐵2 𝐵4 ℬ1𝑚𝑖𝑛 [𝐵2] 𝐵1 ⋯ 𝐵4 𝐴𝑚𝑖𝑛 𝑟𝑜𝑜𝑡 𝑚𝑎𝑥 [𝐵4] ℬ𝑟𝑜𝑜𝑡 𝐵7 ⋯ 𝐵2 𝐴𝑚𝑎𝑥 𝑟𝑜𝑜𝑡 𝑢𝑟𝑜𝑜𝑡 ℬ1𝑚𝑎𝑥 [𝐵2] 𝐵6 𝐵3 𝑚𝑖𝑛 𝐵4 ℬ𝑟𝑜𝑜𝑡 𝐴1𝑚𝑖𝑛 𝐵1 𝐵3 𝐵2 𝐴1𝑚𝑎𝑥 𝐵1 𝐵3 𝐵2 𝐵5 𝐵7 𝐵6 𝐵5 𝐵7 𝐵6 𝐵5 𝐴𝑚𝑖𝑛 2 𝑚𝑎𝑥 𝐴2 𝐵6 ℬ2𝑚𝑖𝑛 𝑢1 𝑢2 𝐵6 ℬ2𝑚𝑎𝑥 𝐵7 𝐵1 𝑐 𝑢3 𝑢4 𝑢5 𝑢6

12.

OUR ALGORITHM: BBST’s SPACE & BUILDING TIME No. 11 BBST is (i) a balanced BST of a set of buckets and (ii) for 2-sided range counting & sampling  A bucket 𝐵 of 𝑆 𝑐 , a set of points in a cell 𝑐, is a sequence of log 𝑚 points sorted in ascending order of the x-coordinate ◼  min 𝑠𝑥 , max 𝑠𝑥 , min 𝑠𝑦 , and max 𝑠𝑦 are recorded (min 𝑠𝑥 or max 𝑠𝑥 is used as the coordinate of 𝐵) 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 𝐵∈𝑆 Each node 𝑢𝑖 has ◼ ◼ ℬ𝑖𝑚𝑖𝑛 (ℬ𝑖𝑚𝑎𝑥 ): list of buckets having the same x-coordinate of 𝑢𝑖 , where the buckets are sorted in ascending order of min 𝑠𝑦 (max 𝑠𝑦 ). 𝐵∈𝑆 𝐵∈𝑆 𝑚𝑖𝑛 𝑚𝑎𝑥 𝐴𝑖 (𝐴𝑖 ): array of buckets existing in the subtree rooted at 𝑢𝑖 , where the buckets are sorted in ascending order of min 𝑠𝑦 (max 𝑠𝑦 ). 𝐵∈𝑆 𝐵∈𝑆 Space complexity: assume 𝑁 = 𝑆 𝑐 𝑁  #buckets: log 𝑚  Height: ℎ = 𝑂 log log 𝑚 𝑁  Each bucket is maintained in 𝑂(ℎ) nodes.  𝑵 Total: 𝐥𝐨𝐠 𝒎 × 𝑶 𝒉 =𝑶 𝑵 ℬ1𝑚𝑖𝑛 [𝐵2] 𝑚𝑖𝑛 𝐵4 ℬ𝑟𝑜𝑜𝑡 𝐵1 ⋯ 𝐵4 𝐴𝑚𝑖𝑛 𝑟𝑜𝑜𝑡 𝑚𝑎𝑥 [𝐵4] ℬ𝑟𝑜𝑜𝑡 𝐵7 ⋯ 𝐵2 𝐴𝑚𝑎𝑥 𝑟𝑜𝑜𝑡 𝑢𝑟𝑜𝑜𝑡 ℬ1𝑚𝑎𝑥 [𝐵2] 𝐴1𝑚𝑖𝑛 𝐵1 𝐵3 𝐵2 𝐴1𝑚𝑎𝑥 𝐵1 𝐵3 𝐵2 Sorting 𝑁 (copies of) points: 𝑂 𝑁 log 𝑁  Building a BBST: 𝑂(𝑁)  Total: 𝑶 𝑵 𝐥𝐨𝐠 𝑵 𝑢3 𝐵6 𝐵5 𝐵7 𝐵6 𝐵5 𝐴𝑚𝑖𝑛 2 𝑚𝑎𝑥 𝐴2 𝐵6 ℬ2𝑚𝑖𝑛 𝑢1 𝑢2 Building time: assume 𝑁 = 𝑆 𝑐  𝐵7 𝑢4 𝑢5 𝐵6 ℬ2𝑚𝑎𝑥 𝑢6

13.

OUR ALGORITHM: APPROXIMATE RANGE COUNTING WITH BBST No. 12 BBST is (i) a balanced BST of a set of buckets and (ii) for 2-sided range counting & sampling  𝜇 𝑟, 𝑐 = log 𝑚 × 𝐵 | 𝐵 ∈ ℬ 𝑐 , 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 ≤ max 𝑠𝑥 , 𝑤 𝑟 . 𝑦𝑚𝑖𝑛 ≤ max 𝑠𝑦 1. Identify the nodes such that 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 ≤ max 𝑠𝑥 2. Identify the number of buckets satisfying 𝑤 𝑟 . 𝑦𝑚𝑖𝑛 ≤ max 𝑠𝑦 𝑠∈𝐵 𝑠∈𝐵 <- 𝑂(ℎ) nodes exist. 𝑠∈𝐵 𝑠∈𝐵 ෨ Time complexity: 𝑂 ℎ × 𝑂 log 𝑚 ≤ 𝑂(1) 𝐵2 Approximation bound: 𝑂(log 𝑚) Count #buckets overlapping 𝒘 𝒓 𝐵4 ℬ1𝑚𝑖𝑛 [𝐵2] 𝑚𝑖𝑛 𝐵4 ℬ𝑟𝑜𝑜𝑡 𝐵1 ⋯ 𝐵4 𝐴𝑚𝑖𝑛 𝑟𝑜𝑜𝑡 𝑚𝑎𝑥 [𝐵4] ℬ𝑟𝑜𝑜𝑡 𝐵7 ⋯ 𝐵2 𝐴𝑚𝑎𝑥 𝑟𝑜𝑜𝑡 𝑢𝑟𝑜𝑜𝑡 ℬ1𝑚𝑎𝑥 [𝐵2] 𝐵6 𝐵3 <- Binary searches for 𝑂(ℎ) lists/arrays 𝐴1𝑚𝑖𝑛 𝐵1 𝐵3 𝐵2 𝐴1𝑚𝑎𝑥 𝐵1 𝐵3 𝐵2 𝐵5 𝐵7 𝐵6 𝐵5 𝐵7 𝐵6 𝐵5 𝐴𝑚𝑖𝑛 2 𝑚𝑎𝑥 𝐴2 𝐵6 ℬ2𝑚𝑖𝑛 𝑢1 𝑢2 𝐵6 ℬ2𝑚𝑎𝑥 𝐵7 𝐵1 𝑐 𝑢3 𝑢4 𝑢5 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 𝑢6

14.

OUR ALGORITHM: RANDOM SAMPLING WITH BBST No. 13 BBST is (i) a balanced BST of a set of buckets and (ii) for 2-sided range counting & sampling  Probabilistic traversal of nodes having buckets overlapping 𝑤 𝑟  Assume 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 ≤ 𝑢𝑟𝑜𝑜𝑡 . 𝑥 . (Otherwise, we traverse its right child node.) 1. 𝑚𝑎𝑥 If 𝑟𝑛𝑑1 ∈ [1, 𝜇] ≤ 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 (= an upper-bound #points ∩ 𝑤 𝑟 in 𝑢𝑟𝑜𝑜𝑡 ), sample a random 𝐵 ∈ ℬ𝑟𝑜𝑜𝑡 , and sample a random 𝑠 ∈ 𝐵. 2. Else, we sample a random 𝐵 ∈ 𝐴𝑚𝑎𝑥 , and sample a random 𝑠 ∈ 𝐵 if 𝑟𝑛𝑑2 ∈ 1, 𝜇 − 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 ≤ 𝑐𝑛𝑡2 2 Return (𝑟, 𝑠) iff 𝑤 𝑟 ∩ 𝑠 and repeat the above while 𝜇 ← 𝜇 − 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 − 𝑐𝑛𝑡2 until we have a sample. 𝐵2 𝐵4 ℬ1𝑚𝑖𝑛 [𝐵2] 𝐵1 ⋯ 𝐵4 𝐴𝑚𝑖𝑛 𝑟𝑜𝑜𝑡 𝑚𝑎𝑥 [𝐵4] ℬ𝑟𝑜𝑜𝑡 𝐵7 ⋯ 𝐵2 𝐴𝑚𝑎𝑥 𝑟𝑜𝑜𝑡 𝑢𝑟𝑜𝑜𝑡 ℬ1𝑚𝑎𝑥 [𝐵2] 𝐵6 𝐵3 𝑚𝑖𝑛 𝐵4 ℬ𝑟𝑜𝑜𝑡 𝐴1𝑚𝑖𝑛 𝐵1 𝐵3 𝐵2 𝐴1𝑚𝑎𝑥 𝐵1 𝐵3 𝐵2 𝑩𝟓 𝐵7 𝐵6 𝐵5 𝐵7 𝐵6 𝐵5 𝐴𝑚𝑖𝑛 2 𝑚𝑎𝑥 𝐴2 𝐵6 ℬ2𝑚𝑖𝑛 𝑢1 𝑢2 𝐵6 ℬ2𝑚𝑎𝑥 𝐵7 𝐵1 𝑐 𝑢3 𝑢4 𝑢5 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 𝑢6

15.

OUR ALGORITHM: RANDOM SAMPLING WITH BBST No. 14 BBST is (i) a balanced BST of a set of buckets and (ii) for 2-sided range counting & sampling  Probabilistic traversal of nodes having buckets overlapping 𝑤 𝑟  Assume 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 ≤ 𝑢𝑟𝑜𝑜𝑡 . 𝑥 . (Otherwise, we traverse its right child node.) 1. 𝑚𝑎𝑥 If 𝑟𝑛𝑑1 ∈ [1, 𝜇] ≤ 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 (= an upper-bound #points ∩ 𝑤 𝑟 in 𝑢𝑟𝑜𝑜𝑡 ), sample a random 𝐵 ∈ ℬ𝑟𝑜𝑜𝑡 , and sample a random 𝑠 ∈ 𝐵. 2. Else, we sample a random 𝐵 ∈ 𝐴𝑚𝑎𝑥 , and sample a random 𝑠 ∈ 𝐵 if 𝑟𝑛𝑑2 ∈ 1, 𝜇 − 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 ≤ 𝑐𝑛𝑡2 2 Return (𝑟, 𝑠) iff 𝑤 𝑟 ∩ 𝑠 and repeat the above while 𝜇 ← 𝜇 − 𝑐𝑛𝑡𝑟𝑜𝑜𝑡 − 𝑐𝑛𝑡2 until we have a sample. Time complexity: 𝑚𝑖𝑛 𝐵4 ℬ𝑟𝑜𝑜𝑡 𝐵1 ⋯ 𝐵4 𝐴𝑚𝑖𝑛 𝑟𝑜𝑜𝑡  Computing 𝑐𝑛𝑡𝑖 : 𝑂෨ 1 𝑚𝑎𝑥 [𝐵4] ℬ𝑟𝑜𝑜𝑡 𝐵7 ⋯ 𝐵2 𝐴𝑚𝑎𝑥 𝑟𝑜𝑜𝑡  Expected #iterations to have 𝑤 𝑟 ∩ 𝑠: 𝑂෨ 1   Thanks to the approximation bound ෩ 𝟏 ×𝑶 𝒉 ×𝑶 ෩ 𝟏 =𝑶 ෩ 𝟏 Total: 𝑶 Correctness  ℬ1𝑚𝑖𝑛 [𝐵2] 𝑢𝑟𝑜𝑜𝑡 ℬ1𝑚𝑎𝑥 [𝐵2] 𝐴1𝑚𝑖𝑛 𝐵1 𝐵3 𝐵2 𝐴1𝑚𝑎𝑥 𝐵1 𝐵3 𝐵2 𝐵7 𝐵6 𝐵5 𝐵7 𝐵6 𝐵5 𝐴𝑚𝑖𝑛 2 𝑚𝑎𝑥 𝐴2 𝐵6 ℬ2𝑚𝑖𝑛 𝑢1 𝑢2 1 : Pr[(𝑟, 𝑠) is picked] = σ 𝜇 𝑟′ 𝑟′ ∈𝑅 𝐵6 ℬ2𝑚𝑎𝑥 Check paper for detail. 𝑢3 𝑢4 𝑢5 𝑤 𝑟 . 𝑥𝑚𝑖𝑛 𝑢6

16.

OUR ALGORITHM: SUMMARY No. 15 Reduction to 0-, 1-, and 2-sided range counting & sampling by making each cell ½-scale of 𝒘 𝒓  0-sided: 𝑐5 <- Fully covered by 𝑤 𝑟  1-sided: 𝑐2 , 𝑐4 , 𝑐6 , 𝑐8 <- Caring only x- or y-dimension  2-sided: 𝑐1 , 𝑐3 , 𝑐7, 𝑐9 ෨ 𝑆 𝑐 ) time and yields a uniform random sample in 𝑂෨ 1 time <- A BBST is built in 𝑂( with 𝑂( 𝑆 𝑐 ) space. z-sided case’s performance. 𝑂෨ ⋅ hides polylog factors. 0-sided 1-sided 2-sided Count & sampling time 𝑂 1 𝑂෨ 1 𝑂෨ 1 Space 𝑂 1 𝑂 𝑚 𝑂 𝑚 Count approximation bound Exact Exact 𝑂 log 𝑚 Random sampling? Yes Yes Yes 3 6 2 5 1 4 9 𝑟 8 7

17.

EXPERIMENT: DATASET, PRE-PROCESSING TIME, & MEMORY USAGE No. 16 Reasonable pre-processing time and memory usage even for > 100 million points * BBST only sorts 𝑆 in pre-processing. Datasets 𝑛+𝑚 Pre-processing time [sec] CaStreet Foursquare IMIS NYC 2,249,727 11,180,160 168,240,595 323,598,288 CaStreet Foursquare IMIS NYC KDS 0.36 1.89 29.09 48.16 BBST 0.14 0.99 14.23 22.97

18.

EXPERIMENT: DECOMPOSED TIME [sec], 𝒕 = 𝟏, 𝟎𝟎𝟎, 𝟎𝟎𝟎 No. 17 BBST shows a much better performance than the baseline algorithms. CaStreet Total KDS 35.29 KDS-rejection 71.10 BBST 1.54 Grid Mapping Upperbounding Sampling #iterations Total Grid Mapping 3.62 31.67 1,000,000 116.90 - 0.07 0.10 70.93 1,883,952 132.49 0.20 0.79 0.55 1,143,805 5.67 x23~ Foursquare Upperbounding Sampling #iterations 36.64 80.26 1,000,000 0.32 0.42 131.73 1,449,218 1.06 4.11 0.50 1,029,027 Upperbounding Sampling #iterations 555.97 2696.42 1,000,000 8.11 12.48 3790.74 1,803,540 21.74 7.60 1.21 1,168,306 x21~ IMIS NYC Total Grid Mapping Upperbounding Sampling #iterations Total Grid Mapping KDS 2844.62 - 2261.52 583.12 1,000,000 3242.38 - KDS-rejection 1062.01 5.20 6.81 1049.99 1,476,050 3811.34 12.67 58.68 0.67 1,066,613 30.55 BBST 72.02 x15~ x106~

19.

EXPERIMENT: IMPACT OF PARAMETERS No. 18

20.

CONCLUSION Problem Random sampling over spatial range join that returns 𝑡 join pairs sampled uniformly at random Findings We prove that there exists an 𝑂෨ 𝑛 + 𝑚 + 𝑡 expected time and 𝑂(𝑛 + 𝑚) space algorithm that picks 𝑡 join pairs, each of which is picked uniformly at random. Experiments  We demonstrate that our algorithm is much faster than the baseline algorithms.  Our implementation is available at https://github.com/amgt-d1/RS-over-SRJ. No. 19

https://github.com/amgt-d1/RS-over-SRJ