FIRAS@SIGMOD2026

313 Views

June 03, 26

#区間データ #範囲検索 #範囲サンプリング #アルゴリズム #データストリーム

スライド概要

Presentation slides of "FIRAS: A Framework for Interval Range Search and Sampling"

Daichi Amagata

@8828547526

スライド一覧

I'm a DB and AI researcher.

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

（ダウンロード不可）

関連スライド

Fair Clustering & Search 2

Daichi Amagata 1.5K

AIP加速課題成果報告@WebDB夏のワークショップ2025

Daichi Amagata 1.4K

Random Sampling over Spatial Range Joins@ICDE2025

Daichi Amagata 1.3K

2022-DBS-amagata

Daichi Amagata 667

Independent Range Sampling on Interval Data@ICDE2024

Daichi Amagata 617

Diversity Maximization in the Presence of Outliers@AAAI2023

Daichi Amagata 355

各ページのテキスト

FIRAS: A Framework for Interval Range Search and Sampling Daichi Amagata (The University of Osaka) Panagiotis Simatis (University of Ioannina) Panagiotis Bouros --- Nikos Mamoulis (University of Ioannina)

Background: Range Search on Interval Data No. 1 • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] finds 𝑹 = 𝒙 ∈ 𝑿 ∶ 𝒙 ∩ 𝒒 , i.e., “all” intervals ∈ 𝑿 that overlap 𝒒. Ex.1) Vehicle management systems: Show the vehicles that were active between 17:00 and 22:00 a week ago. Ex.2) Book management systems: Collect the books that were borrowed in the last month. Ex.3) Historical cryptocurrency databases: Show when the price of Bitcoin (BTC) falls in [30,000, 40,000] dollars.

Background: What if “Too Large Result Set”… No. 2 • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] finds 𝑹 = 𝒙 ∈ 𝑿 ∶ 𝒙 ∩ 𝒒 , i.e., “all” intervals ∈ 𝑿 that overlap 𝒒. Ex.1) Vehicle management systems: Show the vehicles that were active between 17:00 and 22:00 a week ago. Database system view: A range search must incur 𝛀 to obtain its result set. = 𝑶(𝒏) time 𝒙∈𝑿∶𝒙∩𝒒 Still 10,000,000 intervals! query interval 𝑞 Database 𝑋 of a billion intervals 𝑥 ∈𝑋 ∶𝑥∩𝑞 Issue 1: not easy to analyze/visualize Issue 2: slow response

Background: Range Sampling on Interval Data • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] randomly samples some intervals from 𝑹. Ex.1) Vehicle management systems: Show the vehicles that were active between 17:00 and 22:00 a week ago. Issue 1: not easy to analyze/visualize Ex.1) Vehicle management systems: Show randomly sampled vehicles that were active between 17:00 and 22:00 a week ago. Easy to analyze/visualize & accelerated efficiency (if only samples are obtained) No. 3

Problem Definition Problem Definition 1: Range search on interval data Input: 𝑋 (a set of 𝑛 intervals), 𝑞 (query interval) Output: 𝑅 = 𝑥 ∈ 𝑋 ∶ 𝑥 ∩ 𝑞 Problem Definition 2: Range counting on interval data Input: 𝑋 (a set of 𝑛 intervals), 𝑞 (query interval) Output: 𝑘 = |𝑅|, i.e., range search result size Problem Definition 3: IRS (Independent Range Sampling) on interval data Input: 𝑋 (a set of 𝑛 intervals), 𝑞 (query interval), and 𝑠 (#samples) Output: 𝑆 (a set of 𝑠 intervals, each of which is picked from 𝑅 uniformly at random) No. 4

Background: Seamless between Range Search & Sampling No. 5 Case 2 is useful/efficient ☺ Case 1 is bothering and not useful/efficient in practice  Range search Range search 𝑅 𝑅 in 𝑶(𝒌) time ☺ Run search algo. Run algo. x IRS IRS/Counting 𝑅 then 𝑆 or 𝑘 (2) 𝛀 𝒌 = 𝑶(𝒏) time  𝑆 in 𝑶(𝒔) time ☺ Run algo. x Run search algo. IRS (1) Range counting 𝑆 |𝑅| Another algorithm implementation  in 𝒐 𝒏 = 𝑶(𝒏𝟏−𝝐 ) time ☺ Run IRS algo. Run algo. x

Contributions No. 6 Challenge How to design an algorithm that satisfies the following requirements? • Range search and IRS are done in 𝑜 𝑛 + 𝑂(𝑘) and 𝑜 𝑛 + 𝑂(𝑠) times, respectively. • Only 𝑂(𝑛) space is required (to scale well to data size). • Updatable, i.e., it can deal with streaming interval data. Findings Method Range search time IRS time Space Update time Interval tree [1] 𝑂(log 𝑛 + 𝑘) 𝑂(𝑛) 𝑂(𝑛) 𝑂(𝑛 log 𝑛) AIT [2] 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛 log 𝑛) 𝑂(𝑛 log 𝑛) FIRAS (static) 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛) - FIRAS (dynamic) 𝑂( 𝑛 + 𝑘) 𝑂( 𝑛 + 𝑠) 𝑂(𝑛) 𝑂 𝑛 𝑂(log 𝑛) amortized [1] Computational Geometry: Algorithms and Applications, 3rd Edition, 2008. [2] “Independent Range Sampling on Interval Data,” In ICDE, 2024. Seamless?

FIRAS: Main Idea No. 7 Query result decomposition (decomposing 𝑅 into two disjoint sets) 𝑅 = 𝑥 ∈ 𝑋 ∶ 𝑥 ∩ 𝑞 = 𝑄1 ∪ 𝑄2 , where • 𝑸𝟏 = 𝒙 ∈ 𝑿 ∶ 𝒒. 𝒍 ≤ 𝒙. 𝒓 < 𝒒. 𝒓 , -> one-dimensional range search • 𝑸𝟐 = 𝒙 ∈ 𝑿 ∶ 𝒙. 𝒍 ≤ 𝒒. 𝒓 ≤ 𝒙. 𝒓 , -> stabbing query (𝑞. 𝑙 = 𝑞. 𝑟) on interval data • and 𝑄1 ∩ 𝑄2 . 𝑸𝟏 case: binary searches on a sorted array 𝑥𝑖 𝑥𝑗 … 𝑥𝑎 … 𝑞. 𝑙 … 𝑥𝑏 … … 𝑥𝑛 𝑞. 𝑟 𝑸𝟐 case: stabbing query processing on an interval data structure Merits: • Any structure is acceptable. • Guaranteed to be faster than range search Merits: • 𝑂(𝑛) space Default structure: interval tree • 𝑂(log 𝑛) time to identify the space of 𝑄1 • 𝑂(𝑛) space. • 𝑂(𝑘) time enumeration & 𝑂(1) time sampling • 𝑂(log 𝑛 + 𝑘) time stabbing query

FIRAS: IRS in 𝑶(log 2 𝒏 + 𝒔) time with 𝑶(𝒏) space 1. 𝑸𝟏 = 𝒙 ∈ 𝑿 ∶ 𝒒. 𝒍 ≤ 𝒙. 𝒓 < 𝒒. 𝒓 case Algorithm: Binary search of 𝑞𝑙 (𝑞𝑟 ) on the sorted array. 𝑥𝑖 𝑥𝑗 … 𝑥𝑎 … … 𝑥𝑏 … 𝑞. 𝑙 … 𝑥𝑛 No. 8 2. 𝑸𝟐 = 𝒙 ∈ 𝑿 ∶ 𝒙. 𝒍 ≤ 𝒒. 𝒓 ≤ 𝒙. 𝒓 case Algorithm: For each node of the interval tree, Binary search on the corresponding sorted list 𝑞. 𝑟 Time complexity: 𝑂(log 𝑛) 3. Weighted sampling Algorithm: 1. Build an alias structure [3] on the index ranges in 2. 2. Repeat the following: • Generate a random value 𝑣 ∈ [1, 𝑘]. • If 𝑣 ≤ |𝑄1 |, random sampling from 𝑄1 . • Otherwise, random sampling from the alias. Time complexity: 𝑂 log 𝑛 + 𝑠 Time complexity: 𝑂 log 2 𝑛 = 𝑂 log 𝑛 × 𝑂(log 𝑛) Sampling probability of 𝒙 ∈ 𝑹: 1/𝑘 [3] “New fast method for generating discrete random numbers with arbitrary frequency distributions,” Electronics Letters 10 (1974), 127–128. bin. search cost tree height

10.

FIRAS: Range Search in 𝑶(log 𝒏 + 𝒌) time with 𝑶(𝒏) space 1. 𝑸𝟏 = 𝒙 ∈ 𝑿 ∶ 𝒒. 𝒍 ≤ 𝒙. 𝒓 < 𝒒. 𝒓 case Algorithm: Binary search of 𝑞𝑙 (𝑞𝑟 ) on the sorted array 𝑥𝑖 𝑥𝑗 … 𝑥𝑎 … 𝑞. 𝑙 … 𝑥𝑏 … … 𝑥𝑛 No. 9 2. 𝑸𝟐 = 𝒙 ∈ 𝑿 ∶ 𝒙. 𝒍 ≤ 𝒒. 𝒓 ≤ 𝒙. 𝒓 case Algorithm: For each node of the interval tree, Binary search on the corresponding sorted list 𝑞. 𝑟 Time complexity: 𝑂(log 𝑛) 3. Scan the index ranges Algorithm: For each array/list accessed, scan the index range identified in 1 or 2. Time complexity: 𝑂(𝑘) Note: • • In the 𝑄2 case, binary searches are not required. Then, the time becomes 𝑂(log 𝑛 + 𝑘). Time complexity: 𝑂 log 2 𝑛 = 𝑂 log 𝑛 × 𝑂(log 𝑛) list size tree height

11.

FIRAS: Streaming Interval Data No. 10 Assumption: Intervals arrive in a point-wise manner. • An interval 𝑥 first has only 𝑥. 𝑙, i.e., it is an infinite interval [𝑥. 𝑙, ∞). • Sometime later, it has 𝑥. 𝑟, i.e., it becomes a closed interval 𝑥 = [𝑥, 𝑙, 𝑥. 𝑟]. 𝑥2 . 𝑙 𝑥1 . 𝑙 𝑥2 . 𝑟 𝑥3 . 𝑙 𝑥3 . 𝑟 … 𝑡 Observation: • 𝑥 = 𝑥. 𝑙, ∞ ∩ 𝑞 iff 𝑥. 𝑙 ≤ 𝑞. 𝑟. • One-sided one-dimensional range search Data structure for infinite intervals: • • A sequence of buffers 𝐵1 , … , 𝐵𝑏 • 𝐵𝑖 contains at most 𝑛 points. • 𝑂( 𝑛) time for merging 𝑶 𝟏 time insertion/deletion • • Hash table-based implementation 𝑶( 𝒏) time for range search/IRS … 𝑥1 𝒒

12.

FIRAS: Evolving Domain Interval Tree (EDIT) No. 11 Basic structure: Balanced binary search tree based on the domain size 𝐷 • Idea: The EDIT is doubled (i.e., nodes are created) only when 𝐷 > 𝐷𝑒𝑑𝑖𝑡 and 𝐷 is an odd integer. • Merits: Nodes are added incrementally, and existing intervals do no need to move. Theoretical properties: Theoretical properties: Theoretical properties: Update time (𝑫 = 𝑶 𝒏 ) Space complexity Stabbing query time • • • 𝑂(1) amortized tree update time • 𝑂 log 𝐷 + 𝑛 𝐷 ≈ 𝑂(log 𝑛) average interval insertion time 𝑂(max 𝑛, 𝐷 ) 𝑂 log 𝐷 log 𝑛 < 𝑂( 𝑛)

13.

Experiment: Setting No. 12 Algorithms for streaming interval data Algorithms for static interval data Method Range search time IRS time Space Method Range search time IRS time Space Update time Interval tree [1] 𝑂(log 𝑛 + 𝑘) 𝑂(𝑛) 𝑂(𝑛) LIT [4] 𝑂(𝑛) 𝑂(𝑛) 𝑂(𝑛) - AIT [2] 𝑂(log 𝑛 + 𝑘) LIT-EDAIT 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛 log 𝑛) HINT [3] 𝑂(𝑛) 𝑂(𝑛) 𝑂(𝑛) LIT-FIRAS 𝑂( 𝑛 + 𝑘) 𝑂( 𝑛 + 𝑠) FIRAS 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛 log 𝑛) [1] Computational Geometry: Algorithms and Applications, 3rd Edition, 2008. [2] “Independent Range Sampling on Interval Data,” In ICDE, 2024. [3] “HINT: a hierarchical interval index for Allen relationships,” VLDB J. 33, 1 (2024), 73–100. [4] “LIT: Lightning-fast In-memory Temporal Indexing,” Proc. ACM Manag. Data 2, 1 (2024), 20:1–20:27. 𝑂(𝑛) 𝑂 𝑛 𝑂(log 𝑛) amortized

14.

Experiment: Result on Static Data (Range search & Memory size) No. 13

15.

Experiment: Result on Static Data (Independent Range Sampling) No. 14

16.

Experiment: Result on Streaming Data (Range search & Memory size) No. 15

17.

Experiment: Result on Streaming Data (Independent Range Sampling) No. 16 Key takeaway: FIRAS has the best tradeoff between query time, space, and update-efficiency.

18.

Conclusion No. 17 Problems • Range search/counting and IRS on static and streaming interval data Findings Method Range search time IRS time Space Update time Interval tree [1] 𝑂(log 𝑛 + 𝑘) 𝑂(𝑛) 𝑂(𝑛) 𝑂(𝑛 log 𝑛) AIT [2] 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛 log 𝑛) 𝑂(𝑛 log 𝑛) FIRAS (static) 𝑂(log 𝑛 + 𝑘) 𝑂(log 2 𝑛 + 𝑠) 𝑂(𝑛) - FIRAS (dynamic) 𝑂( 𝑛 + 𝑘) 𝑂( 𝑛 + 𝑠) 𝑂(𝑛) 𝑂 𝑛 𝑂(log 𝑛) amortized Experiments • We demonstrate that FIRAS has the best tradeoff between query time, space, and update-efficiency. • Our implementation is available at https://github.com/psimatis/FIRAS. Seamless?

https://github.com/psimatis/FIRAS