AIP Acceleration Research Results Report @ WebDB Summer Workshop 2025


September 16, 2025

Slide overview


I'm a DB and AI researcher.


Text of each slide
1.

Fair Reduction of Database Size and Search Results
Daichi Amagata, Graduate School of Information Science and Technology, Osaka University

2.

Self-introduction
◼ Name: Daichi Amagata (天方 大地)
◼ Job: Associate Professor at Osaka University + Nagoya University (cross-appointment)
- BE: Osaka University
- MSc: Osaka University
- Ph.D. (Information Science): Osaka University
◼ Research theme: fast algorithms for DB, DM, and ML
- ACT-I (2017-2019), PRESTO (2019-2023), AIP Acceleration Research (principal co-investigator, 2023-2026)
- BOOST (2025-)

3.

Research theme: fast algorithms for databases, data mining, and machine learning
• Query processing: make this computation as fast as possible, then output the result.
• Google and Microsoft also develop such algorithms intensively in-house; Meta incorporates promising algorithms into its Faiss library.
• Goal: algorithms that actually run fast in practice and come with theoretical performance guarantees!

4.

Research theme: fast algorithms for databases, data mining, and machine learning
Huge input X (a set of n data items) → output Y. Until the early 2010s, n = a million was considered large; since the late 2010s, n = a billion is not unusual.
Tasks and publications: search (ICDE'19, RecSys'21-22, TKDE'22, SIGIR'23, ICDE'24-25), clustering (SIGMOD'21, AISTATS'24), summarization (AAAI'23), and outlier detection (SIGMOD'21, VLDBJ'22).

5.

Search: output the data that match a condition. A filter maps the huge input X (a set of n data items) to the output Y.

6.

Clustering: partition X into groups. An assignment maps the huge input X (a set of n data items) to the output Y.

7.

Summarization: X → X′ s.t. |X′| ≪ |X|. A selection maps the huge input X (a set of n data items) to the output Y:
S* = argmax_{S ⊆ X, |S| = k} f(S), where f(S) = min_{x, x′ ∈ S} dist(x, x′).
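The selection objective above maximizes the diversity f(S), the distance of the closest pair kept in the summary. A minimal sketch in Python (toy 2D points; the brute-force search over all k-subsets is only for illustration, not the actual summarization algorithm):

```python
import itertools
import math

def diversity(S):
    """f(S) = min_{x, x' in S} dist(x, x'): the diversity the summary maximizes."""
    return min(math.dist(x, y) for x, y in itertools.combinations(S, 2))

def summarize_brute_force(X, k):
    """S* = argmax_{S subset of X, |S| = k} f(S), by exhaustive search (toy sizes only)."""
    return max(itertools.combinations(X, k), key=diversity)

X = [(0, 0), (1, 0), (4, 0), (4, 1)]
S = summarize_brute_force(X, k=2)   # the farthest pair is the most diverse summary
```
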

8.

Outlier detection: output the data that match an outlier condition. A classifier maps the huge input X (a set of n data items) to the output Y.

9.

Research theme: fast algorithms for databases, data mining, and machine learning
Huge input X (a set of n data items) → output Y (until the early 2010s, n = a million was large; since the late 2010s, n = a billion is not unusual).
Search, clustering, summarization, and outlier detection: formulations that incorporate the concept of fairness.

10.

What does "fairness" mean in this field? There are many definitions, but in the research I work on, fairness is (arguably) ≒ equality. Then, what does equality mean in "search", "clustering", and "summarization"?

11.

What does "fairness" mean in this field? Huge input X (a set of n data items) → output Y (until the early 2010s, n = a million was large; since the late 2010s, n = a billion is not unusual). Fairness ≒ equality. What does equality mean in "search", "clustering", and "summarization"?

12.

What does "fairness" mean in the search problem? Huge input X (a set of n data items) → Z, the set satisfying the requirement (a set of m data items) → output Y.
∀x ∈ Z, Prob[x ∈ Y] = 1/|Z|: a statistically important condition.
Z: all data satisfying the condition. Y: data randomly sampled from Z.

13.

What is hard about the fair search problem? The straightforward approach computes Z (the set of m data items satisfying the requirement) from the huge input X and then samples from it at random. When Z is large, this must access at least that many data items, so it is slow (high computational cost).
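The costly baseline described above (materialize Z with a full scan, then sample) can be sketched as follows; the inner-product threshold predicate and the toy data are illustrative assumptions, not the paper's algorithm:

```python
import random

def fair_search_naive(X, predicate, t, rng=random):
    """Fair but slow: build Z = {x in X : predicate(x)} with an O(n) scan,
    then draw t independent uniform samples from Z. Every x in Z enters any
    single draw with probability 1/|Z|, but the scan dominates on large X."""
    Z = [x for x in X if predicate(x)]           # O(n): the bottleneck
    return [rng.choice(Z) for _ in range(t)]

# Hypothetical query: inner product with q at least tau.
q, tau = (1.0, 2.0), 5.0
X = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 0.0)]
samples = fair_search_naive(X, lambda x: x[0] * q[0] + x[1] * q[1] >= tau, t=3)
```
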

14.

What did we achieve for fair search?
• Inner product search [1]: from {x | x ∈ X, x · q ≥ τ}, sampling t random data items takes expected time proportional to log n + t.
• Interval search [2]: from {x | x ∈ X, x ∩ q ≠ ∅}, sampling t random data items takes time proportional to log² n + t.
• Spatial join [3]: from {(x, y) | (x, y) ∈ X ⋈ Y}, sampling t random pairs takes expected time nearly proportional to n + m + t.
[1] K. Aoyama et al., "Simpler is Much Faster: Fair and Independent Inner Product Search," In SIGIR, 2023.
[2] D. Amagata, "Independent Range Sampling on Interval Data," In ICDE, 2024.
[3] D. Amagata, "Random Sampling over Spatial Range Joins," In ICDE, 2025.

15.

What does "fairness" mean in clustering? Group fairness: the attribute of each data item is taken into account. Individual fairness: similar data → similar results.

16.

What is "group fairness" in clustering?
• Without over- & under-representation: α_j ≤ |C_i ∩ P_j| ≤ β_j → each cluster C_i contains between α_j and β_j data items of attribute j → no cluster is dominated by a particular attribute.
• Matroid constraint: |S ∩ P_j| = k_j ∀j ∈ [1, m], |S| = k → k_j cluster centers are chosen from attribute j → the cluster centers are not dominated by a particular attribute.
• Social fairness: min_S max_{i ∈ [1, m]} Δ(C, P_i)/|P_i| → suppress cluster centers that are far from some attribute group → prevent the centers from being far from a particular attribute.

17.

What is "individual fairness" in clustering? Definition: for each data item, a cluster center is contained in its (n/k)-nearest neighbors (similar data → similar results).

18.

What is hard about fair clustering?
k-clustering: S* = argmin_{|S| = k} f(P, S), where S is the set of cluster centers. Computing S* is NP-hard (no exact solution in polynomial time unless P = NP) → we want to find a solution close to S* quickly, under a group-fairness constraint:
• Without over- & under-representation: α_j ≤ |C_i ∩ P_j| ≤ β_j → each cluster C_i contains between α_j and β_j data items of attribute j → no cluster is dominated by a particular attribute.
• Matroid constraint: |S ∩ P_j| = k_j ∀j ∈ [1, m], |S| = k → k_j cluster centers are chosen from attribute j → the cluster centers are not dominated by a particular attribute.
• Social fairness: min_S max_{i ∈ [1, m]} Δ(C, P_i)/|P_i| → suppress cluster centers that are far from some attribute group → prevent the centers from being far from a particular attribute.

19.

Fair k-center Clustering with Outliers
Daichi Amagata
Presented at AISTATS 2024

20.

Background: Data Summarization
This dataset (X) is too large for my application (or resources)… I want to reduce its size to k. X: a set of n points; S: a set of k (≪ n) points.
• Sampling: quite easy, and the best choice if the data distribution should be preserved (but the summary may be skewed).
• Submodular maximization: users must specify an appropriate function (which may not be trivial).
• Diversity maximization: easy to specify, and the summary can avoid skewness.

21.

Background: k-center Clustering
S* = argmin_{S ⊆ P, |S| = k} max_{p ∈ P} dist(p, S). It's diverse!
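As background, the classic farthest-point greedy (Gonzalez-style) heuristic is a known 2-approximation for this plain k-center objective; a minimal sketch (textbook background, not the fair/outlier-aware algorithm presented later):

```python
import math

def gonzalez_k_center(P, k):
    """Farthest-point greedy: repeatedly add the point farthest from the
    current centers. Known to 2-approximate min max_p dist(p, S)."""
    S = [P[0]]                                  # arbitrary first center
    d = [math.dist(p, S[0]) for p in P]         # distance to nearest center
    while len(S) < k:
        i = max(range(len(P)), key=d.__getitem__)   # farthest point
        S.append(P[i])
        d = [min(d[j], math.dist(P[j], P[i])) for j in range(len(P))]
    return S, max(d)                            # centers, achieved radius

P = [(0, 0), (0.5, 0), (10, 0), (10.5, 0)]
S, radius = gonzalez_k_center(P, k=2)           # radius 0.5 here
```
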

22.

Background: k-center Clustering + Outliers
S* = argmin_{S ⊆ P∖P_out, |S| = k} max_{p ∈ P∖P_out} dist(p, S), where |P_out| ≤ (1 + ε)z. It's diverse!

23.

Background: k-center Clustering + Outliers Cannot Be Fair
Example: k-center clustering picks only red centers. Make it "FAIR": in the past, Google image retrieval for "CEO" returned only "male" results. The summary is diverse if the centers are from inliers and the required demographic groups.

24.

Preliminary
Problem definition.
Input: P (a set of n points), k, k_1, …, k_m (k ≤ Σ k_i), z < n, and ε > 0.
Output: S* = argmin_{S ⊆ P∖P_out, |S| = k, ∀i ∈ [1, m]: |S ∩ P_i| ≤ k_i} max_{p ∈ P∖P_out} dist(p, S) ▷ NP-hard. (The constraints |S ∩ P_i| ≤ k_i encode fairness; P∖P_out are the inliers.)
First attempt: how about running a state-of-the-art (k, (1 + ε)z)-center clustering (e.g., [3]) with an approach that selects centers only from groups that do not violate the fairness constraint [4]?
Claim (no approximation bound): the clustering quality of the above algorithm can be arbitrarily bad.
[3] "Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction," In ESA, 2019.
[4] "Fair k-Centers via Maximum Matching," In ICML, 2020.

25.

Our Techniques
Procedure (r is a guess of r*_{k,z,m}):
1. S ← a random point ∈ P.
2. T ← {p | p ∈ P, dist(p, S) > 2r}. If |T| > (1 + ε)z, add a random point ∈ T to S and repeat; else break. During these steps, build/update a cluster-group graph (if a group belongs to a cluster, the cluster's center has an edge to this group).
3. {(a, i), …, (b, j)} ← solve the max-flow problem on the network s → centers → groups → t, where the group-side capacities are k_1, …, k_m.
4. Replace the cluster centers by using the above result.
Note:
• Step 2 guarantees a 2(1 + γ)-approximation (without the fairness constraint).
• Step 3 guarantees the fairness constraint.
• Step 4 guarantees a 3(1 + γ)-approximation (from the triangle inequality).
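Steps 1-2 above (random center, discard everything within 2r, stop once at most (1 + ε)z points remain) can be sketched as follows; the max-flow fairness repair of steps 3-4 is omitted, and all parameter values are illustrative:

```python
import math
import random

def pick_centers(P, r, eps, z, k, rng=random):
    """Steps 1-2 only: repeatedly pick a random still-uncovered point as a
    center and drop every point within distance 2r of it. Stop once at most
    (1 + eps) * z points remain (they become outliers) or k centers exist."""
    S, T = [], list(P)
    while T and len(S) < k and len(T) > (1 + eps) * z:
        c = rng.choice(T)                               # random uncovered point
        S.append(c)
        T = [p for p in T if math.dist(p, c) > 2 * r]   # still uncovered

    return S, T                                         # centers, uncovered points

P = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (100, 100)]    # (100, 100): outlier
S, uncovered = pick_centers(P, r=0.5, eps=0.0, z=1, k=2)
```
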

26.

Theoretical Results
Main theorem (what is possible): there is an O(nk)-time algorithm that needs O(n) space and yields a (3 + γ)-approximate result for the fair (k, (1 + ε)z)-center clustering problem with probability at least (1 − z/n)(ε/(1 + ε))^{k−1}, by returning exactly k centers and removing at most (1 + ε)z points.
Proposition (what is impossible): for any S such that |S| ≤ k and S contains at least one outlier, it is not guaranteed that
• we have a worst-case approximation bound on r*_{k,z,m}, and
• we can fix S so that it certainly satisfies |S| = k, the fairness constraint, and a guaranteed approximation bound on r*_{k,z,m} in polynomial time, if P ≠ NP.

27.

Discussion
◼ How to guess r = r*_{k,z,m}?
FACT: r*_{k,z,1} ≤ r*_{k,z,m} ≤ r*_{k,0,m}
• r*_{k,z,1}: the optimal radius when there are no groups.
• r*_{k,0,m}: the optimal radius when there are no outliers.
Repeating r = r*_{k,z,1}, (1 + γ)r*_{k,z,1}, (1 + γ)²r*_{k,z,1}, …, r*_{k,0,m} needs O((β/γ) log n) trials.
◼ How to make the success probability a constant C?
1 − (1 − p)^t ≥ C, where t is the number of trials, so t = O((n/(n − z)) · ((1 + ε)/ε)^{k−1}) trials suffice.
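The geometric guessing of r described above can be sketched as follows; the two endpoints stand for r*_{k,z,1} and r*_{k,0,m}, which in practice come from the no-group and no-outlier subproblems:

```python
def guess_radii(r_lo, r_hi, gamma):
    """Candidate guesses r_lo, (1+gamma) r_lo, (1+gamma)^2 r_lo, ..., capped
    at r_hi: O(log(r_hi / r_lo) / gamma) trials for the radius guess r."""
    r = r_lo
    while r < r_hi:
        yield r
        r *= 1 + gamma
    yield r_hi                      # the upper bound is always tried last

radii = list(guess_radii(1.0, 10.0, gamma=1.0))   # 1, 2, 4, 8, then the cap 10
```
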

28.

Empirical Results
[Max radius] Our algorithms yield smaller errors.
[Running time] Ours-RS is generally faster than the competitors.
[Impact of k] Our algorithms remain superior to the competitors.

29.

Independent Range Sampling on Interval Data
Daichi Amagata, Osaka University

30.

Background: Range Search on Interval Data
• An interval x = [x.l, x.r]. • A set of n intervals X = {x_1, …, x_n}. • A query interval q = [q.l, q.r] finds {x | x ∈ X, x ∩ q ≠ ∅} = q ∩ X, i.e., "all" intervals ∈ X that overlap q.
Ex.1) Vehicle management systems: show vehicles that were active between 17:00 and 22:00 a week ago.
Ex.2) Book management systems: collect books that were borrowed in the last month.
Ex.3) Historical cryptocurrency databases: show when the price of Bitcoin (BTC) fell within [30,000, 40,000] dollars.
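The overlap condition used throughout reduces to one pair of comparisons; a minimal sketch with the hour-valued vehicle example (toy data assumed):

```python
def overlaps(x, q):
    """Closed intervals x = (l, r) and q = (l, r) overlap iff neither lies
    strictly to one side of the other."""
    return x[0] <= q[1] and q[0] <= x[1]

def range_search(X, q):
    """The 'all results' query: every interval of X that overlaps q (O(n) scan)."""
    return [x for x in X if overlaps(x, q)]

# Ex.1: vehicles active between 17:00 and 22:00, activity spans in hours.
X = [(16, 18), (19, 21), (23, 24)]
hits = range_search(X, (17, 22))   # the first two intervals overlap q
```
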

31.

Background: Issues Due to a "Too Large Result Set"
Database system view: a range search must incur Ω(|{x | x ∈ X, x ∩ q ≠ ∅}|) = O(n) time to obtain its result set. On a database X of a billion intervals, the result {x | x ∈ X, x ∩ q} may still hold 10,000,000 intervals!
Issue 1: not easy to analyze/visualize. Issue 2: slow response.

32.

Background: Range Sampling
A range sampling query randomly samples some intervals from q ∩ X.
Ex.1) Vehicle management systems: show vehicles that were active between 17:00 and 22:00 a week ago → show randomly sampled vehicles that were active between 17:00 and 22:00 a week ago.
This resolves Issue 1 (the samples are easy to analyze/visualize) and accelerates efficiency (if only the samples are obtained).

33.

Preliminary
Problem definition: IRS (Independent Range Sampling) on interval data.
Input: X (a set of n intervals), q (a query interval), and s (#samples).
Output: S (a set of s intervals, each of which is picked from q ∩ X uniformly at random).
First attempt: how about (1) running a range search algorithm (e.g., [1]) and then (2) doing random sampling?
Claim (no merit): the above algorithm incurs Ω(|q ∩ X|) = O(n) time.
[1] "HINT: A Hierarchical Index for Intervals in Main Memory," In SIGMOD, 2022.

34.

Contributions
Challenge: how to obtain s random samples without running a search algorithm?
Findings: we prove that there exist
• an Õ(s)-time and Õ(n)-space algorithm that picks s samples, each of which is obtained with probability 1/|X ∩ q|, and
• an Õ(s)-time and Õ(n)-space algorithm that picks s weighted samples, each of which (say x) is obtained with probability w(x)/Σ_{x_j ∈ q∩X} w(x_j).
Õ(·) hides any polylog factors.
Experiments:
• We demonstrate that our algorithms are much faster than baseline algorithms.
• Our implementation is available at https://github.com/amgt-d1/IRS-interval.

35.

Building Block of Our Technique
Interval tree: a balanced binary tree for interval data.
Structure:
• Each interval is maintained at the first node that stabs it (e.g., x_1 at u_root, x_3 at u_2).
• Each node u_i maintains a value c_i (its centroid) and two lists L^l_i and L^r_i of its intervals, sorted in ascending order of left endpoint and right endpoint, respectively (e.g., L^l_root = [x_1, x_6], L^r_root = [x_6, x_1]; L^l_1 = [x_2, x_4], L^r_1 = [x_2, x_4]).
Range search: traverse the overlapping nodes from u_root as follows.
Case 1: if q.r < c_i, traverse only the left child node (e.g., q_1 at u_root).
Case 2: if c_i < q.l, traverse only the right child node (e.g., q_1 at u_1).
Case 3: if q.l ≤ c_i ≤ q.r, traverse both child nodes (e.g., q_2 at u_2).
⊳ This approach needs O(n) time, because there is no bound for case 3.

36.

Main Idea 1
Exploiting the sorted lists of the interval tree: a binary search identifies the boundary of the sublist that overlaps q (e.g., in L^l_i = [x_1, x_2, x_3, x_4, x_5, x_6], the prefix x_1, …, x_4 overlaps).
⊳ Only O(log n) time for each node.
Range search traversal (as before): case 1 (q.r < c_i): left child only; case 2 (c_i < q.l): right child only; case 3 (q.l ≤ c_i ≤ q.r): both children.
⊳ The whole approach still needs O(n) time, because there is no bound for case 3.
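The per-node binary search of Main Idea 1 can be sketched with `bisect`: at a node whose centroid lies to the right of q (case 1), the overlapping intervals form exactly a prefix of the left-endpoint-sorted list. The concrete list and centroid below are assumptions for illustration:

```python
import bisect

def overlapping_prefix(Ll, q_r):
    """Case 1 (q.r < centroid): every interval stored at this node contains
    the centroid, so it overlaps q iff its left endpoint is <= q.r. That set
    is a prefix of L^l, and one O(log n) binary search finds its boundary."""
    lefts = [x[0] for x in Ll]                 # already sorted left endpoints
    return Ll[:bisect.bisect_right(lefts, q_r)]

# Intervals at one node (all contain the centroid 6.5), sorted by left endpoint.
Ll = [(1, 9), (2, 7), (4, 8), (6, 10)]
hits = overlapping_prefix(Ll, q_r=3)           # q = [q.l, 3] with 3 < 6.5
```
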

37.

Main Idea 2
Make case 3 happen at most once. If we can achieve this, we need only
• Cases 1 & 2: O(log n) nodes × O(log n) binary search = Õ(1), and
• Case 3: O(log n) × O(1) = Õ(1)
time to identify the sample candidates.
Question: how to access x_9 and x_10 (intervals deep in the subtrees) without incurring O(n) node accesses?

38.

New Data Structure
AIT: Augmented Interval Tree.
Difference from the interval tree: each node u_i additionally maintains AL^l_i and AL^r_i, the lists of all intervals existing in the subtree rooted at u_i, sorted in ascending order of left endpoint and right endpoint, respectively (e.g., L^l_1 = [x_2, x_4] but AL^l_1 = [x_8, x_2, x_4, x_9]; AL^l_2 = [x_3, x_10, x_5, x_7, x_11]).
In the running example, x_9 is found from AL^r_1 at u_1; x_3, x_10, and x_5 are found from AL^l_2 at u_2; x_1 and x_6 are found at u_root.
Merits of this augmentation:
• Theoretical time efficiency: we traverse at most log n (= height) nodes.
• Practical efficiency: when we meet case 3, its child nodes are the last ones to be traversed (i.e., no need to traverse their descendants).

39.

Our IRS Algorithm
Cases 1 & 2: identify the boundaries in L^l_i or L^r_i ⊳ O(log n) boundaries in total, because O(log n) nodes have these cases (the AIT's height is O(log n)).
Case 3: identify the boundaries in AL^l_j and AL^r_k ⊳ O(1) boundaries in total + early termination, because this case accesses at most two additional nodes.
⊳ Õ(1) time to identify the candidates.
Weighted sampling over the boundaries (each node has a different number of candidates) by Walker's alias method:
• O(log n) time to build an alias table, because of the O(log n) boundaries.
• O(1) time to pick one random sample.
⊳ O(log n + s) time to pick s samples.
(In the figure, this query accesses only u_root, u_1, and u_2.)
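Walker's alias method, used above to choose which node's candidate run a sample comes from, is standard; a self-contained sketch (the weights stand in for per-node candidate counts):

```python
import random

def build_alias(weights):
    """O(m) preprocessing for O(1) weighted draws over m outcomes."""
    m, total = len(weights), sum(weights)
    prob = [w * m / total for w in weights]    # scaled so the mean is 1
    alias = [0] * m
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:                     # pair underfull with overfull cells
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a cell, then a biased coin decides cell vs alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias([1, 3, 2])           # e.g., candidate counts per node
node = alias_draw(prob, alias)                 # P(node = i) proportional to weight i
```
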

40.

Theoretical Analysis
Theorem 1 (space complexity): the space complexity of our algorithm (i.e., of an AIT) is O(n log n) ⊳ larger than that of KDS [4] (i.e., O(n)).
Theorem 2 (time complexity): our algorithm picks s random samples in O(log² n + s) time ⊳ a lower time complexity than that of KDS (i.e., O(√n + s) expected).
Theorem 3 (correctness): our algorithm picks a random sample with probability 1/|q ∩ X| ⊳ a better success probability than that of KDS (i.e., 1/(α|q ∩ X|)).
[4] "Spatial Independent Range Sampling," In SIGMOD, 2021.

41.

How to Reduce the Space Complexity?
Definition 1 (virtual interval): given a subset X_i of X, the virtual interval v_i of X_i is defined by v_i.l = min_{x ∈ X_i} x.l and v_i.r = max_{x ∈ X_i} x.r.
Corollary 1: given a set V of virtual intervals such that |V| = Θ(n / log n), the space complexity of the AIT on V is O(n).
How to partition X (or how to alleviate false positives):
Challenge: a virtual interval overlapping q may contain intervals that do not overlap q. Ex.) q overlaps v_1, and q overlaps x_1 and x_2 but not x_3.
Our approach: pair-sort (sort by left endpoint and break ties by right endpoint) ⊳ this corresponds to drawing a z-curve.
Our partition: a disjoint and uniform partition based on the sort order.
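The pair-sort partition can be sketched directly: sort, cut the order into uniform blocks, and keep each block's hull as a virtual interval. The block size stands in for the Θ(log n) used to get |V| = Θ(n / log n):

```python
def build_virtual_intervals(X, block):
    """Pair-sort X (left endpoint, ties broken by right endpoint), cut the
    order into disjoint uniform blocks, and replace each block X_i by its
    virtual interval v_i = [min l, max r]."""
    order = sorted(X)                          # lexicographic = pair-sort
    V = [(min(x[0] for x in chunk), max(x[1] for x in chunk))
         for chunk in (order[i:i + block] for i in range(0, len(order), block))]
    return V, order

X = [(5, 9), (1, 4), (2, 3), (6, 7)]
V, order = build_virtual_intervals(X, block=2)   # V = [(1, 4), (5, 9)]
```
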

42.

How to Handle Weighted Intervals?
Why is the AIT inefficient in the weighted case (note: we may have w(x_i) ≠ w(x_j))? The cumulative sum of the weights of a sequence of overlapping intervals is NOT known without accessing these intervals (and if we access them, O(n) time is incurred).
AWIT: Augmented Weighted Interval Tree.
Differences from the AIT: each node u_i further maintains
• W^l_i and W^r_i, arrays maintaining the cumulative sums of the weights of the intervals in L^l_i and L^r_i, respectively, and
• AW^l_i and AW^r_i, arrays maintaining the cumulative sums of the weights of the intervals in AL^l_i and AL^r_i, respectively
(e.g., AW^l_2 = [w(x_3), w(x_3) + w(x_10), w(x_3) + w(x_10) + w(x_5), …]).
⊳ Only an additional O(log n) time for each sampling.
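The point of these cumulative-sum arrays can be sketched with plain prefix sums: once a contiguous run of overlapping intervals is known in a sorted list, its total weight is one subtraction and one weighted draw is one binary search. The run bounds and weights below are illustrative:

```python
import bisect
import itertools
import random

def weighted_pick(intervals, weights, lo, hi, rng=random):
    """Draw one interval from the run intervals[lo:hi] with probability
    proportional to its weight, using precomputed cumulative sums: no
    per-interval scan at query time (the AWIT stores such arrays per node)."""
    W = [0.0] + list(itertools.accumulate(weights))   # built offline
    r = W[lo] + rng.random() * (W[hi] - W[lo])        # random mass inside the run
    return intervals[bisect.bisect_right(W, r) - 1]   # interval owning that mass

xs = [(0, 2), (1, 5), (3, 8), (7, 9)]
ws = [1.0, 4.0, 2.0, 1.0]
x = weighted_pick(xs, ws, lo=1, hi=3)   # returns (1, 5) or (3, 8), odds 4:2
```
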

43.

Experiments: Datasets, Pre-processing Time, and Memory Usage
Diverse applications: book management, Bitcoin, and vehicles (train & taxi).
• The AIT needs more offline time and memory than the others, but both are still reasonable.
• AIT-V (the AIT on V) alleviates both costs.

44.

Experiments: Query Processing Time (Decomposed Time)
The AIT & AIT-V narrow the space where q ∩ X exists very quickly.
• The AIT picks 1,000 samples much faster than KDS.
• AIT-V is slower than the AIT because it incurs the membership checks "x ∈ q ∩ X".

45.

Experiments: Query Processing Time (Impact of Parameters)
The AIT & AIT-V are not affected by the range size and are less affected by the dataset size (n). The running times of the range sampling algorithms are linear in s.

46.

Experiments: Query Processing Time (Weighted Interval Case)
The AIT is less affected by the range size. The AWIT always outperforms the others and has a superior sampling efficiency.

47.

How to Deal With Updates?
Insertion case: batch processing.
• Insertion pool size: O(log² n).
• When an insertion is received: 1. add it to the pool; 2. if the pool size is exceeded, update the AIT by following the query processing algorithm (if we should go to a left or right child node that does not exist, we create it). (In the figure, a new interval is added to L_root.)
• Why batch? When updating the AIT, we need to sort the intervals in L_i and AL_i, and batching alleviates this sorting cost.
• When do we re-build the AIT? When the height of the AIT exceeds log n.
Deletion case: one by one.
• Simply delete the corresponding interval from L_i and AL_j.
• If a node is left with no intervals, the node is also removed.

48.

Summary
Problem: independent range sampling on interval data, which returns s intervals, each of which is picked from q ∩ X uniformly at random.
Findings: we prove that there exist
• an Õ(s)-time and Õ(n)-space algorithm that picks s samples, each of which is obtained with probability 1/|X ∩ q|, and
• an Õ(s)-time and Õ(n)-space algorithm that picks s weighted samples, each of which (say x) is obtained with probability w(x)/Σ_{x_j ∈ q∩X} w(x_j).
Experiments:
• We demonstrate that our algorithms are much faster than the baseline algorithms.
• Our implementation is available at https://github.com/amgt-d1/IRS-interval.

49.

Summary: What "Fairness" Means in This Field
Huge input X (a set of n data items) → output Y (until the early 2010s, n = a million was large; since the late 2010s, n = a billion is not unusual). Fairness ≒ equality.
Fairness in search: the probability of entering the solution is equal (the same for all qualifying data).
Fairness in clustering: group fairness (attributes are treated equally) and individual fairness (similar data have a cluster center nearby).