[20241216] PG-StromのApache Arrow対応とデータベース運用の勘所

PG-StromのApache Arrow対応とデータベース運用の勘所ヘテロＤＢ株式会社チーフアーキテクト兼 CEO 海外浩平 <[email protected]>

自己紹介／HeteroDB社についてヘテロジニアスコンピューティング技術をデータベース領域に適用し、誰もが使いやすく、安価で高速なデータ解析基盤を提供する。会社概要  商号  創業  拠点  事業内容ヘテロDB株式会社 2017年7月4日東京都品川区北品川高速データベース製品の販売 GPU＆DB領域の技術コンサルティング代表者プロフィール  海外浩平（KaiGai Kohei）  OSS開発者コミュニティにおいて、PostgreSQLやLinux kernelの開発に10年以上従事。主にセキュリティ・FDW等の分野でアップストリームへの貢献。  IPA未踏ソフト事業において“天才プログラマー”認定 (2006)  GPU Technology Conference Japan 2017でInception Awardを受賞 2 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

3.

製品概要 PG-Stromとは？ PG-Strom: GPUとNVMEの能力を最大限に引き出し、テラバイト級のデータを高速処理するPostgreSQL向け拡張モジュール off-loading App GPU ➢ GpuScan / GpuJoin / GpuPreAgg ➢ GPU-Direct SQL ➢ Apache Arrow support (with min/max statistics) ➢ Updatable GPU Cache ➢ Asymmetric Partition-wise JOIN/GROUP BY ➢ Large Tables GPU-JOINs ➢ BRIN-Index support ➢ PostGIS support (with GiST-index on GpuJoin) ➢ pg2Arrow / mysql2arrow / pcap2arrow 【機能】  集計／解析ワークロードの透過的なGPU高速化  SQLからGPUプログラムを自動生成し超並列実行  GPU-Direct SQLによるPCIeバスレベルのI/O最適化を行う  Apache Arrowに対応し、IoT/M2Mログデータの処理にフォーカス  PostGIS関数・演算子のサポートで、位置情報分析を高速化 3 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

4.

GPUとはどんなプロセッサなのか？主にHPC分野で実績があり、機械学習用途で爆発的に普及 NVIDIA A100 スーパーコンピュータ（東京工業大学 TSUBAME3.0）シミュレーションＣＧ（Computer Graphics）数千コアの並列処理ユニット、TB/sに達する広帯域メモリをワンチップに実装した半導体デバイス ➔ “同じ計算を大量のデータに並列実行” を最も得意とする 4 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所機械学習

5.

PG-Strom － GPUを用いてSQLを並列処理する実行結果 SQL パーサ PG-Strom拡張モジュール SQL オプティマイザ SQL エグゼキュータ GPUコード生成データの読み出し GPUカーネル実行トランザクション共有バッファ GPUデバイス管理各種関数・演算子プロセス間通信 GPUメモリ管理 PostgreSQLの各種インフラ 5 代替実行計画作成 GPUDirect SQL PG-Stromのインフラ機能 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

6.

PG-Stromの使いどころ大量データからの分析・検索・集計ワークロード高速化 Manufacturing Logistics 機械学習 Log Collector DBサーバ（OLTP） Home electronics ビジネスアプリケーション（バッチ処理） DBサーバ（OLAP） Transactions BIツール Commercial Applications DBサーバ（OLTP） 6 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

7.

[beta]

利用例、および実行計画への介入（1/2）
ssbm=# set pg_strom.enabled = off;

ssbm=# explain select sum(lo_revenue), d_year, p_brand1
from lineorder, date1, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12’
and s_region = 'AMERICA’
group by d_year, p_brand1
order by d_year, p_brand1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------Sort (cost=195347917.09..195347934.59 rows=7000 width=46)
Sort Key: date1.d_year, part.p_brand1
-> HashAggregate (cost=195347382.53..195347470.03 rows=7000 width=46)
Group Key: date1.d_year, part.p_brand1
-> Hash Join (cost=382023.91..194999578.74 rows=46373839 width=20)
Hash Cond: (lineorder.lo_orderdate = date1.d_datekey)
-> Hash Join (cost=381919.40..194877541.06 rows=46373839 width=20)
Hash Cond: (lineorder.lo_suppkey = supplier.s_suppkey)
-> Hash Join (cost=55258.00..190631734.29 rows=240001188 width=26)
Hash Cond: (lineorder.lo_partkey = part.p_partkey)
-> Seq Scan on lineorder (cost=0.00..174826364.96 rows=6000029696 width=20)
-> Hash (cost=54258.00..54258.00 rows=80000 width=14)
-> Seq Scan on part (cost=0.00..54258.00 rows=80000 width=14)
Filter: (p_category = 'MFGR#12'::bpchar)
-> Hash (cost=293591.14..293591.14 rows=2015701 width=6)
-> Seq Scan on supplier (cost=0.00..293591.14 rows=2015701 width=6)
Filter: (s_region = 'AMERICA'::bpchar)
-> Hash (cost=72.56..72.56 rows=2556 width=8)
-> Seq Scan on date1 (cost=0.00..72.56 rows=2556 width=8)

7

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

8.

[beta]

利用例、および実行計画への介入（2/2）
ssbm=# set pg_strom.enabled = on;
ssbm=# explain select sum(lo_revenue), d_year, p_brand1
from lineorder, date1, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12’
and s_region = 'AMERICA’
group by d_year, p_brand1
order by d_year, p_brand1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------Sort (cost=31074909.10..31074926.60 rows=7000 width=46)
Sort Key: date1.d_year, part.p_brand1
-> HashAggregate (cost=31074374.54..31074462.04 rows=7000 width=46)
Group Key: date1.d_year, part.p_brand1
-> Custom Scan (GpuPreAgg) on lineorder (cost=31074252.04..31074322.04 rows=7000 width=46)
GPU Projection: pgstrom.psum((lineorder.lo_revenue)::double precision), date1.d_year, part.p_brand1
GPU Join Quals [1]: (lineorder.lo_partkey = part.p_partkey) ... [nrows: 6000030000 -> 240001200]
GPU Outer Hash [1]: lineorder.lo_partkey
GPU Inner Hash [1]: part.p_partkey
GPU Join Quals [2]: (lineorder.lo_suppkey = supplier.s_suppkey) ... [nrows: 240001200 -> 46373840]
GPU Outer Hash [2]: lineorder.lo_suppkey
GPU Inner Hash [2]: supplier.s_suppkey
GPU Join Quals [3]: (lineorder.lo_orderdate = date1.d_datekey) ... [nrows: 46373840 -> 46373840]
GPU Outer Hash [3]: lineorder.lo_orderdate
GPU Inner Hash [3]: date1.d_datekey
GPU-Direct SQL: enabled (GPU-0)
-> Seq Scan on part (cost=0.00..54258.00 rows=80000 width=14)
Filter: (p_category = 'MFGR#12'::bpchar)
-> Seq Scan on supplier (cost=0.00..293591.14 rows=2015701 width=6)
Filter: (s_region = 'AMERICA'::bpchar)
-> Seq Scan on date1 (cost=0.00..72.56 rows=2556 width=8)
8

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

9.

中核機能：GPU-Direct SQL（1/3） Storage Block Read PCI-E Bus 大量の ”ゴミデータ” も含む Buffer Copy SCAN Buffer Copy JOIN GROUP BY 9 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

10.

中核機能：GPU-Direct SQL（2/3） P2P-DMAを利用し、NVME-SSDとGPUを直結してデータ転送 Storage Block Read by NVIDIA GPUDirect Storage SCAN P2P-DMA JOIN GROUP BY PCI-E Bus P2P-DMA : Peer-to-Peer Direct Memory Access 10 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

11.

中核機能：GPU-Direct SQL（3/3） Query Execution Throughput [MB/s] Star Schema Benchmark (SF=999; 875GB) CPU: AMD EPYC 7402P (24C; 2.8GHz), GPU: NVIDIA A100 [40GB; PCI-E], SSD: Intel D7-5510 (U.2; 3.84TB) x4 24,065 25,000 24,040 24,027 23,640 23,590 23,344 23,651 23,843 22,216 22,056 22,010 22,147 20,242 20,000 15,000 10,000 5,000 5,044 4,686 3,974 3,924 3,848 5,218 3,948 4,379 4,400 3,346 1,879 1,853 2,747 0 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 PostgreSQL v15.3 (Heap) Query Execution with GPU-Direct SQL Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3 PG-Strom (GPU-Direct SQL; Heap) Storage Read Throughput under the query execution [MB/s] 25,000 20,000 nvme3 15,000 nvme2 10,000 nvme1 Query Execution with Filesystem on PostgreSQL Heap Tables 5,000 nvme0 0 0 11 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

12.

背景課題（1/2）PostgreSQLのHeapデータ構造 Table Segment Block 8kB HeapTuple Tuple HeapTupleHeader nullmap 列A 列B 列D 列E 要る？ SELECT B FROM this_table WHERE E like ‘%hoge%’; 12 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

13.

背景課題（2/2）行の可視性チェックトランザクション① INSERT Tuple-A Tuple-B トランザクション② DELETE UPDATE トランザクション③ Tuple-C GPU-Direct SQL Tuple-D Tuple-E トランザクション④ xmin 行を追加したトランザクションID xmax 行を削除したトランザクションID cmin/cmax xmin/xmaxに付随するコマンドID セッション情報 Tuple 13 Tuple-F HeapTupleHeader nullmap 列A 列B 列D 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所列E

14.

PG-StromのApache Arrow対応について 14 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

15.

Apache Arrowとは（1/2） ▌特徴  列指向で分析用途向けに設計されたデータ形式  アプリケーションによらず、共通のデータ交換形式として利用可能  整数、実数、日付時刻、文字列など基本的なデータ型を定義 PostgreSQL / PG-Strom NVIDIA GPU 15 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

16.

Apache Arrowとは（2/2）– データ型のマッピング大半のデータ型はApache Arrow  PostgreSQLの間で変換可能補足説明 Apache Arrowデータ型 PostgreSQLデータ型 Int Int1, int2, int4, int8 FloatingPoint float2, float4, float8 Binary bytea Utf8 text Bool bool Decimal numeric Date date adjusted to unitsz = Day Time time adjusted to unitsz = MicroSecond Timestamp timestamp adjusted to unitsz = MicroSecond Interval interval List array types Struct composite types Union ------ FixedSizeBinary char(n) FixedSizeList array types? Map ------ 16 float2 is an enhancement of PG-Strom Only 1-dimensional array is supportable 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

17.

《背景技術》FDW (Foreign Data Wrapper) 外部テーブル（Foreign Table）－ PostgreSQL管理外のデータソースを、あたかもテーブルであるかのように取り扱うための機能 Foreign Table postgres_fdw Foreign Table file_fdw Foreign Table twitter_fdw PostgreSQL Table Foreign Table Arrow_fdw External RDBMS CSV Files Twitter (Web API) Arrow Files  FDWモジュールは、外部データ  PostgreSQLの相互変換に責任を持つ。  Arrow_Fdwの場合、ファイルシステム上の Arrow 形式ファイルを Foreign Table という形で参照できるようにマップする（≠ インポートする）。 ➔ 改めてデータをDBシステムにインポートする必要がない。 17 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

18.

Arrow_Fdw（1/3）物理的な（ファイル上の）レイアウト Apache Arrow File Header “ARROW1¥0¥0” Schema Definition RecordBatch: A列 NULL-Bitmap 長さの同じ配列を列ごとに寄せ集めたもの。 A列 Values-array B列 NULL-Bitmap Record Batch-k RecordBatch-1 RecordBatch-k B列 Values-index 被参照列のみのロードが容易 B列 Values-body C列 NULL-Bitmap C列 Values-array D列 NULL-Bitmap A列 B列 C列 D列 D列 Values-array RecordBatch-N Footer Schema Definition Custom Metadata Terminator “ARROW1” 18 Schema Definition: 列名やデータ型、属性が列の数だけ列挙されている。テーブル定義に相当する。 SELECT SUM(D) FROM f_arrow_tbl WHERE B LIKE ‘%abc%’; 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

19.

Arrow_Fdw（2/3） GPU Direct SQL機構を使って、被参照列だけを転送する。 NVMe SSD GPU クエリの被参照列のみ、ダイレクトデータ転送 WHERE-clause JOIN GROUP BY PCIe Bus Apache Arrow形式を解釈し、データを取り出せるよう GPUコード側での対応。 SSD-to-GPU P2P DMA Results 小規模の処理結果だけを PostgreSQLデータ形式で返す metadata ▌なぜApache Arrow形式が優れているのか？  被参照列のみ読み出すため、I/O量が少なくて済む  GPUメモリバスの特性から、プロセッサ実行効率が高い  Read-onlyデータなので、実行時のMVCC検査を行う必要がない 19 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

20.

Arrow_Fdw（3/3） 1,200 1,092.3 1,112.8 1,115.8 1,062.2 1,000 954.0 Query Execution Throughput [百万行/秒] 899.4 800 864.0 736.7 707.4 637.3 600 494.5 488.0 467.7 400 200 148.3 35.4 148.1 36.6 148.3 36.9 146.0 41.0 137.9 44.7 146.8 45.9 141.3 18.0 145.9 31.0 146.8 39.9 146.9 40.2 141.5 16.9 140.9 19.3 145.2 19.3 0 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 PostgreSQL v16 [Heap] Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 PG-Strom v5.3 [Heap] PG-Strom v5.3 [Arrow] Q4_1 Q4_2 ▌GPUに転送すべきデータ量の違いにより処理速度のブレが大きいものの、全体的に非常に高いデータ処理能力を発揮している。 ▌Arrowファイルの内容はSSBMのlineorderテーブルの複製（同一内容） 20 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所 Q4_3

21.

Arrow_Fdwによる外部テーブルの定義（1/3） postgres=# drop foreign table f_mytest ; DROP FOREIGN TABLE postgres=# CREATE FOREIGN TABLE f_mytest ( id int, ts timestamp, x float8, y text, z numeric ) SERVER arrow_fdw OPTIONS (file '/tmp/mytest.arrow'); CREATE FOREIGN TABLE postgres=# SELECT * FROM f_mytest; id | ts | x | y | z ----+----------------------------+--------------------+--------------+--------1 | 2022-08-02 00:34:44.210589 | 77.4344383633856 | 65ac7f6 | 38.6218 2 | 2019-05-11 03:06:16.353798 | 95.33235230265761 | 9105319395 | 51.8267 3 | 2015-01-04 11:02:25.66779 | 93.67415248121794 | b56930f7834 | 84.9033 : : : : : 24 | 2022-09-24 06:26:02.058316 | 17.668632938372266 | c5e35 | 55.9739 25 | 2016-08-08 18:16:12.248363 | 92.2211769466387 | fa889dd51692 | 19.246 (25 rows) 21 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

22.

Arrow_Fdwによる外部テーブルの定義（2/3） ▌構文 CREATE FOREIGN TABLE table_name ( colname_1 type_1, colname_2 type_2 OPTIONS (field ‘field_name_of_arrow’), : : ) SERVER arrow_fdw OPTIONS (file ‘/path/to/arror_file’); ▌オプション  file 1個の Arrow 形式ファイルを指定する  files 1個以上の Arrow 形式ファイルをカンマ区切りで指定する  dir 指定したディレクトリ配下のファイル全てをマップする。  pattern dirと併用し、ワイルドカードを含むファイル名のパターンを指定する。（Virtual Column機能については後述）  parallel_workers パラレルクエリを使用する際の並列度を指定する。  field 列をマップする Arrow ファイルのフィールド名を指定する。何も指定しないと、列名と同じフィールドをマップする。 ※ データ型はArrowのフィールドと互換性のある型でなければならない。 22 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

23.

Arrow_Fdwによる外部テーブルの定義（3/3） ▌構文 IMPORT FOREIGN SCHEMA table_name FROM SERVER arrow_fdw INTO public OPTIONS (file ‘/path/to/arrow_file’);  Arrowファイルのスキーマ定義情報から、PostgreSQLのテーブル定義を自動生成する。 postgres=# IMPORT FOREIGN SCHEMA f_mytest FROM SERVER arrow_fdw INTO public OPTIONS (file '/opt/arrow/mytest.arrow'); IMPORT FOREIGN SCHEMA postgres=# ¥d f_mytest Foreign table "public.f_mytest" Column | Type | Collation | Nullable | Default | FDW options --------+-----------------------------+-----------+----------+---------+------------id | integer | | | | ts | timestamp without time zone | | | | x | double precision | | | | y | text | | | | z | numeric | | | | Server: arrow_fdw FDW options: (file '/opt/arrow/mytest.arrow’) 23 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

24.

Arrowファイルの作り方（1/3） pg2arrow - PostgreSQLの問い合わせ結果をArrowとして保存する $ pg2arrow -d ssbm -o /opt/arrow/f_lineorder_1996.arrow --progress ¥ -c ‘SELECT * FROM lineorder WHERE lo_orderdate BETWEEN 19960101 ¥ AND 19961231’ worker:0 SQL=[SELECT * FROM lineorder WHERE lo_orderdate BETWEEN 19960101 AND 19961231] 2024-12-16 01:21:57 RecordBatch[0]: offset=1680 length=268436376 ¥ (meta=920, body=268435456) nitems=1303083 2024-12-16 01:22:01 RecordBatch[1]: offset=268438056 length=268436376 ¥ (meta=920, body=268435456) nitems=1303083 2024-12-16 01:22:03 RecordBatch[2]: offset=536874432 length=268436376 ¥ (meta=920, body=268435456) nitems=1303083 : ▌オプション接続するデータベース名を指定する  -o FILENAME 出力する Arrow ファイル名を指定する  --append FILENAME 追記する Arrow ファイル名を指定する  -c COMMAND 実行する SQL コマンドを指定する  -t TABLE_NAME -c ‘SELECT * FROM TABLE_NAME’ と同じ  -s SEGMENT_SIZE レコードバッチの区切りサイズを指定する  --stat=COLUMNS 統計情報を埋め込む列を指定する。 ➔ 後述  -d DATABASE 24 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

25.

[beta]

Arrowファイルの作り方（2/3）
複数同時にセッションを開き、Arrowファイルを並列作成できる
$ pg2arrow -d ssbm -o /opt/arrow/f_lineorder_para.arrow --progress -t lineorder -n 4
worker:0 SQL=[SELECT * FROM lineorder WHERE ctid < '(28706520,0)'::tid]
worker:1 SQL=[SELECT * FROM lineorder WHERE ctid >= '(28706520,0)'::tid ¥
AND ctid < '(57413040,0)'::tid]
worker:2 SQL=[SELECT * FROM lineorder WHERE ctid >= '(57413040,0)'::tid ¥
AND ctid < '(86119560,0)'::tid]
worker:3 SQL=[SELECT * FROM lineorder WHERE ctid >= '(86119560,0)'::tid]
2024-12-16 01:39:03 RecordBatch[0]: offset=1648 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:2
2024-12-16 01:39:03 RecordBatch[1]: offset=268438024 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:3
2024-12-16 01:39:03 RecordBatch[2]: offset=536874400 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:1
2024-12-16 01:39:03 RecordBatch[3]: offset=805310776 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:0
2024-12-16 01:39:07 RecordBatch[4]: offset=1073747152 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:2
2024-12-16 01:39:07 RecordBatch[5]: offset=1342183528 length=268436376 ¥
(meta=920, body=268435456) nitems=1303083 by worker:3
:

25

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

26.

[beta]

Arrowファイルの作り方（3/3）
ctidシステム列でスキャン範囲を指定できる。
ssbm=# explain SELECT * FROM lineorder WHERE ctid >= '(28706520,0)'::tid
AND ctid < '(57413040,0)'::tid;
QUERY PLAN
--------------------------------------------------------------------------------Tid Range Scan on lineorder (cost=0.01..43706589.35 rows=1500006535 width=107)
TID Cond: ((ctid >= '(28706520,0)'::tid) AND (ctid < '(57413040,0)'::tid))
(2 rows)
SELECT * FROM lineorder
WHERE ctid < '(28706520,0)'::tid

(0,0)

pg2arrow

pg2arrow

pg2arrow
pg2arrow

26

SELECT * FROM lineorder
WHERE ctid >= '(28706520,0)'::tid
AND ctid < '(57413040,0)'::tid
SELECT * FROM lineorder
WHERE ctid >= '(57413040,0)'::tid
AND ctid < '(86119560,0)'::tid

SELECT * FROM lineorder
WHERE ctid >= '(86119560,0)'::tid

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

(28706520,0)

(57413040,0)

(86119560,0)

27.

Apache Arrowにおけるmin/max統計情報（1/4）中身を読むまでもなく、明らかに検索条件にマッチしないブロックを読み飛ばす。 Apache Arrow File Header “ARROW1¥0¥0” Schema Definition RecordBatch: 長さの同じ配列を列ごとに寄せ集めたもの。 SELECT ymd, count(*) FROM arrow_table WHERE ymd <= 202103; RecordBatch-1 ymd最小値 <= 202103 ➔ true RecordBatch-1 RecordBatch-k ymd最小値 <= 202103 ➔ false RecordBatch-k RecordBatch-N Footer Schema Definition Custom Metadata Terminator “ARROW1” 27 各レコードバッチ毎に最小値・最大値をカスタムメタデータとして埋め込む事ができる。 RecordBatch-N ymd最小値 <= 202103 ➔ false 独自機能 Custom Metadata min_values=“202009,202107, ...” max_values=“202107,202111, ...” min/max統計情報に基づいて、検索条件にマッチする行を含むブロック（RecordBatch）だけを読み出すよう最適化が可能。 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

28.

[beta]

Apache Arrowにおけるmin/max統計情報（2/4）
明らかに検索条件に合致しない RecordBatch を読み飛ばしているのが分かる
ssbm=# EXPLAIN (analyze, costs off)
SELECT count(*), sum(lo_revenue)
FROM f_lineorder_sorted
WHERE lo_orderdate BETWEEN 19940401 AND 19940630
AND lo_orderpriority = '2-HIGH’;
QUERY PLAN
--------------------------------------------------------------------------------------------------Aggregate (actual time=334.706..334.708 rows=1 loops=1)
-> Custom Scan (GpuPreAgg) on f_lineorder_sorted (actual time=334.691..334.694 rows=2 loops=1)
GPU Projection: pgstrom.nrows(), pgstrom.psum((lo_revenue)::bigint)
GPU Scan Quals: ((lo_orderdate >= 19940401) AND (lo_orderdate <= 19940630) AND
(lo_orderpriority = '2-HIGH'::text)) ¥
[plan: 5999990000 -> 150000, exec: 232539843 -> 45396726]
referenced: lo_orderdate, lo_orderpriority, lo_revenue
Stats-Hint: (lo_orderdate >= 19940401), ¥
(lo_orderdate <= 19940630) [loaded: 77, skipped: 1910]
file0: /opt/arrow/f_lineorder_sorted.arrow (read: 109.52GB, size: 496.69GB)
GPU-Direct SQL: enabled (N=2,GPU0,1; direct=556444, ntuples=232539843)
Planning Time: 2.028 ms
Execution Time: 336.700 ms
検索条件にマッチしない事が明らかなら、
(10 rows)

そもそも読み出す必要もない。

28

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

29.

[beta]

Apache Arrowにおけるmin/max統計情報（3/4）
統計情報を含む Arrow ファイルの作成方法（--statオプション）
$ pg2arrow -d ssbm -o /opt/arrow/f_lineorder_sorted.arrow --progress ¥
-c ‘SELECT * FROM lineorder ORDER BY lo_orderdate’ ¥
--stat=lo_orderdate
:
$ pg2arrow --dump /opt/arrow/f_lineorder_sorted.arrow
[Footer]
{Footer: version=V4, schema={Schema: endianness=little, fields=[
{Field: name="lo_orderkey", nullable=true, type={Int64}, children=[], custom_metadata=[]},
{Field: name="lo_linenumber", nullable=true, type={Int32}, children=[], custom_metadata=[]},
{Field: name="lo_custkey", nullable=true, type={Int32}, children=[], custom_metadata=[]},
{Field: name="lo_partkey", nullable=true, type={Int32}, children=[], custom_metadata=[]},
{Field: name="lo_suppkey", nullable=true, type={Int32}, children=[], custom_metadata=[]},
{Field: name="lo_orderdate", nullable=true, type={Int32}, children=[],
custom_metadata=[{KeyValue: key="min_values" value="19920101,19920102,19920103,19920104,
19920105,19920107,19920108,...”},
{KeyValue: key="max_values" value="19920102,19920103,19920104,19920105,
19920107,19920108,19920109,...”}]},
{Field: name="lo_orderpriority", nullable=true, type={Utf8}, children=[], custom_metadata=[]},
:
:
:

▌ArrowのCustomMetadata機能を用いて、各フィールドのメタデータ中に、レコードバッチ毎の最小値、
最大値を埋め込んでいる。
▌これらの値を参照する事で、実際にデータをロードする前に、そのレコードバッチをロードすべきか
どうか、判別する事ができる。
29

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

30.

Apache Arrowにおけるmin/max統計情報（4/4）統計情報を利用するためにQ1_2クエリを書き換える。 =# select sum(lo_extendedprice*lo_discount) from f_lineorder_sorted where lo_orderdate between 19940100 and 19940199 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------96427226949654 (1 row) Time: 5549.456 ms (00:05.549) Time: 152.898 ms クエリ応答時間[ sec] =# select sum(lo_extendedprice*lo_discount) from f_lineorder, date1 where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------96427226949654 (1 row) 180.00 160.00 140.00 120.00 100.00 80.00 60.00 40.00 20.00 0.00 163.92 40.51 5.39 PostgreSQL v16 [Heap] PG-Strom v5.3 [Heap] PG-Strom v5.3 [Arrow] 0.153 PG-Strom v5.3 [Arrow min/max統計] 30 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

31.

Arrowテーブルを組み込んだデータベース構造（1/2）ログデータトランザクションデータ（PostgreSQL Heap形式） pg2arrow: SELECT * FROM my_table WHERE ymd >= ‘2023-01-01’ AND ymd < ‘2024-01-01’ 肥大化 mytable_2023.arrow mytable_2022.arrow テーブルパーティションすごく古いデータは、gzipなどで固めてバックアップ用ストレージへ 31 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所古いデータを Arrow形式に変換

32.

Arrowテーブルを組み込んだデータベース構造（2/2）習慣的に、ファイル名に “yyyymmdd” を付加したりしますよね？ $ ls -l /opt/arrow/monthly/ total 337074788 -rw-r--r--. 1 kaigai kaigai 967199318 Dec 14 00:17 f_lineorder_199301_AIR.arrow -rw-r--r--. 1 kaigai kaigai 966790870 Dec 14 00:18 f_lineorder_199301_FOB.arrow -rw-r--r--. 1 kaigai kaigai 978921494 Dec 14 00:19 f_lineorder_199301_MAIL.arrow -rw-r--r--. 1 kaigai kaigai 978374102 Dec 14 00:20 f_lineorder_199301_RAIL.arrow -rw-r--r--. 1 kaigai kaigai 978857494 Dec 14 00:22 f_lineorder_199301_SHIP.arrow -rw-r--r--. 1 kaigai kaigai 989774806 Dec 14 00:23 f_lineorder_199301_TRUCK.arrow -rw-r--r--. 1 kaigai kaigai 873871318 Dec 14 00:24 f_lineorder_199302_AIR.arrow -rw-r--r--. 1 kaigai kaigai 873760598 Dec 14 00:25 f_lineorder_199302_FOB.arrow -rw-r--r--. 1 kaigai kaigai 883919318 Dec 14 00:26 f_lineorder_199302_MAIL.arrow -rw-r--r--. 1 kaigai kaigai 884077078 Dec 14 00:27 f_lineorder_199302_RAIL.arrow -rw-r--r--. 1 kaigai kaigai 883877142 Dec 14 00:28 f_lineorder_199302_SHIP.arrow -rw-r--r--. 1 kaigai kaigai 893281622 Dec 14 00:30 f_lineorder_199302_TRUCK.arrow -rw-r--r--. 1 kaigai kaigai 967705878 Dec 14 00:31 f_lineorder_199303_AIR.arrow -rw-r--r--. 1 kaigai kaigai 967453078 Dec 14 00:32 f_lineorder_199303_FOB.arrow -rw-r--r--. 1 kaigai kaigai 978963094 Dec 14 00:33 f_lineorder_199303_MAIL.arrow -rw-r--r--. 1 kaigai kaigai 978954454 Dec 14 00:34 f_lineorder_199303_RAIL.arrow -rw-r--r--. 1 kaigai kaigai 978754966 Dec 14 00:36 f_lineorder_199303_SHIP.arrow -rw-r--r--. 1 kaigai kaigai 989493718 Dec 14 00:37 f_lineorder_199303_TRUCK.arrow -rw-r--r--. 1 kaigai kaigai 936397334 Dec 14 00:38 f_lineorder_199304_AIR.arrow -rw-r--r--. 1 kaigai kaigai 936229014 Dec 14 00:39 f_lineorder_199304_FOB.arrow -rw-r--r--. 1 kaigai kaigai 946560726 Dec 14 00:40 f_lineorder_199304_MAIL.arrow -rw-r--r--. 1 kaigai kaigai 947162198 Dec 14 00:42 f_lineorder_199304_RAIL.arrow -rw-r--r--. 1 kaigai kaigai 946814166 Dec 14 00:43 f_lineorder_199304_SHIP.arrow -rw-r--r--. 1 kaigai kaigai 957260310 Dec 14 00:44 f_lineorder_199304_TRUCK.arrow -rw-r--r--. 1 kaigai kaigai 967627350 Dec 14 00:45 f_lineorder_199305_AIR.arrow -rw-r--r--. 1 kaigai kaigai 967433046 Dec 14 00:46 f_lineorder_199305_FOB.arrow 32 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

33.

[beta]

Arrow_FdwのVirtual Column機構（1/4）
ssbm=# IMPORT FOREIGN SCHEMA f_lineorder_vcol FROM SERVER arrow_fdw INTO public
OPTIONS (dir '/opt/arrow/monthly', pattern 'f_lineorder_@{year_month}_${shipmode}.arrow');
IMPORT FOREIGN SCHEMA
ssbm=# ¥d f_lineorder_vcol
Foreign table "public.f_lineorder_vcol"
Column
|
Type
| Collation | Nullable | Default |
FDW options
--------------------+--------------+-----------+----------+---------+-----------------------lo_orderkey
| bigint
|
|
|
|
lo_linenumber
| integer
|
|
|
|
lo_custkey
| integer
|
|
|
|
lo_partkey
| integer
|
|
|
|
lo_suppkey
| integer
|
|
|
|
lo_orderdate
| integer
|
|
|
|
lo_orderpriority
| text
|
|
|
|
lo_shippriority
| character(1) |
|
|
|
lo_quantity
| integer
|
|
|
|
lo_extendedprice
| integer
|
|
|
|
lo_ordertotalprice | integer
|
|
|
|
実はArrowファイル上には存在していない列。
lo_discount
| integer
|
|
|
|
ファイル名の一部を列の値として読み出す
lo_revenue
| integer
|
|
|
|
事ができる。
lo_supplycost
| integer
|
|
|
|
@{xxxx} … 数値列（bigint）
lo_tax
| integer
|
|
|
|
${xxxx} … 文字列（text）
lo_commit_date
| text
|
|
|
|
lo_shipmode
| text
|
|
|
|
year_month
shipmode

| bigint
| text

|
|

|
|

|
|

| (virtual 'year_month')
| (virtual 'shipmode')

Server: arrow_fdw
FDW options: (dir '/opt/arrow/monthly', pattern 'f_lineorder_@{year_month}_${shipmode}.arrow')
33

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

34.

Arrow_FdwのVirtual Column機構（2/4） SSBM Q1_2をVirtual Column用に改修 =# select sum(lo_extendedprice*lo_discount) from f_lineorder, date1 where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------96427226949654 (1 row) =# select sum(lo_extendedprice*lo_discount) from f_lineorder_sorted where lo_orderdate between 19940100 and 19940199 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------96427226949654 (1 row) Time: 5549.456 ms (00:05.549) Time: 152.898 ms min/max統計情報を使った際は、Custom Metadataに統計情報が付加されている lo_orderdate 列を用いて絞り込みを行ったが、Virtual Columnを使用したケースでは、仮想列である year_month 列を用いて絞り込みを行っている。メリット： pg2arrowでなくとも使える。デメリット：ファイル名管理はユーザの責任。 ssbm=# select sum(lo_extendedprice*lo_discount) from f_lineorder_vcol where year_month = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; sum ---------------96427226949654 (1 row) Time: 142.665 ms 34 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

35.

[beta]

Arrow_FdwのVirtual Column機構（3/4）
min/max統計情報と同じ仕組みで、明らかにマッチしないRecord Batchを読み飛ばし
ssbm=# explain (analyze, costs off) select sum(lo_extendedprice*lo_discount)
from f_lineorder_vcol
where year_month = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;
QUERY PLAN
-------------------------------------------------------------------------------------------------Aggregate (actual time=123.326..123.328 rows=1 loops=1)
-> Custom Scan (GpuPreAgg) on f_lineorder_vcol (actual time=123.315..123.317 rows=2 loops=1)
GPU Projection: pgstrom.psum(((lo_extendedprice * lo_discount))::bigint)
GPU Scan Quals: ((lo_discount >= 4) AND (lo_discount <= 6) AND (lo_quantity >= 26) AND
(lo_quantity <= 35) AND (year_month = 199401))
[plan: 3914078000 -> 489, exec: 77300885 -> 4217587]
referenced: lo_quantity, lo_extendedprice, lo_discount, year_month
Stats-Hint: (year_month = 199401) [loaded: 28, skipped: 1416]
file0: /opt/arrow/monthly/f_lineorder_199411_FOB.arrow (read: 122.30MB, size: 892.79MB)
file1: /opt/arrow/monthly/f_lineorder_199406_AIR.arrow (read: 122.28MB, size: 892.68MB)
file2: /opt/arrow/monthly/f_lineorder_199709_FOB.arrow (read: 122.38MB, size: 893.37MB)
file3: /opt/arrow/monthly/f_lineorder_199401_RAIL.arrow (read: 126.33MB, size: 932.76MB)
file4: /opt/arrow/monthly/f_lineorder_199508_TRUCK.arrow (read: 126.39MB, size: 943.73MB)
:
:
:
:
file360: /opt/arrow/monthly/f_lineorder_199405_SHIP.arrow (read: 126.43MB, size: 933.48MB)
GPU-Direct SQL: enabled (N=2,GPU0,1; direct=113261, ntuples=77300885)
Planning Time: 4.772 ms
Execution Time: 128.046 ms

35

20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

36.

Arrow_FdwのVirtual Column機構（4/4） ▌どういった条件句で利用できるか  『仮想列 > 定数』のような大小比較演算  『仮想列 = 定数』のような等価演算 ※ min/max統計情報、virtual columnで共通 ▌今後、対応を拡大したいユースケース  仮想列 IN (値1, 値2, 値3, …) ➔ PostgreSQL内部的には、仮想列 = 値1 OR 仮想列 = 値2 OR 仮想列 = 値3 OR … と等価  Custom Metadataの特定のキー値を virtual column 扱いにする ➔ 例えば、Arrowファイルを生成したツールやデータソースなど 36 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

37.

まとめ ▌PG-Stromについて  GPU-Direct SQLを特徴とする、GPUを用いたPostgreSQLの高速化モジュール。  OLTPではなく、OLAPワークロードを対象とする。  PostgreSQL Heapデータ（行形式）とApache Arrowファイル（列形式）に対応 ▌Arrow_Fdwについて  Apache Arrow形式ファイルをPostgreSQLで読み出すためのFDWモジュール  ストレージから読み出すデータを削減し、可視性チェックを省略する事で大幅な高速化を達成する。 ▌pg2arrowについて  PG-Strom同梱の、PostgreSQLクエリ結果をArrow形式で保存するためのツール  パラレルクエリにも対応。テーブルの物理構造を意識した分割統治方式。  min/max統計情報を埋め込むことも可能 ▌Virtual Column機構  PG-Strom v5.3で追加予定の新機能。ファイル名の一部を仮想列として参照可能に  条件句次第では、特定のArrowファイルを「読み飛ばす」最適化が可能 37 20241216セミナー PG-StromのApache Arrow対応とデータベース運用の勘所

38.

オモシロ技術を、カタチにする。

[20241216] PG-StromのApache Arrow対応とデータベース運用の勘所

KaiGai Kohei

関連スライド

SQLを用いた大量データ分析～GPUによる高速化アプローチ～

PG-Strom v5.0　～変わったこと、できること～

PG-StromにおけるGPU PostGIS対応

学振特別研究員になるために～2025年度申請版

研究に使える便利なフリーソフト ImageJ

StampFlyで学ぶマルチコプタ制御

各ページのテキスト

[20241216] PG-StromのApache Arrow対応とデータベース運用の勘所

KaiGai Kohei

関連スライド

SQLを用いた大量データ分析～GPUによる高速化アプローチ～

PG-Strom v5.0 ～変わったこと、できること～

PG-StromにおけるGPU PostGIS対応

学振特別研究員になるために～2025年度申請版

研究に使える便利なフリーソフト ImageJ

StampFlyで学ぶマルチコプタ制御

各ページのテキスト

PG-Strom v5.0　～変わったこと、できること～