Practical! FPGA Development Seminar vol.21 (2023/04/26)


April 26, 23


Slide overview

This seminar covers Fixstars' own FPGA acceleration techniques, efficient development know-how, and hard-won lessons.

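<Contents>
1. Accelerating template matching with AI Engine
AI Engine is a next-generation processor developed by AMD. It can run close to 400 cores in parallel and in pipelined fashion, and as a many-core processor it is attracting attention for applications well beyond AI.
For this talk we used the VCK5000, a board that carries this processor, to accelerate template matching, a common image-processing task.
The session walks end to end through a brief architecture overview, mapping the algorithm onto the hardware, implementation and simulation, and evaluation on the actual board, introducing practical AI Engine development techniques.

2. Lightning Talks
1) An AI Engine application example: implementing an FFT
AI Engine has a new architecture unlike a CPU or GPU, and getting full performance from it requires algorithms and programming suited to it.
To deepen understanding of AI Engine's characteristics and usage, this talk uses the FFT, a representative compute-bound workload, to show concretely how to implement it on AI Engine.
2) Using generative AI in FPGA development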

・Fixstars tech blog: https://proc-cpuinfo.fixstars.com/

・Fixstars Group seminar list: https://www.fixstars.com/ja/seminar

・Fixstars FPGA system development: https://www.fixstars.com/ja/services/fpga

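Fixstars Corporation (@fixstars)

Fixstars is a specialist in software development that draws the maximum performance out of computers. It has development experience across a wide range of fields, including automotive, industrial equipment, finance, and medical, and it also invests in leading-edge technologies such as deep learning and machine learning. Using parallelization and optimization techniques, it provides services that accelerate software on a variety of hardware, including multi-core CPUs, GPUs, FPGAs, and quantum annealing machines. Drawing on hardware knowledge and optimization know-how built up over many years, it also develops high-accuracy, high-performance algorithms.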


Text of each slide

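What is AI Engine?
- A processor developed by AMD
- Power-efficient, with fast arithmetic
- Connected to the PL (FPGA fabric) over AXI4-Stream
→ Combining the FPGA with AI Engine enables even faster computation
* ACAP overview figure quoted from https://japan.xilinx.com/products/technology/ai-engine.html
Copyright© Fixstars Group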

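What kind of architecture is it?
- Equipped with a powerful vector unit
- Built as an AIE Array in which many AI Engine cores are connected
- Each core processes its data and passes the result on to the next core
→ An architecture that combines multi-core parallelism with pipelining; its applications are not limited to AI
* Figures quoted from https://japan.xilinx.com/products/technology/ai-engine.html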

10.

11.

12.

13.

14.

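Seven steps to the implementation
1. Implement the reference program
2. Implement the kernel that runs on AI Engine
3. Implement the AI Engine Graph
4. Simulate the AI Engine
5. Implement the PL that connects AI Engine to the host
6. Write the host program
7. Verify with HW Emulation or on actual hardware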

15.

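Eight steps to the implementation (the last step split in two)
1. Implement the reference program
2. Implement the kernel that runs on AI Engine
3. Implement the AI Engine Graph
4. Simulate the AI Engine
5. Implement the PL that connects AI Engine to the host
6. Write the host program
7. HW Emulation
8. Verify on actual hardware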

16.

Reference program implementation
Flow: load the source and template images → convert to grayscale → compute SAD → finish

for (int hi = 0; hi < SRC_H - TMP_H; ++hi) {
    for (int wi = 0; wi < SRC_W - TMP_W; ++wi) {
        // Cut out the window to be examined
        uint8_t *ins_img_gray = cut_img();
        // Run the SAD score function
        uint32_t score = sad_part(ins_img_gray, tmp_img_gray, TMP_SIZE);
        // A smaller SAD means a closer match, so keep the minimum
        best_score = std::min(score, best_score);
    }
}

17.

Overview of the SAD computation
Scans every pixel of the source window and the template image and accumulates the difference between the two.

uint32_t sad_part(const uint8_t *src, const uint8_t *tmp, const uint32_t TMP_SIZE) {
    uint32_t score = 0;
    for (uint32_t idx = 0; idx < TMP_SIZE; ++idx) {
        const uint8_t comp_src_val = src[idx];
        const uint8_t comp_tmp_val = tmp[idx];
        // Add |src - tmp|, always subtracting the smaller value from the larger
        if (comp_src_val >= comp_tmp_val)
            score += (comp_src_val - comp_tmp_val);
        else
            score += (comp_tmp_val - comp_src_val);
    }
    return score;
}
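To make the reference flow concrete, here is a minimal, self-contained sketch of SAD template matching on row-major 8-bit grayscale buffers. The names `Match`, `sad_at`, and `find_best_match` are illustrative and do not come from the slides. Unlike the loop bounds shown on the slide, this sketch also visits the last valid row and column of window positions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Position and score of the best window found (hypothetical helper type).
struct Match {
    int x;
    int y;
    uint32_t score;
};

// Sum of absolute differences between the template and the window of `src`
// whose top-left corner is (x, y). Both images are row-major grayscale.
uint32_t sad_at(const std::vector<uint8_t>& src, int src_w,
                const std::vector<uint8_t>& tmp, int tmp_w, int tmp_h,
                int x, int y) {
    uint32_t score = 0;
    for (int r = 0; r < tmp_h; ++r) {
        for (int c = 0; c < tmp_w; ++c) {
            const std::size_t si = static_cast<std::size_t>(y + r) * src_w + (x + c);
            const std::size_t ti = static_cast<std::size_t>(r) * tmp_w + c;
            const int s = src[si];
            const int t = tmp[ti];
            score += static_cast<uint32_t>(s > t ? s - t : t - s);
        }
    }
    return score;
}

// Exhaustive search: the window with the smallest SAD is the best match.
Match find_best_match(const std::vector<uint8_t>& src, int src_w, int src_h,
                      const std::vector<uint8_t>& tmp, int tmp_w, int tmp_h) {
    Match best{-1, -1, std::numeric_limits<uint32_t>::max()};
    for (int y = 0; y + tmp_h <= src_h; ++y) {
        for (int x = 0; x + tmp_w <= src_w; ++x) {
            const uint32_t score = sad_at(src, src_w, tmp, tmp_w, tmp_h, x, y);
            if (score < best.score) best = Match{x, y, score};
        }
    }
    return best;
}
```

A score of 0 means the window is pixel-identical to the template.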

18.

Eight steps to the implementation
1. Implement the reference program
2. Implement the kernel that runs on AI Engine
3. Implement the AI Engine Graph
4. Simulate the AI Engine
5. Implement the PL that connects AI Engine to the host
6. Write the host program
7. HW Emulation
8. Verify on actual hardware

19.

20.

Mapping to AIE (input part)
void compute_krnl(input_stream<data_t>* dat, output_stream<uint32_t>* out) {
    // Read the template data first
    aie::vector<data_t, READ_BYTE_SIZE> tmp_vecs[READ_NUM];
    for (itr_t i = 0; i < READ_NUM; ++i)
        tmp_vecs[i] = readincr_v<READ_BYTE_SIZE>(dat);
    // Then read each cut-out source window and compute its difference
    while (true) {
        uint32_t sad_val = 0;
        for (itr_t bi = 0; bi < READ_NUM; bi++) {
            aie::vector<data_t, READ_BYTE_SIZE> src_vec = readincr_v<READ_BYTE_SIZE>(dat);
            const uint32_t score = compute_score(src_vec, tmp_vecs[bi]);
            sad_val += score;
        }
        // Output once as many pixels as the template holds have been processed
        writeincr(out, sad_val);
    }
}
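The stream protocol this kernel implies (template first, then one SAD value per template-sized block) can be checked on the host with a plain C++ model. This is a sketch under that reading of the slide. `model_compute_krnl` is a hypothetical helper, and it works on a byte vector rather than on AI Engine streams.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of the kernel's stream protocol (hypothetical helper).
// The first `tmp_size` bytes of `stream` are the template; every following
// complete block of `tmp_size` bytes is one window and yields one SAD value.
std::vector<uint32_t> model_compute_krnl(const std::vector<uint8_t>& stream,
                                         std::size_t tmp_size) {
    std::vector<uint32_t> out;
    if (tmp_size == 0 || stream.size() < tmp_size) return out;
    for (std::size_t base = tmp_size; base + tmp_size <= stream.size(); base += tmp_size) {
        uint32_t sad = 0;
        for (std::size_t i = 0; i < tmp_size; ++i) {
            const int s = stream[base + i];  // window pixel
            const int t = stream[i];         // template pixel
            sad += static_cast<uint32_t>(s > t ? s - t : t - s);
        }
        out.push_back(sad);
    }
    return out;
}
```

Comparing this model's output with the simulator output for the same input stream is one way to validate the kernel before moving on.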

21.

22.

Mapping to AIE (computing the pixel differences)
uint32_t compute_score(
    aie::vector<data_t, READ_BYTE_SIZE> src_vec,
    aie::vector<data_t, READ_BYTE_SIZE> tmp_vec
) {
    // Get a mask of which side is larger for each pair of pixels
    auto msk_lt = aie::lt(src_vec, tmp_vec);
    // Split each pair of corresponding pixels into the larger and the smaller value
    aie::vector<data_t, READ_BYTE_SIZE> grt_vec = aie::select(src_vec, tmp_vec, msk_lt);
    aie::vector<data_t, READ_BYTE_SIZE> les_vec = aie::select(tmp_vec, src_vec, msk_lt);
    uint32_t score = 0;
    // Subtract the smaller value from the larger one
    aie::vector<data_t, READ_BYTE_SIZE> sub_vec = aie::sub(grt_vec, les_vec);
    // Sum the differences with a reduce
    aie::vector<reduce_t, READ_BYTE_SIZE/2> red_vec; // 16-bit array used for the reduce
    for (itr_t wi = 0; wi < READ_BYTE_SIZE/2; ++wi)
        red_vec[wi] = sub_vec[wi] + sub_vec[wi + READ_BYTE_SIZE / 2];
    score = aie::reduce_add(red_vec);
    return score;
}
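The same lane-wise steps can be written in scalar C++ to check the logic off-device. This is only a model of the slide's code, not AI Engine code: `compute_score_model` is a hypothetical name and `kLanes = 32` is an assumed stand-in for `READ_BYTE_SIZE`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 32;  // assumed lane count (stand-in for READ_BYTE_SIZE)

// Scalar model of the vector kernel: compare, select larger/smaller, subtract,
// fold the two halves into 16-bit lanes, then add all lanes together.
uint32_t compute_score_model(const std::array<uint8_t, kLanes>& src,
                             const std::array<uint8_t, kLanes>& tmp) {
    std::array<uint8_t, kLanes> sub{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        const bool lt = src[i] < tmp[i];            // mask: source is smaller
        const uint8_t grt = lt ? tmp[i] : src[i];   // larger of the pair
        const uint8_t les = lt ? src[i] : tmp[i];   // smaller of the pair
        sub[i] = static_cast<uint8_t>(grt - les);   // never underflows
    }
    // Folding into 16-bit lanes keeps each partial sum (at most 510) from wrapping.
    std::array<uint16_t, kLanes / 2> red{};
    for (std::size_t i = 0; i < kLanes / 2; ++i)
        red[i] = static_cast<uint16_t>(sub[i] + sub[i + kLanes / 2]);
    uint32_t score = 0;
    for (const uint16_t v : red) score += v;
    return score;
}
```

Selecting the larger and smaller value before subtracting is what lets the whole computation stay in unsigned 8-bit lanes.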

23.

24.

25.

Build the flow by connecting kernels in an AI Engine Graph
class compute_graph : public adf::graph {
private:
    adf::kernel krnl;            // kernel
public:
    adf::port<input> dat_in;     // input port
    adf::port<output> res_out;   // output port

    compute_graph() {
        krnl = adf::kernel::create(compute_krnl);
        adf::source(krnl) = "../src/compute_krnl.cpp"; // load the kernel source
        // input → kernel → output
        adf::connect<adf::stream>(dat_in, krnl.in[0]);
        adf::connect<adf::stream>(krnl.out[0], res_out);
        adf::runtime<adf::ratio>(krnl) = 1.0;
    };
};

26.

Build the flow by connecting kernels in an AI Engine Graph (reduction tree)
// Stage 0
for (itr_t ii = 0; ii < PARALLEL_NUM / 2; ++ii) {
    adf::connect<adf::stream>(reduce_in[ii * 2], krnl[ii].in[0]);
    adf::connect<adf::stream>(reduce_in[ii * 2 + 1], krnl[ii].in[1]);
}
// Middle stages
uint32_t ofs = 0;
for (itr_t pi = PARALLEL_NUM / 2; pi > 1; pi = pi >> 1) {
    for (itr_t ii = 0; ii < pi; ++ii)
        adf::connect<adf::stream>(krnl[ii + ofs].out[0], krnl[ii / 2 + ofs + pi].in[ii % 2]);
    ofs += pi;
}
// Final stage
adf::connect<adf::stream>(krnl[PARALLEL_NUM - 1 - 1].out[0], reduce_out);
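The index arithmetic in the middle-stage loop is easy to get wrong, so it can help to list the connections it generates. The sketch below reproduces only the loop's indexing in plain C++. `Edge` and `reduction_edges` are hypothetical names, and `parallel_num` is assumed to be a power of two of at least 2.

```cpp
#include <vector>

// One kernel-to-kernel connection in the reduction tree (hypothetical type).
struct Edge {
    int from_kernel;
    int to_kernel;
    int to_port;
};

// Enumerate the middle-stage connections using the same index arithmetic as
// the graph code. Assumes parallel_num is a power of two and at least 2.
std::vector<Edge> reduction_edges(int parallel_num) {
    std::vector<Edge> edges;
    int ofs = 0;
    for (int pi = parallel_num / 2; pi > 1; pi >>= 1) {
        for (int ii = 0; ii < pi; ++ii)
            edges.push_back(Edge{ii + ofs, ii / 2 + ofs + pi, ii % 2});
        ofs += pi;
    }
    return edges;
}
```

For `parallel_num = 8` this yields kernels 0 to 3 feeding kernels 4 and 5, which feed kernel 6. That is 7 kernels in total, and the last one has index `PARALLEL_NUM - 2`, matching the final-stage connection.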

27.

Eight steps to the implementation
1. Implement the reference program
2. Implement the kernel that runs on AI Engine
3. Implement the AI Engine Graph
4. Simulate the AI Engine
5. Implement the PL that connects AI Engine to the host
6. Write the host program
7. HW Emulation
8. Verify on actual hardware

28.

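Simulating the AI Engine "alone"
Just by preparing an input file for each port, you can simulate the AI Engine part on its own (only this part!).
input_0.txt (each value is 8 bit × 4 = 32 bit):
1923201977 3648631153 3015563216 2931403443 3535452630 1352627154 1599623816 1095908160 ……
Passing the tests at this point lets you move on to the rest of the implementation with confidence.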
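A small helper can generate such an input file from pixel data. The sketch below packs four 8-bit pixels into each 32-bit value and emits decimal text. The byte order (first pixel in the low byte) and the one-value-per-line layout are assumptions, not something the slide states, and `pack4` and `make_input_text` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack four 8-bit pixels into one 32-bit word.
// Assumption: the first pixel goes into the least significant byte.
uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3) {
    return static_cast<uint32_t>(p0) |
           (static_cast<uint32_t>(p1) << 8) |
           (static_cast<uint32_t>(p2) << 16) |
           (static_cast<uint32_t>(p3) << 24);
}

// Render the pixels as decimal 32-bit words, one per line (assumed layout).
// Trailing pixels that do not fill a whole word are ignored.
std::string make_input_text(const std::vector<uint8_t>& pixels) {
    std::string text;
    for (std::size_t i = 0; i + 4 <= pixels.size(); i += 4) {
        text += std::to_string(pack4(pixels[i], pixels[i + 1], pixels[i + 2], pixels[i + 3]));
        text += '\n';
    }
    return text;
}
```

Writing the returned string to a file such as `input_0.txt` gives the simulator its stimulus. Check the expected format against the tool documentation before relying on it.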

29.

30.

31.

32.

Implementing the PL that connects AI Engine to the host
// Memory-mapped to stream: read `size` 32-bit words from memory and push them to the stream
void mm2s(ap_uint<32>* mem, hls::stream<ap_axiu<32, 0, 0, 0>>& str, int size) {
    for (int i = 0; i < size; i++) {
        ap_axiu<32, 0, 0, 0> x;
        x.data = mem[i];
        x.keep = -1; // set all of the byte-enable flags
        str.write(x);
    }
}
// Stream to memory-mapped: pop `size` 32-bit words from the stream and store them to memory
void s2mm(ap_uint<32>* mem, hls::stream<ap_axiu<32, 0, 0, 0>>& str, int size) {
    for (int i = 0; i < size; i++) {
        auto x = str.read();
        mem[i] = x.data;
    }
}

33.

34.

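What do you add to the reference C++ implementation to use AI Engine?
1. Open the device
2. Load the xclbin onto the device
3. Create the kernels
4. Create the buffer objects
5. Map the host-side buffer pointers into user space
6. Write data to the buffers and sync their contents from host to device
7. Start the kernels and wait for completion
8. Sync the buffer contents from device to host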

35.

1 & 2. Open the device and load the xclbin
const int device_index = 0;
const std::string xclbin_file = argv[1];
auto device = xrt::device(device_index);      // open the device
auto uuid = device.load_xclbin(xclbin_file);  // load the xclbin onto the device
You only need to pass arguments to the functions provided.

36.

3. Create the kernels
// Create PARALLEL_NUM send kernels
std::vector<xrt::kernel> mm2s_krnls;
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++) {
    mm2s_krnls.push_back(
        xrt::kernel(device, uuid, "mm2s:{mm2s_" + std::to_string(pi + 1) + "}")
    );
}
// Create the receive kernel
auto s2mm_krnl = xrt::kernel(device, uuid, "s2mm:{s2mm_1}");

This takes little more than calling the functions provided.

37.

4. Create the buffer objects
// Create PARALLEL_NUM send buffers
std::vector<xrt::bo> mm2s_bos;
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++) {
    mm2s_bos.push_back(xrt::bo(device, sizeof(uint8_t) * IMG_SIZE / PARALLEL_NUM,
                               mm2s_krnls[pi].group_id(0)));
}
// Create the receive buffer object
auto s2mm_bo = xrt::bo(device, sizeof(uint32_t) * TRIAL_CNT,
                       s2mm_krnl.group_id(0));

You only need to specify the size you want in bytes.

38.

5. Map the host-side buffer pointers into user space
// Map the PARALLEL_NUM send buffers
std::vector<uint8_t *> mm2s_mapd_bufs;
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++) {
    mm2s_mapd_bufs.push_back(mm2s_bos[pi].map<uint8_t *>());
}
// Map the receive buffer object
auto s2mm_mapd_buf = s2mm_bo.map<uint32_t *>();

You only need to call the buffer object's map function.

39.

6. Write data to the buffers and sync their contents from host to device
// Write the data into the mapped send buffers
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++) {
    const uint32_t SIZE_PER_UNIT = IMG_SIZE / PARALLEL_NUM;
    const uint32_t OFS = SIZE_PER_UNIT * pi;
    for (int bi = 0; bi < SIZE_PER_UNIT; bi++) {
        mm2s_mapd_bufs[pi][bi] = tmp_img_gray[bi + OFS];
    }
}
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++)
    mm2s_bos[pi].sync(XCL_BO_SYNC_BO_TO_DEVICE); // sync
You only need to write the data and call the sync function.
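The offset arithmetic used to split one image across the send buffers can be tried out without any device. A minimal sketch, using `std::vector` in place of the mapped XRT buffers; `split_for_channels` is a hypothetical helper, and any remainder bytes are simply dropped, as in the slide's integer division.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Split `img` into `parallel_num` equally sized contiguous chunks, mirroring
// SIZE_PER_UNIT and OFS in the host code. Remainder bytes are dropped.
std::vector<std::vector<uint8_t>> split_for_channels(const std::vector<uint8_t>& img,
                                                     std::size_t parallel_num) {
    std::vector<std::vector<uint8_t>> chunks;
    if (parallel_num == 0) return chunks;
    const std::size_t size_per_unit = img.size() / parallel_num;
    for (std::size_t pi = 0; pi < parallel_num; ++pi) {
        const auto first = img.begin() + static_cast<std::ptrdiff_t>(size_per_unit * pi);
        chunks.emplace_back(first, first + static_cast<std::ptrdiff_t>(size_per_unit));
    }
    return chunks;
}
```

Each chunk corresponds to what one `mm2s` channel would stream to the AI Engine.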

40.

7. Start the kernels and wait for completion
std::vector<xrt::run> mm2s_runs;
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++) {
    // The last argument is the number of 32-bit words to transfer (bytes / 4)
    mm2s_runs.push_back(
        mm2s_krnls[pi](mm2s_bos[pi], nullptr, sizeof(uint8_t) * IMG_SIZE / PARALLEL_NUM / 4));
}
// Wait for the kernels to finish
for (itr_t pi = 0; pi < PARALLEL_NUM; pi++)
    mm2s_runs[pi].wait();

You only need to pass the kernel created in step 3 a buffer object and "how many times to run".
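Because the data mover handles 32-bit words, the count passed to each kernel is the channel's byte count divided by 4. A small sketch of that arithmetic, with a check that nothing is silently truncated; `words_per_channel` is a hypothetical helper, not part of XRT.

```cpp
#include <cstddef>
#include <stdexcept>

// Number of 32-bit words each mm2s channel has to transfer.
// Throws if the image cannot be split evenly into whole words per channel.
std::size_t words_per_channel(std::size_t img_bytes, std::size_t parallel_num) {
    if (parallel_num == 0 || img_bytes % (parallel_num * 4) != 0)
        throw std::invalid_argument("image size must be a multiple of 4 * parallel_num bytes");
    return img_bytes / parallel_num / 4;
}
```

For example, a 1024-byte image split across 4 channels gives 64 words per channel.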

41.

AIE version of the reference program
Flow: load the source and template images → convert to grayscale → compute SAD → finish

[Prepare AIE]
[Send the template image]
[Start s2mm receive]
for (int hi = 0; hi < SRC_H - TMP_H; ++hi) {
    for (int wi = 0; wi < SRC_W - TMP_W; ++wi) {
        [mm2s send (one window at a time)]
    }
}
[Wait for s2mm receive]

42.

43.

44.

45.

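Improvement: reduce the number of host → AIE transfers
Flow: load the source and template images → convert to grayscale → compute SAD → finish

[Prepare AIE]
[Send the template image]
for (int hi = 0; hi < SRC_H - TMP_H; ++hi) {
    for (int wi = 0; wi < SRC_W - TMP_W; ++wi) {
        [Store the source window to be examined]
    }
}
[Start s2mm receive]
[mm2s send (everything at once)]
[Wait for s2mm receive]

AIE: 650 ms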
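The "store every window, then send once" idea amounts to concatenating all windows into one contiguous buffer before a single transfer. A minimal sketch under that reading; `gather_windows` is a hypothetical helper, the images are assumed to be row-major grayscale, and the loop bounds follow the slide as written.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenate every template-sized window of `src` into one buffer so that it
// can be handed to the device in a single transfer instead of one per window.
std::vector<uint8_t> gather_windows(const std::vector<uint8_t>& src, int src_w, int src_h,
                                    int tmp_w, int tmp_h) {
    std::vector<uint8_t> all;
    for (int hi = 0; hi < src_h - tmp_h; ++hi) {
        for (int wi = 0; wi < src_w - tmp_w; ++wi) {
            for (int r = 0; r < tmp_h; ++r) {
                for (int c = 0; c < tmp_w; ++c) {
                    const std::size_t idx =
                        static_cast<std::size_t>(hi + r) * src_w + (wi + c);
                    all.push_back(src[idx]);
                }
            }
        }
    }
    return all;
}
```

The buffer grows to (number of windows) × (template size) bytes, so this trades host memory for fewer transfers.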

46.

47.

48.

49.

50.

51.

52.

53.

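AI Engine FFT: background and goal
● Versal AI Engine
  ○ Useful for compute-heavy applications
  ○ To get its full performance:
    ■ Algorithms and programming suited to AI Engine are required
    ■ An understanding of the AI Engine architecture and characteristics is required
● FFT
  ○ One of the representative compute-heavy applications
    ■ An AI Engine FFT is already available through the DSP Library
This talk shares the results of trying out a familiar FFT implementation as a way to deepen our understanding of AI Engine.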

54.

55.

Single-kernel FFT implementation
● Data type: single-precision complex
  ○ Use the cfloat type
    ■ The real and imaginary parts are accessed with .real and .imag
    ⇒ Complex arithmetic is easy to write
● AI Engine memory capacity
  ○ 32 KByte of data memory per core
  ○ The memory of 4 neighboring cores can be shared, 128 KByte in total
● FFT size: 1024 points
  ○ Input, output, and intermediate buffers are needed (about 4)
  ○ 8 Byte (= sizeof(cfloat)) x 1024 points x ~4 buffers ≲ 32 KByte (exactly 32 KByte with 4 full-size buffers)
  * Using the memory of neighboring cores may allow larger sizes (not tried)
Reference URL: https://docs.xilinx.com/r/ja-JP/am009-versal-ai-engine/AI-エンジンのメモリ-モジュール
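The buffer budget above is simple arithmetic, and a compile-time helper makes it easy to try other sizes. This is a sketch: `fft_buffer_bytes` and `fits_in` are hypothetical names, and the 8-byte element size is the `sizeof(cfloat)` figure quoted on the slide.

```cpp
#include <cstddef>

constexpr std::size_t kCfloatBytes = 8;  // sizeof(cfloat) as quoted on the slide

// Bytes needed for `buffers` buffers of `points` complex samples each.
constexpr std::size_t fft_buffer_bytes(std::size_t points, std::size_t buffers) {
    return kCfloatBytes * points * buffers;
}

// True when the buffers fit in a memory of `capacity_bytes`.
constexpr bool fits_in(std::size_t points, std::size_t buffers, std::size_t capacity_bytes) {
    return fft_buffer_bytes(points, buffers) <= capacity_bytes;
}

static_assert(fft_buffer_bytes(1024, 4) == 32 * 1024, "1024 points x 4 buffers is 32 KByte");
```

By this arithmetic, 2048 points with 4 buffers needs 64 KByte. That exceeds one core's 32 KByte but is within the 128 KByte reachable through neighboring cores.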

56.

57.

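Single-kernel FFT implementation
● Writing the compute part (butterfly operation)
  ○ AI Engine Intrinsics
    For faster, more efficient operation, use the Intrinsics
    https://www.xilinx.com/htmldocs/xilinx2022_2/aiengine_intrinsics/intrinsics/index.html
    ■ Use of the vector engine, pipelined operation, and so on
    ■ Intrinsics for FFT also exist
      However, we could not find an FFT operation that supports cfloat, and the specification is hard to follow and takes time to understand, so they were not used this time
  ○ AI Engine API
    A C++ header library with a higher level of abstraction than the Intrinsics
    https://www.xilinx.com/htmldocs/xilinx2022_2/aiengine_api/aie_api/doc/index.html
    ■ fft_dit is also available
      We tried to apply it but could not get the expected results and gave up
      We explored the parameters and the input/output data settings but could not work out the correct usage
Reference URL: https://japan.xilinx.com/developer/articles/aie-kernel-programming-vitis-aie-api.html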
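For reference, the butterfly that a plain scalar kernel would compute looks like the following radix-2 decimation-in-time FFT. This is a generic textbook sketch in standard C++, not the code from the talk. `std::complex<float>` stands in for the AI Engine `cfloat` type, and `fft_radix2` is a hypothetical name.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat_t = std::complex<float>;  // stand-in for the AI Engine cfloat type

// In-place radix-2 decimation-in-time FFT.
// Assumes data.size() is a power of two.
void fft_radix2(std::vector<cfloat_t>& data) {
    const std::size_t n = data.size();
    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }
    const float pi = std::acos(-1.0f);
    // One pass per stage; each pass applies n/2 butterflies.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const float ang = -2.0f * pi / static_cast<float>(len);
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cfloat_t w = std::polar(1.0f, ang * static_cast<float>(k));  // twiddle
                const cfloat_t a = data[start + k];
                const cfloat_t b = data[start + k + len / 2] * w;
                data[start + k] = a + b;            // butterfly: upper output
                data[start + k + len / 2] = a - b;  // butterfly: lower output
            }
        }
    }
}
```

A 1024-point transform runs 10 such stages. The inner butterfly is the part that the Intrinsics or the AI Engine API would vectorize.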

58.

59.

Multi-kernel FFT implementation
● Kernel partitioning: place one kernel per stage in series
  ○ Connect the kernels in series through shared memory
  ○ Throughput can be maintained even when data is fed in back to back
  ○ Each kernel repeats once per data point, an O(N) structure
Reference URL: https://docs.xilinx.com/r/ja-JP/am009-versal-ai-engine/共有メモリを使用した-AI-エンジンから-AI-エンジンへのデータ通信
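Why a chain of per-stage kernels helps only with back-to-back input can be seen from an idealized timing model. The sketch below assumes every stage takes the same number of cycles and that hand-off between kernels is free. Both are simplifications, and the function names are hypothetical.

```cpp
#include <cstdint>

// Cycles for one kernel that runs all stages of every frame in sequence.
uint64_t single_kernel_cycles(uint64_t frames, uint64_t stages, uint64_t cycles_per_stage) {
    return frames * stages * cycles_per_stage;
}

// Cycles for a chain of `stages` kernels working as a pipeline: the first
// frame needs `stages` steps, and each further frame adds one more step.
uint64_t pipelined_cycles(uint64_t frames, uint64_t stages, uint64_t cycles_per_stage) {
    if (frames == 0) return 0;
    return (stages + frames - 1) * cycles_per_stage;
}
```

With 10 stages of 1,000 cycles, 100 frames take 1,000,000 cycles on a single kernel but 109,000 cycles in the pipeline. For a single frame, both take 10,000 cycles, so the benefit only appears once data is fed continuously.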

60.

61.

62.

63.

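Multi-kernel FFT implementation
● FFT 1000 point profiling result (run on aiesimulator)
  Total 5,714,000 cycles = 5.71 msec @ 1 GHz
  * Same as the single-kernel result: because data could not be fed in continuously, the benefit of splitting into kernels is not visible
  ↓ The throughput improvement needs to be confirmed on actual hardware
  - The kernels process the data in turn
  - They run with synchronization handled automatically (the kernel code is written independently and is only connected in the graph)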

64.

65.

Reference: implementation of the AMD sample design (XAPP1356)
● A block-by-block configurable fast Fourier transform implemented on AI Engine
  ○ A configuration in which the results of four 1024-point FFT kernels are gathered through shared memory
Reference URL: https://docs.xilinx.com/r/ja-JP/xapp1356-fft-ai-engine/複数の-AI-エンジンによる-FFT

66.

67.

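Summary
● Through implementing an FFT on AI Engine, we shared the structure and characteristics of AI Engine
  ○ How to optimize the compute part
  ○ The memory organization and how to connect kernels through shared memory
● We showed that multiple kernels can be made to work together with a relatively simple method
● The compute part is not yet sufficiently optimized, and it became clear that acceleration using the AI Engine API or Intrinsics is essential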

68.

69.

70.

https://investingnews.com/invest-in-openai-chatgpt/

71.

72.

What is ChatGPT?
ChatGPT (Chat Generative Pre-trained Transformer) is an AI chatbot released by OpenAI in November 2022. The original term Generative Pre-trained Transformer means "a pre-trained transformer capable of generation". It is built on OpenAI's GPT-3 family of language models and has been fine-tuned by transfer learning using both supervised learning and reinforcement learning.
Quoted from Wikipedia: https://ja.wikipedia.org/wiki/ChatGPT

73.

https://trends.google.co.jp/home

74.

75.

76.

(1) Use at the concept stage (cont.)
SH: I want to find the TOE open source for Intel FPGA. Please let me know the best OSS and its reason.
ChatGPT: The TOE (TCP Offload Engine) open source for Intel FPGA that I would recommend is the TOE project maintained by the Open Programmable Acceleration Engine (OPAE) community. This project provides a full-featured, high-performance TCP/IP stack for FPGA-based NICs (network interface cards) and is optimized for use with Intel FPGAs.
The TOE project offers several benefits that make it a great choice for Intel FPGA users:
1. High-performance: The TOE project is designed to offer high-performance TCP/IP offloading capabilities for FPGA-based NICs, allowing for increased throughput and reduced CPU utilization.
2. Open source: The TOE project is open source, which means that users can modify and customize the code to suit their specific needs. This also means that the project is constantly being improved and updated by a community of contributors.
3. Easy to use: The TOE project is designed to be easy to use, with a straightforward API that allows developers to quickly integrate it into their applications.
4. Compatibility: The TOE project is compatible with a wide range of Intel FPGA devices and development boards, making it a versatile option for developers working on FPGA-based networking projects.
Overall, the TOE project is a solid choice for anyone looking to implement TCP/IP offloading on Intel FPGAs. Its high-performance, open-source nature, ease of use, and compatibility with a wide range of Intel FPGA devices make it a top contender in this space.

77.

(1) Use at the concept stage (cont.)
SH: Where can I find it? Give me the link to download the software.
ChatGPT: You can find the TOE project on the Open Programmable Acceleration Engine (OPAE) GitHub repository. Here's the link to the repository: https://github.com/OPAE/toe
You can download the latest version of the TOE project from the repository's releases page: https://github.com/OPAE/toe/releases
From there, you can download the source code or pre-built binaries for your platform. The repository also contains documentation, examples, and other resources to help you get started with the project.

78.

(1) Use at the concept stage (cont.)
SH: Do you have any alternatives?
ChatGPT: Yes, there are alternative TCP Offload Engine (TOE) open source projects available for Intel FPGAs. Here are a few options:
1. DPDK: The Data Plane Development Kit (DPDK) is an open source set of libraries and drivers for fast packet processing on Intel architectures, including FPGAs. DPDK provides a high-performance TCP/IP stack that can be used for TOE offloading.
2. NetCOPE: NetCOPE is an open source framework for FPGA-based network processing that includes a TCP/IP stack optimized for use with Intel FPGAs. NetCOPE offers a range of features, including support for multiple protocols and hardware acceleration.
3. NetFPGA: NetFPGA is an open source platform for networking research and education that includes a TCP/IP stack optimized for FPGA-based NICs. NetFPGA offers a range of features, including support for multiple protocols and hardware acceleration.
These alternative projects offer similar capabilities to the TOE project, but each has its own strengths and weaknesses. Ultimately, the best choice will depend on your specific needs and requirements.

79.

80.

81.

(2) Use at the specification stage (cont.)
SH: Which FPGA board do you recommend to implement the TOE project on the OPAE, and why?
ChatGPT: There are several FPGA boards that can be used to implement the TOE project on the OPAE, depending on your specific requirements and budget. Here are a few options that I would recommend:
1. Intel PAC with Intel Arria 10 GX FPGA: This FPGA board is designed specifically for network acceleration and includes an Arria 10 GX FPGA along with 8 GB of DDR4 memory and a 100 Gbps QSFP28 network interface. The board is well-suited for implementing the TOE project, as it provides high-speed networking capabilities and powerful FPGA resources.
2. Terasic DE5a-Net FPGA Development Kit: This development kit includes a Stratix V GX FPGA along with 4 GB of DDR3 memory and multiple high-speed networking interfaces. The board is a good choice for implementing the TOE project, as it provides a large FPGA with plenty of resources and support for multiple networking protocols.
3. Intel FPGA Development Kit: This development kit includes a Cyclone V GX FPGA along with 1 GB of DDR3 memory and multiple high-speed networking interfaces. The board is a more budget-friendly option for implementing the TOE project, while still providing a powerful FPGA and networking capabilities.
Ultimately, the choice of FPGA board will depend on your specific requirements, including performance, cost, and development environment. All of the boards listed above are compatible with the OPAE and provide a good starting point for implementing the TOE project.

82.

(2) Use at the specification stage (cont.)
SH: The cost is limited to 10,000 USD, and higher performance is better. Which FPGA board is your best choice?
ChatGPT: With a budget of $10,000 and a requirement for higher performance, I would recommend the Intel PAC with Intel Arria 10 GX FPGA. This board is designed specifically for network acceleration and offers a powerful Arria 10 GX FPGA, along with 8 GB of DDR4 memory and a 100 Gbps QSFP28 network interface. The board is well-suited for implementing the TOE project and can provide the high performance you require.
The Intel PAC with Intel Arria 10 GX FPGA has a list price of around $3,000, leaving plenty of room in your budget for additional components, such as power supplies, cables, and development tools. The board is also well-supported by the OPAE community, which can help simplify the development process and provide access to a range of resources and examples.
Overall, the Intel PAC with Intel Arria 10 GX FPGA is a great choice for implementing the TOE project on a budget of $10,000 while still providing the high performance you require.

83.

84.

85.

(3) Use at the design/implementation stage
SH: Write Verilog code for an asynchronous FIFO with parameterizable width and depth.
ChatGPT: Here's an example of an asynchronous FIFO with parameterizable width and depth in Verilog:
module fifo #
(
    parameter WIDTH = 8,
    parameter DEPTH = 16
)
(
    input clk,
    input rst,
    input wr_en,
    input rd_en,
    input [WIDTH-1:0] wr_data,
    output [WIDTH-1:0] rd_data,
    output full,
    output empty
);
    reg [WIDTH-1:0] memory [0:DEPTH-1];
    reg [WIDTH-1:0] rd_data_reg;
    reg [WIDTH-1:0] wr_ptr;
    reg [WIDTH-1:0] rd_ptr;
    reg [DEPTH-1:0] occupancy;
    always @(posedge clk) begin
        if (rst) begin
            wr_ptr <= {WIDTH{1'b0}};
            rd_ptr <= {WIDTH{1'b0}};
            occupancy <= {DEPTH{1'b0}};
            rd_data_reg <= {WIDTH{1'b0}};
        end
        else begin
            if (wr_en && !full) begin
                memory[wr_ptr] <= wr_data;
                wr_ptr <= wr_ptr + 1;
                occupancy <= occupancy + 1;
            end
            if (rd_en && !empty) begin
                rd_data_reg <= memory[rd_ptr];
                rd_ptr <= rd_ptr + 1;
                occupancy <= occupancy - 1;
            end
        end
    end
    assign rd_data = rd_data_reg;
    assign full = (occupancy == DEPTH);
    assign empty = (occupancy == 0);
endmodule

86.

(3) Use at the design/implementation stage (cont.)
SH: Write Verilog code for an asynchronous FIFO with parameterizable width and depth.
ChatGPT: This module defines an asynchronous FIFO with a configurable WIDTH and DEPTH. The module has input ports for the clock signal (clk), reset signal (rst), write enable (wr_en), and read enable (rd_en). The module also has input ports for write data (wr_data) and output ports for read data (rd_data), full flag (full), and empty flag (empty).
The module uses an array (memory) to store data in the FIFO, and read and write pointers (rd_ptr and wr_ptr) to keep track of the current positions in the FIFO. The module also uses an occupancy counter (occupancy) to keep track of how many elements are currently in the FIFO.
In the always block, the module checks for a reset signal and updates the read and write pointers and occupancy counter based on the write and read enable signals. The read data is stored in a register (rd_data_reg) to avoid race conditions. The assign statements assign the read data, full flag, and empty flag based on the current occupancy of the FIFO.
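When reviewing generated RTL like this, a small behavioral model is handy as a golden reference for a testbench. The sketch below models a single-clock FIFO with an occupancy counter in C++, matching the single `clk` port of the generated module. `FifoModel` is a hypothetical class, and a cycle with both a write and a read is assumed to leave the occupancy unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral model of a single-clock FIFO (hypothetical reference model).
class FifoModel {
public:
    explicit FifoModel(std::size_t depth) : mem_(depth, 0) {}

    bool full() const { return occupancy_ == mem_.size(); }
    bool empty() const { return occupancy_ == 0; }
    std::size_t occupancy() const { return occupancy_; }
    uint32_t rd_data() const { return rd_data_; }

    // Advance one clock edge with the given control inputs.
    void tick(bool wr_en, uint32_t wr_data, bool rd_en) {
        // Flags are sampled before the edge, as in registered hardware.
        const bool do_wr = wr_en && !full();
        const bool do_rd = rd_en && !empty();
        if (do_rd) {
            rd_data_ = mem_[rd_ptr_];
            rd_ptr_ = (rd_ptr_ + 1) % mem_.size();
        }
        if (do_wr) {
            mem_[wr_ptr_] = wr_data;
            wr_ptr_ = (wr_ptr_ + 1) % mem_.size();
        }
        // A simultaneous write and read cancel out in the occupancy count.
        occupancy_ += do_wr ? 1 : 0;
        occupancy_ -= do_rd ? 1 : 0;
    }

private:
    std::vector<uint32_t> mem_;
    std::size_t wr_ptr_ = 0;
    std::size_t rd_ptr_ = 0;
    std::size_t occupancy_ = 0;
    uint32_t rd_data_ = 0;
};
```

Driving the RTL and this model with the same stimulus and comparing `rd_data`, `full`, and `empty` each cycle is one way to check generated code before trusting it.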

87.

88.

There is still plenty that ChatGPT cannot do!
SH: Write the fastest TCP Offload Engine HDL code for Intel FPGA as detailed as possible.
ChatGPT: Designing a TCP Offload Engine (TOE) is a complex task that requires a deep understanding of both the TCP/IP protocol and FPGA architecture. Here are some general steps that may be involved in implementing a TOE on an Intel FPGA, along with some tips for optimizing the HDL code for maximum performance: (snip)
Turning requirements that are highly abstract and carry demanding non-functional requirements into a concrete implementation is the engineer's job!

89.

90.
