[DL Seminar] Human Pose Estimation @ ECCV2018

DEEP LEARNING JP [DL Seminar] Human Pose Estimation @ ECCV2018 Hiromi Nakagawa, Matsuo Lab http://deeplearning.jp/

Agenda
1. ECCV2018
2. Trends and keywords in Human Pose Estimation
3. Human Pose Estimation @ ECCV2018

ECCV 2018
• ECCV
  – European Conference on Computer Vision
  – One of the top computer vision conferences, alongside CVPR and ICCV
• ECCV 2018
  – 2018/09/08 to 2018/09/14 @ Munich, Germany
  – 2439 main conference submissions
  – 776 accepted (59 orals, 717 posters, 31.8% acceptance rate)
• This talk introduces roughly 14 papers on Human Pose Estimation that caught my eye among the ECCV2018 accepted papers

ECCV 2018
• As an aside: the Best Paper is on 6D object detection and pose estimation
  – Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
  – Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, Rudolph Triebel

ECCV 2018
• As an aside: the Honorable Mentions are Group Normalization and GANimation
  – Group Normalization
  – Yuxin Wu, Kaiming He

ECCV 2018
• As an aside: the Honorable Mentions are Group Normalization and GANimation
  – GANimation: Anatomically-aware Facial Animation from a Single Image
  – Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer

Trends and keywords in Human Pose Estimation
• Single- / Multi-Person
• Multi-Person: Top-Down / Bottom-Up
  – Top-Down: Person Detection → Single-Person Pose Estimation × N
    • High accuracy but slow. Depends on the performance of the person detector.
  – Bottom-Up: Joint Candidate Detection → Grouping
    • Fast but lower accuracy. Complex partitioning.
• Single Image / Sequential Images
  – Temporal Coherence
• With / Without Depth
• 2D / 3D Pose Estimation
• Supervised / Semi-Supervised / Unsupervised
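
The control flow of the two multi-person pipelines can be sketched as follows. This is only an illustration of where the "× N" cost comes from: the detector, pose network, joint network, and grouping step are hypothetical stand-ins (not taken from any of the papers), and the call counter exists purely to make the difference visible.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # x0, y0, x1, y1
Pose = List[Point]


def top_down(image, detect_people: Callable, estimate_single_pose: Callable) -> List[Pose]:
    """Detect people first, then run a single-person estimator once per box."""
    return [estimate_single_pose(image, box) for box in detect_people(image)]


def bottom_up(image, detect_joints: Callable, group_joints: Callable) -> List[Pose]:
    """Detect all joint candidates in one pass, then group them into people."""
    return group_joints(detect_joints(image))


# --- Hypothetical stand-ins so the sketch runs without a trained model ---
calls = {"pose_net": 0, "joint_net": 0}


def fake_detector(image) -> List[Box]:
    return [(0, 0, 10, 20), (30, 0, 40, 20), (60, 0, 70, 20)]


def fake_pose_net(image, box: Box) -> Pose:
    calls["pose_net"] += 1
    x0, y0, x1, y1 = box
    return [((x0 + x1) / 2, (y0 + y1) / 2)]  # one "joint" at the box centre


def fake_joint_net(image) -> List[Point]:
    calls["joint_net"] += 1
    return [(5.0, 10.0), (35.0, 10.0), (65.0, 10.0)]


def fake_grouping(joints: List[Point]) -> List[Pose]:
    return [[j] for j in joints]  # trivially one joint per person


td = top_down(None, fake_detector, fake_pose_net)
bu = bottom_up(None, fake_joint_net, fake_grouping)
print(calls)  # the pose network runs once per person, the joint network once per image
```

With three people in the image, the top-down path calls the pose network three times, while the bottom-up path runs its network once and pushes the per-person work into grouping.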

Human Pose Estimation @ ECCV2018

Multi-Person
• Pose Proposal Networks [Sekii] (2D)
• Pose Partition Networks for Multi-Person Pose Estimation [Nie+] (2D)
• MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network [Kocabas+] (2D)
• PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model [Papandreou+] (2D)

(Mainly) Single-Person
• Deeply Learned Compositional Models for Human Pose Estimation [Tang+] (2D)
• Learning 3D Human Pose from Structure and Motion [Dabral+] (3D)
• Exploiting temporal information for 3D human pose estimation [Hossain+] (3D)
• Integral Human Pose Regression [Sun+] (3D)
• Multi-Scale Structure-Aware Network for Human Pose Estimation [Ke+] (2D)
• Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation [Rhodin+] (3D)
• 3D Ego-Pose Estimation via Imitation Learning [Yuan+] (3D)
• Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera [Marcard+] (3D)
• Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition [Weng+] (3D)

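Pose Proposal Networks [Sekii]
• Detects human body parts and the connections between them (limbs) grid-wise with a single-shot detector such as YOLO
• Merges these detections bottom-up to estimate the poses of an arbitrary number of people with high accuracy and at high speed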

Pose Proposal Networks [Sekii]
• On the MPII Human Pose dataset
  – Single-Person: detects keypoints at 380 FPS (11× faster) while keeping accuracy comparable to SoTA
  – Multi-Person: estimates poses at 180 FPS while keeping accuracy comparable to SoTA

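Pose Partition Networks for Multi-Person Pose Estimation [Nie+]
• Using the joint confidence maps (a) and the person centroids (b), joint candidates are embedded to the centroids (c)
• Joints are partitioned per person (d), and each person's pose is estimated by local greedy inference (e)
• Accuracy (AP) on MPII is better than [Sekii], but the inference speed is unclear (0.77 sec per image, i.e. about 1.3 FPS?)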
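
The partition step (c) to (d) can be illustrated with a toy nearest-centroid assignment. This is a simplified sketch under stated assumptions, not the network from the paper: it assumes each joint candidate already carries a predicted position for its person's centroid, and `partition_joints`, the joint names, and the coordinates are all hypothetical. The local greedy inference of step (e) is not shown.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# (joint type, joint position, centroid position predicted by that joint)
Candidate = Tuple[str, Point, Point]


def partition_joints(
    candidates: List[Candidate], centroids: List[Point]
) -> Dict[int, Dict[str, List[Point]]]:
    """Assign every joint candidate to the detected centroid nearest to its vote."""
    people: Dict[int, Dict[str, List[Point]]] = {i: {} for i in range(len(centroids))}
    for joint_type, position, voted_centroid in candidates:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(voted_centroid, centroids[i]),
        )
        people[nearest].setdefault(joint_type, []).append(position)
    return people


# Two people standing side by side (made-up coordinates).
centroids = [(10.0, 10.0), (50.0, 10.0)]
candidates = [
    ("head", (10.0, 2.0), (11.0, 9.0)),
    ("head", (50.0, 2.0), (49.0, 11.0)),
    ("ankle", (12.0, 20.0), (9.0, 10.5)),
    ("ankle", (48.0, 20.0), (52.0, 9.0)),
]
people = partition_joints(candidates, centroids)
print(people)
```

Once joints are split per person like this, pose inference only has to search within each partition rather than over all candidates in the image.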

PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model [Papandreou+]
• Proposes a bottom-up method that performs pose estimation and instance segmentation jointly
• After detecting person keypoints, it infers the relationships among them and groups them into per-person poses
• Introduces offsets to improve keypoint localization accuracy and to infer connections between keypoints
• No inference time is reported

MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network [Kocabas+]
• Bottom-up 2D multi-person pose estimation
• Multi-task learning of person detection, person segmentation, and pose estimation
• Higher accuracy (AP) than PersonLab; inference speed is 27 FPS (1 person) to 15 FPS (20 people), and 23 FPS on COCO (~3 people per image)
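
To make the reported FPS figures easier to compare, they can be converted to per-image latency. The sketch below uses only the 27 FPS and 15 FPS numbers from the slide; the per-person increment assumes runtime grows linearly with the number of people, which is an assumption made here for illustration and not a claim from the paper.

```python
def latency_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per image."""
    return 1000.0 / fps


# Figures quoted on the slide: 27 FPS with 1 person, 15 FPS with 20 people.
latency_1_person = latency_ms(27.0)
latency_20_people = latency_ms(15.0)

# Assumption: cost grows linearly with the number of people in the image.
extra_ms_per_person = (latency_20_people - latency_1_person) / (20 - 1)

print(f"{latency_1_person:.1f} ms, {latency_20_people:.1f} ms, "
      f"+{extra_ms_per_person:.2f} ms per extra person")
```

Under that linear assumption, each additional person costs roughly 1.6 ms on top of about 37 ms for a single person.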

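Deeply Learned Compositional Models for Human Pose Estimation [Tang+]
• Compositional models can express the hierarchy among body parts through two inference passes, bottom-up and top-down, and can resolve low-level ambiguities; however, existing methods assume unrealistic relationships or end up with an enormous state space
• This work learns complex, realistic hierarchical relationships with a DNN, and reduces the size of the state space with a representation that compactly embeds orientation, scale, and shape information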

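Learning 3D Human Pose from Structure and Motion [Dabral+]
• Introduces two anatomically inspired loss functions: an illegal-angle loss and a symmetry loss
• With weakly supervised learning that uses a small amount of depth and skeleton information, and by exploiting the previous N frames, it achieves anatomically coherent and smooth 3D pose estimation from in-the-wild images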
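
The idea behind a symmetry loss can be sketched as a penalty on left/right bone-length differences. This is a minimal illustration, assuming an absolute-difference penalty and a hypothetical skeleton with made-up joint names; the paper's exact formulation may differ, and the illegal-angle loss is not covered here.

```python
import math
from typing import Dict, Tuple

Joint3D = Tuple[float, float, float]
Bone = Tuple[str, str]

# Hypothetical skeleton: each entry pairs a left bone with its right counterpart.
SYMMETRIC_BONES = [
    (("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow")),
    (("l_hip", "l_knee"), ("r_hip", "r_knee")),
]


def bone_length(pose: Dict[str, Joint3D], bone: Bone) -> float:
    """Euclidean distance between the two joints of a bone."""
    start, end = bone
    return math.dist(pose[start], pose[end])


def symmetry_loss(pose: Dict[str, Joint3D]) -> float:
    """Sum of absolute length differences between left and right bones."""
    return sum(
        abs(bone_length(pose, left) - bone_length(pose, right))
        for left, right in SYMMETRIC_BONES
    )


# Made-up prediction: the right upper arm is 0.5 units shorter than the left.
pose = {
    "l_shoulder": (0.0, 0.0, 0.0), "l_elbow": (0.0, -3.0, 0.0),
    "r_shoulder": (1.0, 0.0, 0.0), "r_elbow": (1.0, -2.5, 0.0),
    "l_hip": (0.0, -5.0, 0.0), "l_knee": (0.0, -9.0, 0.0),
    "r_hip": (1.0, -5.0, 0.0), "r_knee": (1.0, -9.0, 0.0),
}
print(symmetry_loss(pose))  # 0.5
```

A loss of this kind needs no 3D ground truth, which is why it fits a weakly supervised setting: it only checks that the predicted skeleton is internally consistent.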

Exploiting temporal information for 3D human pose estimation [Hossain+]
• A single 2D pose on its own is ambiguous when lifted to 3D, so the result becomes temporally incoherent
• Proposes using an LSTM with skip connections to estimate temporally coherent 3D human poses from 2D human poses

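Integral Human Pose Regression [Sun+]
• Many existing heatmap-based methods are not end-to-end because of a non-differentiable thresholding step, while joint-regression-based methods are used for 3D pose estimation but are hard to train
• Introduces integral regression, which handles heatmaps and regression in a unified way, and comprehensively validates its effect across diverse datasets and settings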
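
One way to read "unifying heatmap and regression" is to replace the hard maximum over the heatmap with an expectation of the pixel coordinates under the normalised heatmap (often called soft-argmax). The sketch below is a plain-Python illustration under that reading, assuming softmax normalisation; the function name and the toy heatmap are hypothetical, and a real model would do this on tensors inside an autodiff framework.

```python
import math
from typing import List, Tuple


def integral_regression(heatmap: List[List[float]]) -> Tuple[float, float]:
    """Return the expected (x, y) location under a softmax-normalised heatmap."""
    peak = max(max(row) for row in heatmap)  # subtract the max for numerical stability
    weights = [[math.exp(value - peak) for value in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


# Toy 3x5 heatmap with two equally strong responses at x=1 and x=3 on row 1.
heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
x, y = integral_regression(heatmap)
print(x, y)
```

A hard argmax would have to pick one of the two peaks, whereas the expectation lands between them and varies smoothly with the heatmap values, which is what allows gradients to flow from a coordinate loss back into the heatmap.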

Multi-Scale Structure-Aware Network for Human Pose Estimation [Ke+]
• Existing work is sensitive to scale changes and does not incorporate priors about human body structure
• Introduces multi-scale supervision, a multi-scale regression network, a structure-aware loss, and keypoint masking training, enabling multi-scale, structure-aware pose estimation

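Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation [Rhodin+]
• Prior work needed large amounts of 3D annotation, not only for supervised learning but even for weakly supervised learning
• This work uses multi-view images and trains the model to predict the image seen from one viewpoint given the image from another, learning a geometry-aware representation without supervision; pose can then be estimated with only a small amount of labeled data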

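3D Ego-Pose Estimation via Imitation Learning [Yuan+]
• Points out that existing ego-pose estimation methods do not account for the laws of physics
• By using imitation learning based on physics simulation, it enables physically plausible pose estimation, and confirms that the approach is also effective for domain adaptation from simulation to the real world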

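Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera [Marcard+]
• Recovers accurate 3D poses from IMUs (inertial measurement units) and video from a moving camera such as a handheld camcorder
• In particular, it handles difficulties such as identifying each person when several people appear in the frame, and cluttered backgrounds
• Uses the proposed method to build and release a new dataset, 3D Poses in the Wild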

Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition [Weng+]
• Proposes Deformable Pose Traversal Convolution, which combines a traversal convolution that scans across the whole-body pose with an attention-based deformable convolution, so that it picks up the features of the joints that matter for 3D action/gesture recognition

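Summary and impressions
• For multi-person estimation, many papers took bottom-up approaches, whose speed does not degrade in proportion to the number of people
  – Accuracy cannot be compared fully because of differences in datasets and so on, but in terms of speed, Pose Proposal Networks may be the fastest
• Single-person work covers a variety of individual themes
  – 3D
  – In-the-wild environments
  – Efficient use of data
  – Incorporating human body structure and spatio-temporal consistency
  – (Datasets and evaluation metrics vary so much that I could not compare the results. I would love for someone to compile them.)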
