Audio spotforming using nonnegative tensor factorization with attractor-based regularization


September 15, 2024

Slide overview

Shoma Ayano, Li Li, Shogo Seki, and Daichi Kitamura, "Audio spotforming using nonnegative tensor factorization with attractor-based regularization," Proceedings of European Signal Processing Conference (EUSIPCO 2024), pp. 121–125, Lyon, France, August 2024.


This account collects the presentation slides from the Kitamura laboratory's internal and external presentations.


Text of each page
1.

32nd European Signal Processing Conference (EUSIPCO 2024), TH2.SC4.4
Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization
Shoma Ayano (National Institute of Technology, Kagawa College), Li Li (CyberAgent Inc.), Shogo Seki (CyberAgent Inc.), Daichi Kitamura (National Institute of Technology, Kagawa College)

2.

Introduction
• Target speaker extraction
  – Extract only the target source from the observed signal
• Applications
  – Automatic speech recognition
  – Hearing-aid systems
[Figure: target speaker extraction separates the target from interference sources in the observed mixture]

3.

Introduction
• Beamforming (BF)
  – Extracts the sources arriving from a specific direction
• Spotforming (using multiple microphone arrays)
  – Extracts the sources arriving from a specific area
[Figure: beamforming steers one mic. array toward a target direction; spotforming combines multiple mic. arrays to focus on a target area among interference sources]

4.

Conventional method: modeling
• Situation
  – Each microphone array extracts only a specific direction
  – The output signals include both target and interference sources
  – Components common to all the BF outputs correspond to the target source
[Figure: BF outputs of arrays 1 and 2 share a common component (target) and contain unique components (interferences 1 and 2)]

5.

Conventional method: overview
• Spotforming by NMF using multiple microphone arrays [Kagimoto+, 2022]
[Figure: pipeline of BF → concatenate → NMF → masking → Wiener filters → delay-and-sum]

6.

Conventional method: overview
• Spotforming by NMF using multiple microphone arrays [Kagimoto+, 2022]
  – Apply BF to each of the microphone arrays to obtain BF outputs 1 and 2
[Figure: same pipeline, highlighting the BF stage for arrays 1 and 2 among the target and interference sources]

7.

Conventional method: overview
• Spotforming by NMF using multiple microphone arrays [Kagimoto+, 2022]
  – Estimate the common components (target) from BF outputs 1 and 2
[Figure: same pipeline, highlighting the concatenation, NMF, and masking stages]

8.

Conventional method: overview
• Spotforming by NMF using multiple microphone arrays [Kagimoto+, 2022]
  – Enhance only the target source as post-processing
[Figure: same pipeline, highlighting the Wiener filters and delay-and-sum stage]

9.

Conventional method: NMF
• Nonnegative matrix factorization (NMF) [Lee+, 1999]
  – Low-rank decomposition with a nonnegativity constraint
    • Extracts a limited number of nonnegative bases and their coefficients
  – Applying NMF to spectrograms
    • Estimates frequently appearing spectral components
[Figure: a nonnegative matrix (power spectrogram, frequency × time) is decomposed into a basis matrix (spectral patterns) and an activation matrix (time-varying gains); the dimensions are the number of frequency bins, time frames, and bases]
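The low-rank decomposition above can be sketched with the standard multiplicative updates for the (generalized) KL divergence; this is a textbook derivation, not the authors' exact implementation, and the matrix sizes are toy values:

```python
import numpy as np

def nmf_kl(V, K, n_iter=100, seed=0):
    """Decompose a nonnegative matrix V (F x T) into a basis matrix W (F x K)
    and an activation matrix H (K x T) by multiplicative KL-NMF updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.uniform(0.1, 1.0, (F, K))
    H = rng.uniform(0.1, 1.0, (K, T))
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)   # basis update
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)  # activation update
    return W, H

# Toy "power spectrogram" built from two spectral patterns (exact rank 2)
rng = np.random.default_rng(1)
V = np.abs(rng.normal(size=(8, 2))) @ np.abs(rng.normal(size=(2, 20)))
W, H = nmf_kl(V, K=2)
print(W.shape, H.shape)  # (8, 2) (2, 20)
```

Because the toy matrix is exactly rank 2 and nonnegative, the two estimated bases reconstruct it with small residual error.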

10.

Conventional method: NMF
• Concatenate the BF outputs and apply NMF
[Figure: pipeline with the concatenation and NMF stages highlighted; the common components correspond to the target]

11.

Conventional method: NMF
• Concatenate the BF outputs and apply NMF
  – BF outputs 1 and 2 are concatenated along the time axis to form the NMF input, which is decomposed into a basis matrix and an activation matrix
[Figure: the concatenated spectrograms decomposed by NMF]

12.

Conventional method: masking
• Generate a binary mask
  – Separate the activation matrix into the parts for arrays 1 and 2
  – Obtain only the active components by binary thresholding
  – Calculate the logical conjunction to extract the common components
[Figure: pipeline with the masking stage highlighted]
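One plausible reading of the masking step can be sketched as follows; the activation values and the per-basis "active in some frame" criterion are illustrative assumptions, and the threshold is one of the values listed later in the experimental conditions:

```python
import numpy as np

tau = 0.0025  # thresholding hyperparameter (from the experimental conditions)

# Hypothetical activation matrices of the same K = 2 bases, separated per array
H1 = np.array([[0.8, 0.3], [1e-4, 0.9]])   # array 1: K x T gains
H2 = np.array([[0.7, 0.6], [2e-4, 1e-4]])  # array 2: K x T gains

active1 = (H1 > tau).any(axis=1)   # per-basis activity in array 1
active2 = (H2 > tau).any(axis=1)   # per-basis activity in array 2

# Logical conjunction: a basis active in BOTH arrays is a common (target) basis
common = np.logical_and(active1, active2)
print(common)  # [ True False]
```

Basis 0 is active in both arrays (common, target), while basis 1 is active only in array 1 (unique, interference), so the conjunction keeps only basis 0.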

13.

Conventional method: post-process
• Post-processing
  – Apply a Wiener filter (element-wise multiplication) to each of BF outputs 1 and 2
  – Apply the delay-and-sum operation for further enhancement
[Figure: pipeline with the Wiener filters and delay-and-sum stage highlighted]
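The post-processing can be sketched as a Wiener gain applied element-wise to each BF output, followed by averaging the (assumed time-aligned) filtered signals; the power estimates and the shared gain are illustrative assumptions:

```python
import numpy as np

def wiener_gain(P_target, P_total, eps=1e-12):
    """Wiener-filter gain: estimated target power over total power, in [0, 1]."""
    return P_target / (P_total + eps)

rng = np.random.default_rng(0)
F, T = 4, 6
X1 = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))  # BF output 1 (STFT)
X2 = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))  # BF output 2 (STFT)

P_t = rng.uniform(0.1, 1.0, (F, T))  # hypothetical target power estimate
P_i = rng.uniform(0.1, 1.0, (F, T))  # hypothetical interference power estimate
G1 = wiener_gain(P_t, P_t + P_i)

Y1 = G1 * X1  # element-wise multiplication (Wiener filtering)
Y2 = G1 * X2  # same gain used for both outputs, for illustration

# Delay-and-sum over the filtered outputs (delays assumed already compensated)
Y = 0.5 * (Y1 + Y2)
print(Y.shape)  # (4, 6)
```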

14.

Proposed method: motivation
• Lack of model interpretability
  – Concatenation along the time axis may collapse the time consistency of the NMF input
• We introduce nonnegative tensor factorization (NTF)
  – A new regularization is also incorporated for further performance improvement
[Figure: conventional pipeline with the concatenation and NMF stages]

15.

Proposed method: motivation
• Lack of model interpretability
  – Concatenation along the time axis may collapse the time consistency of the NMF input
• We introduce nonnegative tensor factorization (NTF)
  – A new regularization is also incorporated for further performance improvement
[Figure: same pipeline with the concatenation and NMF stages replaced by a single NTF stage]

16.

Proposed method: model
• Applying NTF to the BF outputs
  – The BF outputs (a three-dimensional tensor) are decomposed into basis, activation, and allocation matrices
  – The allocation matrix assigns each basis vector to the corresponding BF outputs
[Figure: tensor of BF outputs 1 and 2 decomposed into allocation, basis, and activation matrices]

17.

Proposed method: model
• Applying NTF to the BF outputs
  – A basis vector whose allocation contributes to both BF outputs represents a common component (target)
[Figure: decomposition with an allocation column active for both outputs]

18.

Proposed method: model
• Applying NTF to the BF outputs
  – A basis vector whose allocation contributes to only one BF output represents a unique component (interference)
[Figure: decomposition with an allocation column active for a single output]

19.

Proposed method: model
• Applying NTF to the BF outputs
  – Each column of the allocation matrix is regularized to be close to its nearest hard allocation, minimizing the error between them
[Figure: allocation matrix pulled toward the nearest hard allocation matrix]
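The NTF model of slides 16–19 can be sketched as a CP-style decomposition: the stacked BF outputs form a tensor that is approximated by an allocation matrix Z, a basis matrix W, and an activation matrix H (toy sizes; the symbols are this sketch's own naming, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
M, F, T, K = 2, 8, 10, 3   # arrays, freq. bins, time frames, bases
Z = rng.uniform(size=(M, K))  # allocation: how much basis k feeds BF output m
W = rng.uniform(size=(F, K))  # basis: spectral patterns
H = rng.uniform(size=(K, T))  # activation: time-varying gains

# Model: x_hat[m, f, t] = sum_k Z[m, k] * W[f, k] * H[k, t]
X_hat = np.einsum('mk,fk,kt->mft', Z, W, H)
print(X_hat.shape)  # (2, 8, 10)
```

A basis whose allocation column is roughly uniform over the arrays contributes to both BF outputs (a common/target component), while a near one-hot column contributes to a single output (a unique/interference component).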

20.

Proposed method: regularization
• Attractor-based regularization
  – Define one-hot and central attractor vectors
    • One-hot attractor vectors: unique components (interference sources)
    • Central attractor vector: common components (target source)
  – The regularization pulls each allocation column toward its nearest attractor, enhancing the components to be more clearly common or unique (discriminative basis estimation)
[Figure: allocation columns scattered without regularization vs. clustered around the one-hot and central attractor vectors with attractor-based regularization]
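For two arrays, the attractors and the nearest-attractor selection can be sketched as below; the use of the generalized KL divergence follows the next slide, but the divergence direction and the epsilon smoothing are this sketch's assumptions:

```python
import numpy as np

def gkl(p, q, eps=1e-12):
    """Generalized KL divergence between nonnegative vectors (zero iff p == q)."""
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q) - p + q))

# Attractor vectors for M = 2 arrays
one_hot_1 = np.array([1.0, 0.0])  # unique component of array 1 (interference)
one_hot_2 = np.array([0.0, 1.0])  # unique component of array 2 (interference)
central = np.array([0.5, 0.5])    # common component (target)
attractors = [one_hot_1, one_hot_2, central]

def nearest_attractor(z):
    """Return the attractor closest (in generalized KL) to allocation column z."""
    return min(attractors, key=lambda a: gkl(a, z))

print(nearest_attractor(np.array([0.9, 0.1])))    # -> one_hot_1 (unique basis)
print(nearest_attractor(np.array([0.45, 0.55])))  # -> central (common basis)
```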

21.

Proposed method: optimization
• Optimization problem
  – Data fidelity of NTF plus an attractor-based regularization term, both measured by the KL divergence
  – The regularizer adaptively finds the nearest attractor for each basis and penalizes the error to it
  – Solved by the majorization-minimization (MM) algorithm [Hunter+, 2000]
    • An iterative update algorithm can be obtained as usual
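The structure of the objective (data fidelity plus a weighted nearest-attractor penalty) can be sketched as follows; the weight name mu, the divergence direction, and the normalization details are assumptions of this sketch, not the paper's exact formulation:

```python
import numpy as np

def gkl(p, q, eps=1e-12):
    """Generalized KL divergence between nonnegative arrays."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q) - p + q))

def objective(X, Z, W, H, attractors, mu):
    """NTF data fidelity plus mu times the attractor-based penalty."""
    X_hat = np.einsum('mk,fk,kt->mft', Z, W, H)
    fidelity = gkl(X, X_hat)
    # each allocation column is penalized by its distance to the NEAREST attractor
    penalty = sum(
        min(gkl(a, Z[:, k]) for a in attractors) for k in range(Z.shape[1])
    )
    return fidelity + mu * penalty

rng = np.random.default_rng(0)
M, F, T, K = 2, 4, 5, 3
Z = rng.uniform(size=(M, K))
W = rng.uniform(size=(F, K))
H = rng.uniform(size=(K, T))
X = np.einsum('mk,fk,kt->mft', Z, W, H)  # "observed" tensor = exact model fit
attractors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
print(objective(X, Z, W, H, attractors, mu=0.0))  # 0.0: exact fit, no penalty
```

Setting mu to zero for the first iterations and to a large value afterwards, as in the experimental conditions, first fits the data and then sharpens the allocation toward the attractors.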

22.

Experiment: condition
• Simulation using Pyroomacoustics [Scheibler+, 2018]
  – Situations A and B, reverberation time T60 = 256 ms
[Figure: room layouts of situations A and B]

23.

Experiment: condition
• The other conditions
  – Sampling frequency: 16 kHz
  – BF algorithm: minimum variance distortionless response (MVDR)
  – Window length: 32 ms
  – Shift length: 16 ms
  – Initial value of the allocation matrix: all columns set to the central attractor vector
  – Initial values of the basis and activation matrices: 10 patterns of pseudorandom values in the range (0, 1)
  – Number of parameter updates: 100 iterations
  – Hyperparameters: τ = 0.0025 / 0.005 / 0.01; μ = 0 for the first 50 iterations, μ = 10000 for the last 50
  – Evaluation score: source-to-distortion ratio (SDR)
[Figure: pipeline of the proposed method with BF, NTF, masking, and Wiener filters]
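The evaluation score can be illustrated with the simplest form of the SDR, the ratio of target power to residual power in decibels (evaluations of this kind typically use the refined BSS Eval variant; this plain ratio only conveys the idea):

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simple SDR: 10*log10 of reference power over residual power."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)

rng = np.random.default_rng(0)
s = rng.normal(size=1000)                # target source
noisy = s + 0.1 * rng.normal(size=1000)  # estimate with residual distortion
print(round(sdr_db(s, noisy), 1))        # roughly 20 dB for 10% residual amplitude
```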

24.

Experiment: results
• Situation A (2 mic. arrays)
  – Reverberant case (T60 = 256 ms)
[Figure: SDR results for situation A]

25.

Experiment: results
• Situation A (2 mic. arrays)
  – No-reverberation case (T60 = 0 ms) and reverberant case (T60 = 256 ms)
[Figure: SDR results for situation A in both cases]

26.

Experiment: results
• Situation B (3 mic. arrays)
  – Reverberant case (T60 = 256 ms)
[Figure: SDR results for situation B]

27.

Experiment: results
• Behavior of the allocation matrix
  – Situation B (3 mic. arrays): one target speaker and three interfering speakers
[Figure: allocation matrix over the iterations for situation B; slides 27–32 step through this animation]


33.

Conclusion
• In this study
  – We proposed a new spotforming method using NTF
• Benefits of the proposed method
  – The NTF decomposition model has higher model interpretability
    • The model is easily extended, e.g., by introducing regularization
  – Attractor-based regularization automatically allocates the basis vectors to the target/interference sources
• Experiments
  – The proposed method outperforms the conventional spotforming algorithm
  – It shows robust performance: the hyperparameters are easy to tune
Thank you for your attention!