Diffusion Conditional Expectation Model
for Efficient and Robust Target Speech Extraction

Leying Zhang¹,², Yao Qian², Linfeng Yu¹, Heming Wang³, Xinkai Wang¹,
Hemin Yang², Long Zhou², Shujie Liu², Yanmin Qian¹, Michael Zeng²

¹Shanghai Jiao Tong University, China ²Microsoft, USA ³The Ohio State University, USA

Abstract

Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion that degrades perceived speech quality. Generative approaches, particularly diffusion-based methods, can improve perceptual speech quality but suffer from slow inference. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM), which regenerates and optimizes speech quality starting from speech pre-processed by a discriminative model. Our method outperforms conventional methods on both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks.

Multi-speaker noisy scenario

Mixture: The input multi-speaker mixture with background noise.
Ground Truth: The clean speech of the target speaker.
DCEM: Extraction results of our proposed DCEM method.
DPCCN: Extraction results of the discriminative DPCCN method.
DiffSep+SV: Extraction results of the diffusion-based DiffSep+SV method.
We also provide the corresponding mel spectrogram of each audio file for better visualization.

WAV_ID	Mixture	Ground Truth	DCEM	DPCCN	DiffSep+SV

Multi-speaker clean scenario

Mixture: The input multi-speaker mixture without background noise.
Ground Truth: The clean speech of the target speaker.
DCEM: Extraction results of our proposed DCEM method.
DPCCN: Extraction results of the discriminative DPCCN method.
DiffSep+SV: Extraction results of the diffusion-based DiffSep+SV method.
We also provide the corresponding mel spectrogram of each audio file for better visualization.

WAV_ID	Mixture	Ground Truth	DCEM	DPCCN	DiffSep+SV

Single-speaker noisy scenario

Mixture: The input noisy single-speaker speech.
Ground Truth: The clean speech of the target speaker.
DCEM: Enhancement results of our proposed DCEM method.
DCCRN: Enhancement results of the discriminative DCCRN method.
SGMSE+: Enhancement results of the diffusion-based SGMSE+ method.
We also provide the corresponding mel spectrogram of each audio file for better visualization.

WAV_ID	Mixture	Ground Truth	DCEM	DCCRN	SGMSE+
