BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models, yet they still face the challenge of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a scheduling network and a score network, which can be trained with a novel bilateral modeling objective. We show that the new surrogate objective yields a lower bound on the log marginal likelihood that is tighter than that of a conventional surrogate. We also find that BDDM can inherit pre-trained score network parameters from any DPM, which enables fast and stable learning of the scheduling network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as 3 sampling steps. Moreover, compared with other state-of-the-art diffusion-based neural vocoders, BDDMs produce samples of comparable or higher quality that are indistinguishable from human speech, notably with only 7 sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).
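The speedup figures quoted in the abstract are consistent with a simple ratio of sampling steps. The sketch below reproduces that arithmetic under two assumptions that are not stated on this page: that the WaveGrad baseline uses 1000 sampling steps and the DiffWave baseline uses 200, and that speed is measured by step count alone. The function and variable names are illustrative only.

```python
def step_ratio_speedup(baseline_steps: int, fast_steps: int) -> float:
    """Speedup estimated purely as a ratio of sampling steps."""
    return baseline_steps / fast_steps


BDDM_STEPS = 7  # from the abstract

# ASSUMPTION: baseline step counts are not given on this page.
ASSUMED_BASELINE_STEPS = {"WaveGrad": 1000, "DiffWave": 200}

speedups = {
    name: step_ratio_speedup(steps, BDDM_STEPS)
    for name, steps in ASSUMED_BASELINE_STEPS.items()
}

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

Under these assumptions the ratios round to the 143x and 28.6x quoted above; wall-clock speedups could differ if the per-step cost is not the same across models.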

Fast and high-fidelity speech generation using BDDMs:

By introducing a scheduling network optimized with our derived loss, we can generate high-fidelity speech with as few as 3 steps.

Text: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.

Step 0 (White Noise): Note: Consider lowering the volume before listening
Step 1:
Step 2:
Step 3:
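The progression from white noise (Step 0) through Steps 1 to 3 corresponds to running a reverse diffusion process over a very short noise schedule. The following is a minimal sketch of a generic DDPM-style ancestral sampler, not the BDDM implementation: the three-entry `betas` schedule and the zero-output `dummy_eps_model` are hypothetical placeholders. In BDDM the schedule would come from the scheduling network and the noise estimate from a trained score network conditioned on a mel-spectrogram.

```python
import numpy as np


def reverse_sample(eps_model, betas, shape, rng):
    """Generic DDPM ancestral sampling over a short noise schedule.

    eps_model(x, alpha_bar) must return a noise estimate shaped like x.
    Returns the list of intermediate signals, starting from white noise.
    """
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # step 0: white noise
    trajectory = [x]
    for n in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, alpha_bars[n])
        # Posterior mean of the previous (less noisy) signal.
        mean = (x - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
        if n > 0:
            # Posterior standard deviation; no noise is added at the last step.
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            x = mean + sigma * rng.standard_normal(shape)
        else:
            x = mean
        trajectory.append(x)
    return trajectory


def dummy_eps_model(x, alpha_bar):
    """HYPOTHETICAL stand-in for a trained score network: predicts zero noise."""
    return np.zeros_like(x)


# HYPOTHETICAL 3-step schedule, chosen only for illustration.
betas = [0.1, 0.3, 0.7]
trajectory = reverse_sample(dummy_eps_model, betas, shape=(16000,), rng=np.random.default_rng(0))
print(len(trajectory))  # white noise plus one signal per sampling step
```

With a schedule of length 3, the returned list holds four signals, matching the Step 0 to Step 3 layout above. Because the placeholder model predicts zero noise, the output here is not speech; the sketch only illustrates the control flow of few-step sampling.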

LJSpeech samples generated from different neural vocoders:

Note: Different rows correspond to different noise schedules or sampling methods for inference.

Text: Under the new rule visitors were not allowed to pass into the interior of the prison, but were detained between the grating. This Commission can recommend no procedures for the future protection of our Presidents which will guarantee security.
Ground Truth
WaveNet (MoL)
WaveGlow
MelGAN
HiFi-GAN V1
WaveGrad
DiffWave
BDDM - 3 steps
BDDM - 7 steps
BDDM - 12 steps

VCTK samples from different generative diffusion models:

Note: Different rows correspond to different noise schedules or sampling methods for inference.

Text: Frankly, we should all have such problems. I felt he was excellent.
Ground Truth
DDPM - 8 steps (Grid Search)
DDPM - 1000 steps (Linear)
DDIM - 8 steps (Linear)
DDIM - 100 steps (Linear)
NE - 8 steps (Linear)
BDDM - 8 steps