DDD:
A Perceptually Superior
Low-Response-Time
DNN-Based Declipper



*Jayeon Yi, Junghyun Koo, Kyogu Lee
*This work was done during an internship at MARG (Music and Audio Research Group), Seoul National University



Paper: TBD

Code: https://github.com/stet-stet/DDD

ABSTRACT

Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon degrades not only the perceived speech quality but also downstream processes that rely on the distorted signal. A real-time-capable, robust, low-response-time method for speech declipping (SD) is therefore desirable. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declipping deep neural network (DNN) that, by design, requires less response time. We first observe that Demucs, a real-time-capable DNN model previously untested on declipping, exhibits reasonable declipping performance. We then employ adversarial learning objectives to increase the perceptual quality of the output speech without adding inference overhead. Subjective evaluation shows that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain insight into the output behavior of DDD compared to the baselines. Finally, our streaming simulations show that DDD achieves sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six.
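The clipping severities on this page are specified as an SNR between the clean and clipped signal. As a hedged illustration (not the paper's code), hard clipping to a target SNR can be realized by bisecting on the clipping threshold, since the SNR grows monotonically with the threshold:

```python
import numpy as np

def hard_clip(x, theta):
    """Hard-clip a waveform to the range [-theta, theta]."""
    return np.clip(x, -theta, theta)

def snr_db(clean, degraded):
    """SNR of the degraded signal w.r.t. the clean signal, in dB."""
    noise = clean - degraded
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def clip_to_snr(x, target_db, iters=50):
    """Find a clipping threshold that yields the target SNR via bisection."""
    lo, hi = 0.0, float(np.max(np.abs(x)))
    for _ in range(iters):
        mid = (lo + hi) / 2
        if snr_db(x, hard_clip(x, mid)) < target_db:
            lo = mid  # clipping too severe -> raise the threshold
        else:
            hi = mid  # clipping too mild -> lower the threshold
    return (lo + hi) / 2
```

For example, `hard_clip(x, clip_to_snr(x, 1.0))` produces the heavily clipped (SNR = 1 dB) condition used in the listening tests below.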

Audio Samples


Inference on the VoiceBank-DEMAND dataset.
Below are some reconstructed VoiceBank-DEMAND test samples used in our MUSHRA-like test.
All DNN models are trained on the VoiceBank-DEMAND dataset (16 kHz).
The test-split samples below are normalized to -27 LUFS and were used as such for subjective testing.
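The -27 LUFS target refers to integrated loudness per ITU-R BS.1770; a faithful implementation applies K-weighting and gating before measuring level (e.g. via a loudness-metering library such as pyloudnorm). As a simplified, hypothetical sketch, gain matching against a plain RMS level looks like this:

```python
import numpy as np

def normalize_rms_db(x, target_db=-27.0):
    """Scale a waveform so its RMS level (in dB) matches target_db.

    NOTE: this is a simplification of LUFS normalization; true LUFS
    (ITU-R BS.1770) additionally applies K-weighting and gating
    before measuring the level.
    """
    rms_db = 10 * np.log10(np.mean(x ** 2))
    gain_db = target_db - rms_db
    return x * 10 ** (gain_db / 20)
```

For speech without long silences, the RMS level and the integrated loudness typically track each other closely, which is why this rough version is often used for quick sanity checks.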

Please listen with speakers, headphones, or earphones in a quiet environment.

Columns: Reference | Clipped (SNR = 1 dB) | A-SPADE | DDD (proposed) | T-UNet | DD (proposed)

Inference on the DNS-Challenge dataset.
Below are some reconstructed samples from the DNS-Challenge dataset, used in our MUSHRA-like test.
The test-split samples below are normalized to -27 LUFS and were used as such for subjective testing.


Columns: Reference | Clipped (SNR = 1 dB) | A-SPADE (unused in this test) | DDD (proposed) | T-UNet | DD (proposed)

Inference on the VoiceBank-DEMAND dataset.
All DNN models are trained on the VoiceBank-DEMAND dataset (16 kHz).
The test-split samples below are normalized to -27 LUFS.
We also disclose the full set of reconstructed test splits at various SNRs (1, 3, 7, and 15 dB).


Columns: Reference | SNR | Clipped | A-SPADE | DDD (proposed) | T-UNet | DD (proposed)
Rows: samples at SNR = 3 dB, 7 dB, and 15 dB (two samples per SNR)

Inference on the DNS-Challenge dataset.
All DNN models are trained on the VoiceBank-DEMAND dataset (16 kHz).
The test-split samples below are normalized to -27 LUFS.
We also disclose the full set of reconstructed test splits at various SNRs (1, 3, 7, and 15 dB).


Columns: Reference | SNR | Clipped | A-SPADE | DDD (proposed) | T-UNet | DD (proposed)
Rows: samples at SNR = 3 dB, 7 dB, and 15 dB (two samples per SNR)