MiniGAN
Towards an Efficient and Accurate Speech Enhancement by a Comprehensive Ablation Study
Authors: Lidia Abad, Fernando López, Jordi Luque. Paper: link
Abstract
In recent years, significant advancements in speech enhancement have been made through phase reconstruction, dual-branch methodologies, and attention mechanisms. These methods produce exceptional results, but at the expense of a high computational budget. This work aims to improve the efficiency of the MP-SENet architecture by introducing MiniGAN, a generative adversarial network operating in the time-frequency domain. It features an encoder-decoder structure with residual connections, conformers, and parallel processing of the signal magnitude and phase. We employ data augmentation techniques during training, investigate the impact of various loss terms, and examine architectural alterations to achieve lower operational costs without compromising performance. Our results on the VoiceBank+DEMAND evaluation set show that MiniGAN achieves competitive figures in objective metrics, obtaining a PESQ of 2.95, while maintaining low latency and reduced computational complexity. The proposed MiniGAN system is well suited for real-time applications on resource-constrained devices, as it achieves a real-time factor of 0.24 with a mere 373k parameters.
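The real-time factor (RTF) quoted above is the wall-clock processing time divided by the duration of the audio being processed, so values below 1 mean faster-than-real-time enhancement. Below is a minimal measurement sketch, assuming a hypothetical `model` callable that enhances a 16 kHz waveform (not code from the paper):

```python
import time
import torch

def real_time_factor(model, sample_rate=16000, seconds=10.0):
    """Estimate RTF = processing time / audio duration on a dummy waveform."""
    wave = torch.randn(1, int(sample_rate * seconds))  # random 10 s signal
    start = time.perf_counter()
    with torch.no_grad():
        model(wave)
    elapsed = time.perf_counter() - start
    return elapsed / seconds  # e.g. 0.24 -> about 4x faster than real time
```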
Model Structure
The model consists of a generator and a discriminator. The magnitude and phase of the noisy signal are obtained in the time-frequency (TF) domain by applying the Short-Time Fourier Transform (STFT). The generator takes as input the concatenated magnitude and phase estimated from the noisy signal. In the latent space of the generator, two conformers are placed: one attends to the time dimension and the other to the frequency dimension. They are followed by two decoders that process the magnitude and phase in parallel. The magnitude mask decoder outputs a mask that is applied to the noisy signal magnitude, yielding the enhanced signal magnitude. Finally, using the magnitude and phase of the enhanced signal, the enhanced audio is reconstructed in the time domain through the Inverse Short-Time Fourier Transform (ISTFT).
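To make this data flow concrete, below is a minimal PyTorch sketch of the generator's forward pass. All module names, layer choices, and STFT settings are illustrative assumptions rather than the paper's exact configuration; in particular, plain convolutions stand in for the conformer blocks and the residual-connected encoder-decoder.

```python
import torch
import torch.nn as nn

N_FFT, HOP = 400, 100  # assumed STFT settings, not from the paper

class GeneratorSketch(nn.Module):
    """Illustrative data flow of the MiniGAN generator; layers are placeholders."""
    def __init__(self, channels=16):
        super().__init__()
        # Encoder over the stacked (magnitude, phase) input.
        self.encoder = nn.Conv2d(2, channels, kernel_size=3, padding=1)
        # Stand-ins for the two latent-space conformers (time / frequency).
        self.time_conformer = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.freq_conformer = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        # Parallel decoders: a bounded magnitude mask, and the phase.
        self.mag_decoder = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.phase_decoder = nn.Conv2d(channels, 1, 1)

    def forward(self, noisy_wave):
        window = torch.hann_window(N_FFT)
        spec = torch.stft(noisy_wave, N_FFT, HOP, window=window,
                          return_complex=True)              # (B, F, T) complex
        mag, phase = spec.abs(), spec.angle()

        x = torch.stack([mag, phase], dim=1)                # (B, 2, F, T) input
        latent = self.encoder(x)
        latent = self.time_conformer(latent)                # attend over time
        latent = self.freq_conformer(latent)                # attend over frequency

        mask = self.mag_decoder(latent).squeeze(1)          # (B, F, T) in [0, 1]
        enh_mag = mask * mag                                # masked magnitude
        enh_phase = self.phase_decoder(latent).squeeze(1)

        # Rebuild the complex spectrum and invert it to the time domain.
        enh_spec = enh_mag * torch.exp(1j * enh_phase)
        return torch.istft(enh_spec, N_FFT, HOP, window=window)

# Usage: enhanced = GeneratorSketch()(torch.randn(1, 16000))
```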
Audio Samples
Audio samples for the models developed in our work: two samples from the VoiceBank+DEMAND public dataset (VB+D Audio 1 and VB+D Audio 2) and two audios recorded to test our models on a different speaker and noise conditions (Non VB+D Audio 1 and Non VB+D Audio 2). No clean version of the latter audios is provided, as those noisy recordings are not synthetic.
| System | VB+D Audio 1 | VB+D Audio 2 | Non VB+D Audio 1 | Non VB+D Audio 2 |
|---|---|---|---|---|
| Noisy | | | | |
| Clean | | | | |
| MiniGAN | | | | |
| MiniGAN-FT | | | | |
| MiniGAN-48 | | | | |
| MiniGAN-NDA | | | | |
| MiniGAN-WN | | | | |
| MiniGAN-ED | | | | |