UniFormer: Unifying Convolution and Self-attention for Visual Recognition
The discrepancy between the cost function used for training a speech
enhancement model and human auditory perception usually makes the quality of
enhanced speech unsatisfactory. Objective evaluation metrics that account for
human perception can therefore help bridge this gap. Our previously
proposed MetricGAN was designed to optimize objective metrics by connecting the
metric with a discriminator. Because only the scores of the target evaluation
functions are needed during training, the metrics can even be
non-differentiable. In this study, we propose MetricGAN+, which incorporates
three training techniques based on domain knowledge of speech processing.
With these techniques, experimental results on the VoiceBank-DEMAND
dataset show that MetricGAN+ increases the PESQ score by 0.3 over the
previous MetricGAN and achieves state-of-the-art results (PESQ score = 3.15).
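
The idea of connecting a metric to a discriminator can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover the three MetricGAN+ training techniques. It assumes the metric is normalized to [0, 1] and uses squared-error objectives. `toy_metric` is a hypothetical stand-in for a black-box score such as PESQ, and every name below is illustrative.

```python
from typing import Callable, Sequence

Signal = Sequence[float]
Scorer = Callable[[Signal, Signal], float]


def toy_metric(enhanced: Signal, clean: Signal) -> float:
    """Hypothetical stand-in for a black-box metric, normalized to [0, 1].

    The rounding step makes it non-differentiable on purpose: training only
    ever reads the returned score, never a gradient of the metric.
    """
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    return round(1.0 / (1.0 + mse), 2)


def discriminator_loss(d: Scorer, enhanced: Signal, clean: Signal,
                       metric: Scorer) -> float:
    """Train D to imitate the metric (assumed squared-error form).

    D should output 1 for the clean reference paired with itself, and the
    true metric score for the enhanced signal.
    """
    target = metric(enhanced, clean)  # a plain number, used only as a label
    return (d(clean, clean) - 1.0) ** 2 + (d(enhanced, clean) - target) ** 2


def generator_loss(d: Scorer, enhanced: Signal, clean: Signal,
                   desired_score: float = 1.0) -> float:
    """Push the enhanced signal toward the desired score as judged by D.

    In a real system D is a differentiable network, so gradients reach the
    generator through D rather than through the metric.
    """
    return (d(enhanced, clean) - desired_score) ** 2


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.0]
    enhanced = [0.5, 1.0, 0.0]
    # For illustration only, the metric itself plays the role of a perfectly
    # trained discriminator.
    print(discriminator_loss(toy_metric, enhanced, clean, toy_metric))
    print(generator_loss(toy_metric, enhanced, clean))
```

Because the metric appears only as a regression target for the discriminator, it can be replaced by any scoring function without changing the generator objective.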