Improve model card for CleanMel (#1)
Co-authored-by: Niels Rogge <[email protected]>

README.md CHANGED

---
language:
- en
- zh
license: bigscience-openrail-m
pipeline_tag: audio-to-audio
---

# CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

The CleanMel model was presented in the paper [CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR](https://huggingface.co/papers/2502.20040).

CleanMel is a single-channel Mel-spectrogram denoising and dereverberation network designed to improve both speech quality and automatic speech recognition (ASR) performance. It takes noisy and reverberant microphone recordings as input and predicts the corresponding clean Mel-spectrogram, which can then either be transformed into a speech waveform with a neural vocoder or used directly for ASR.

The network employs interleaved cross-band and narrow-band processing in the Mel-frequency domain, allowing it to learn full-band spectral patterns and narrow-band signal properties, respectively. A key advantage of Mel-spectrogram enhancement over linear-frequency-domain or time-domain speech enhancement is that the Mel representation describes speech more compactly and is therefore easier to learn, which benefits both speech quality and ASR. Experiments on five English datasets and one Chinese dataset demonstrate significant improvements.

* [Paper](https://huggingface.co/papers/2502.20040)
* [Project Page](https://audio.westlake.edu.cn/Research/CleanMel.html)
* [GitHub Repository](https://github.com/Audio-WestlakeU/CleanMel)
* [Hugging Face Demo](https://huggingface.co/spaces/SaoYear/CleanMel)

## Overview

<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/cleanmel_arch.png" alt="CleanMel Architecture" width="60%"/></p>

**CleanMel** enhances logMel spectrograms for improved speech quality and ASR performance. Outputs are compatible with:

- Vocoders for enhanced waveforms
- ASR systems for transcription
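
For orientation, here is a minimal sketch of how a logMel spectrogram of the kind CleanMel operates on can be computed with `torchaudio`. The file name and the STFT/Mel parameters (16 kHz sampling rate, 512-point FFT, 80 Mel bands) are illustrative assumptions for this sketch, not necessarily CleanMel's exact configuration; see the GitHub repository for the settings actually used.

```python
# Illustrative feature extraction: noisy waveform -> logMel spectrogram.
# All parameters below are assumptions for this sketch.
import torch
import torchaudio

waveform, sr = torchaudio.load("noisy_recording.wav")           # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # assume a 16 kHz model rate

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=256, n_mels=80
)
log_mel = torch.log(mel_transform(waveform).clamp(min=1e-5))    # logMel: the representation CleanMel enhances
print(log_mel.shape)  # (channels, n_mels, frames)
```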

## Quick Start

### Environment Setup

```bash
conda create -n CleanMel python=3.10.14
conda activate CleanMel
pip install -r requirements.txt
```

### Inference

Pretrained models can be downloaded manually from the [WestlakeAudioLab/CleanMel](https://huggingface.co/WestlakeAudioLab/CleanMel) repository, or automatically with the help of the `huggingface-hub` package.
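
If you prefer the local-model route shown below, a small `huggingface_hub` sketch like the following can fetch the checkpoints; the target directory is an assumption for this example, so place the files wherever your local `inference.sh` configuration expects them.

```python
# Sketch: download the pretrained CleanMel checkpoints from the Hugging Face Hub.
# The local_dir value is an assumption for this example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="WestlakeAudioLab/CleanMel",
    local_dir="pretrained",
)
print(f"Checkpoints downloaded to: {local_path}")
```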

```bash
# Inference with pretrained models from huggingface
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask huggingface

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map huggingface

# Inference with local pretrained models
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map
```

**Custom Input**: Modify `speech_folder` in `inference.sh` to point at your own recordings.

**Output**: Results are saved to `output_folder` (defaults to `./my_output`).
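
As a quick sanity check of a run, the sketch below lists the enhanced results; it assumes the outputs are written as `.wav` files under the default `./my_output`, so adjust the path and extension if your `output_folder` or output format differs.

```python
# Sketch: inspect the enhanced outputs (assumes .wav files under ./my_output).
from pathlib import Path
import soundfile as sf

out_dir = Path("./my_output")
for wav_path in sorted(out_dir.glob("**/*.wav")):
    audio, sr = sf.read(wav_path)
    print(f"{wav_path.name}: {len(audio) / sr:.2f} s at {sr} Hz")
```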

## Performance

### Speech Enhancement

<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/dnsmos_performance.png" alt="DNSMOS Performance" width="70%"/></p>
<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/pesq_performance.png" alt="PESQ Performance" width="40%"/></p>

### ASR Accuracy

<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/asr_performance.png" alt="ASR Performance" width="40%"/></p>

ASR implementation details are available in the [`asr_infer` branch](https://github.com/Audio-WestlakeU/CleanMel/tree/asr_infer) of the GitHub repository.

## Citation

If you find CleanMel useful, please cite our work:

```bibtex
@ARTICLE{11097896,
  author={Shao, Nian and Zhou, Rui and Wang, Pengyu and Li, Xian and Fang, Ying and Yang, Yujie and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR},
  year={2025},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TASLPRO.2025.3592333}
}
```

## Acknowledgement

- Built using the [NBSS](https://github.com/Audio-WestlakeU/NBSS) template
- Vocoder implementation from [Vocos](https://github.com/gemelo-ai/vocos)