SaoYear and nielsr (HF Staff) committed · verified
Commit 131972f · Parent(s): cf82a4e

Improve model card for CleanMel (#1)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +84 -3
README.md CHANGED
```diff
@@ -1,7 +1,88 @@
 ---
-license: bigscience-openrail-m
 language:
 - en
 - cn
-pipeline_tag: audio-to-audio; ASR
----
+license: bigscience-openrail-m
+pipeline_tag: audio-to-audio
+---
```
# CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

The CleanMel model was presented in the paper [CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR](https://huggingface.co/papers/2502.20040).

CleanMel is a single-channel Mel-spectrogram denoising and dereverberation network designed to improve both speech quality and automatic speech recognition (ASR) performance. It takes noisy and reverberant microphone recordings as input and predicts the corresponding clean Mel-spectrogram. This enhanced Mel-spectrogram can then be either transformed to a speech waveform with a neural vocoder or directly used for ASR.
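The pipeline position described above can be sketched end-to-end. The following is a minimal numpy sketch, not the authors' implementation: the STFT size, hop, Mel resolution, and sampling rate are assumptions, and the `X_clean = X` line is a placeholder where the CleanMel network would predict the clean logMel.

```python
import numpy as np

def mel_filterbank(n_mels=80, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular Mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(lo, c):
            fb[i, b] = (b - lo) / max(c - lo, 1)   # rising edge
        for b in range(c, hi):
            fb[i, b] = (hi - b) / max(hi - c, 1)   # falling edge
    return fb

def log_mel(wave, n_fft=512, hop=256, n_mels=80, sr=16000):
    """Frame the waveform, take the STFT magnitude, and map it to logMel."""
    frames = [wave[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(wave) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=-1))   # (T, 257)
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T        # (T, 80)
    return np.log(np.clip(mel, 1e-10, None))

# 1 second of synthetic "noisy" speech: a tone plus white noise.
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) \
        + 0.3 * rng.standard_normal(16000)
X = log_mel(noisy)   # noisy logMel, shape (61, 80)
X_clean = X          # placeholder: CleanMel would predict the clean logMel here
print(X.shape)
```

From `X_clean`, a vocoder would synthesize a waveform, or an ASR front end would consume the logMel features directly.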
The proposed network employs interleaved cross-band and narrow-band processing in the Mel-frequency domain, which allows it to learn full-band spectral patterns and narrow-band properties of signals, respectively. A key advantage of Mel-spectrogram enhancement, compared to linear-frequency-domain or time-domain speech enhancement, is that the Mel scale presents speech in a more compact way, making it easier to learn. This compactness benefits both speech quality and ASR. Experimental results on five English datasets and one Chinese dataset demonstrate significant improvements.
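The compactness argument can be made concrete. With typical front-end parameters (assumed here, not taken from the paper), a one-sided 512-point STFT has 257 linear-frequency bins per frame, while an 80-band Mel representation compresses the same frame by roughly a factor of three:

```python
n_fft = 512                    # assumed STFT size
linear_bins = n_fft // 2 + 1   # one-sided spectrum: 257 bins per frame
mel_bins = 80                  # typical Mel resolution for ASR front ends

ratio = linear_bins / mel_bins
print(f"{linear_bins} linear bins -> {mel_bins} Mel bins ({ratio:.1f}x smaller)")
```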
* 📚 [Paper](https://huggingface.co/papers/2502.20040)
* 🌐 [Project Page](https://audio.westlake.edu.cn/Research/CleanMel.html)
* 💻 [GitHub Repository](https://github.com/Audio-WestlakeU/CleanMel)
* 🚀 [Hugging Face Demo](https://huggingface.co/spaces/SaoYear/CleanMel)
## Overview 🚀
<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/cleanmel_arch.png" alt="CleanMel Architecture" width="60%"/></p>

**CleanMel** enhances logMel spectrograms for improved speech quality and ASR performance. Outputs are compatible with:
- 🎙️ Vocoders for enhanced waveforms
- 🤖 ASR systems for transcription
29
+ ## Quick Start ⚑
30
+
31
+ ### Environment Setup
32
+ ```bash
33
+ conda create -n CleanMel python=3.10.14
34
+ conda activate CleanMel
35
+ pip install -r requirements.txt
36
+ ```
### Inference
Pretrained models can be downloaded manually from the [WestlakeAudioLab/CleanMel](https://huggingface.co/WestlakeAudioLab/CleanMel) repository, or automatically with the help of the `huggingface-hub` package.
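For programmatic downloads, a minimal sketch using the `huggingface-hub` Python package is shown below. The checkpoint filename is a hypothetical placeholder, not confirmed by the repository; check the repo's file listing for the actual names.

```python
REPO_ID = "WestlakeAudioLab/CleanMel"
# Hypothetical filename -- verify against the repository's file listing.
DEFAULT_CKPT = "offline_CleanMel_S_mask.ckpt"

def fetch_checkpoint(filename: str = DEFAULT_CKPT) -> str:
    """Download one pretrained CleanMel checkpoint and return its local cache path."""
    # Requires `pip install huggingface-hub`.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=filename)

print(REPO_ID)
```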
```bash
# Inference with pretrained models from huggingface
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask huggingface

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map huggingface

# Inference with local pretrained models
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map
```
**Custom Input**: Modify `speech_folder` in `inference.sh`.

**Output**: Results are saved to `output_folder` (defaults to `./my_output`).
## Performance 📊
### Speech Enhancement
<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/dnsmos_performance.png" alt="DNSMOS Performance" width="70%"/></p>
<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/pesq_performance.png" alt="PESQ Performance" width="40%"/></p>

### ASR Accuracy
<p align="center"><img src="https://github.com/Audio-WestlakeU/CleanMel/raw/main/src/imgs/asr_performance.png" alt="ASR Performance" width="40%"/></p>

💡 ASR implementation details are available in the [`asr_infer` branch](https://github.com/Audio-WestlakeU/CleanMel/tree/asr_infer) of the GitHub repository.
## Citation 📝
If you find CleanMel useful, please cite our work:
```bibtex
@ARTICLE{11097896,
  author={Shao, Nian and Zhou, Rui and Wang, Pengyu and Li, Xian and Fang, Ying and Yang, Yujie and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR},
  year={2025},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TASLPRO.2025.3592333}
}
```
## Acknowledgement 🙏
- Built using the [NBSS](https://github.com/Audio-WestlakeU/NBSS) template
- Vocoder implementation from [Vocos](https://github.com/gemelo-ai/vocos)