Yuriy Perezhohin
Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe: a 1B-parameter multimodal embedding model that assesses synthetic speech quality at the word level, thereby improving the quality of synthetically augmented datasets for training ASR models.
Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.
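The post doesn't enumerate the five corruption strategies, but contrastive negatives for audio-text alignment are commonly built by perturbing the transcript side. A minimal illustrative sketch (the strategy names and perturbations here are assumptions, not the model's actual recipe):

```python
import random

def corrupt_transcript(words, strategy, rng):
    """Apply one hypothetical text-side corruption to a word list (illustrative only)."""
    words = list(words)
    if strategy == "delete" and len(words) > 1:
        words.pop(rng.randrange(len(words)))             # dropped word -> timing mismatch
    elif strategy == "swap" and len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]  # reordered words -> alignment error
    elif strategy == "substitute":
        i = rng.randrange(len(words))
        words[i] = words[i][::-1]                        # mangled word -> "mispronunciation"
    return words

rng = random.Random(0)
print(corrupt_transcript(["ik", "ben", "hier"], "swap", rng))
```

Pairs corrupted this way no longer match their audio at specific word positions, which is exactly what a word-level (rather than sentence-level) embedding can penalize.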
Resources
- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe
This model builds on CommonVoice Dutch data; thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.
Would be great to hear from the Dutch NLP community (@BramVanroy, @GroNLP), especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio, as this sits at the intersection of speech processing and data curation.
Hello everyone! Given the community's increased interest in WAVe for Portuguese, the team has retrained the model for 100 epochs to further extend learning. The results are much better than those of the previous 30-epoch version.
Key improvements:
| Metric | 30 ep | 100 ep | Change |
|---|---|---|---|
| Loss | 0.49 | 0.22 | -56% |
| Alignment Gap | 0.079 | 0.118 | +49% |
| Corrupt Similarity | 0.31 | 0.23 | -25% |
The biggest win is the alignment gap growing by roughly 50%: the model is now much better at catching word-level errors like mispronunciations and timing artifacts. Corrupt pairs get penalized harder (0.23 vs. 0.31 similarity), so the filtering threshold becomes more reliable.
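With clean pairs scoring well above the corrupt-pair similarity of 0.23, a mid-range cutoff separates the two populations. A minimal filtering sketch; the 0.5 threshold and the example scores are illustrative assumptions, not values from the model card:

```python
def filter_synthetic_pairs(scored_pairs, threshold=0.5):
    """Keep only audio-transcript pairs whose quality score clears the threshold.

    scored_pairs: iterable of (sample_id, quality_score) tuples.
    threshold: illustrative cutoff between clean and corrupt score ranges.
    """
    return [sid for sid, score in scored_pairs if score >= threshold]

scored = [("utt1", 0.81), ("utt2", 0.23), ("utt3", 0.64), ("utt4", 0.30)]
print(filter_synthetic_pairs(scored))  # ['utt1', 'utt3']
```

The harder penalty on corrupt pairs widens the margin around any such threshold, which is why the filtering decision becomes more stable after the 100-epoch retraining.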
Same repo, same API, drop-in replacement:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("yuriyvnv/WAVe-1B-Multimodal-PT", trust_remote_code=True)
```
The updated README on the model card includes side-by-side training curves for both versions; check it out.
Hello everyone, yesterday there were minor problems that prevented use of the embedding model, mainly caused by the processor class. The team has already fixed the bugs. If you still run into a problem, first delete the cached model (in the Hugging Face .cache folder) and redownload it; if the issue persists, open a thread on the model page.
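Assuming the default huggingface_hub cache layout (a `models--{org}--{name}` folder under `~/.cache/huggingface/hub`), the folder to delete can be located like this; `huggingface-cli delete-cache` is an interactive alternative:

```python
from pathlib import Path

def hf_model_cache_dir(repo_id, cache_root=None):
    """Locate a model's default Hugging Face cache folder (layout: models--org--name)."""
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))

path = hf_model_cache_dir("yuriyvnv/WAVe-1B-Multimodal-PT")
print(path.name)  # models--yuriyvnv--WAVe-1B-Multimodal-PT
# Delete this folder, then rerun from_pretrained() to force a fresh download.
```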
Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.
Impact on Portuguese ASR:
- 34% reduction in training steps
- 50% better cross-domain generalization
- 30% less synthetic data needed
- Word-aligned attention finds errors other methods miss
Architecture:
- Text: XLM-RoBERTa (278M params)
- Audio: Wav2Vec2-BERT 2.0 (581M params)
- Word Alignment: Multi-head attention + GLU (14M params)
- Total: 1B parameters
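A rough PyTorch sketch of what a "multi-head attention + GLU" word-alignment block could look like: text tokens query the audio frames they should align to, and a GLU gates the attended features. The hidden size, head count, and wiring here are assumptions for illustration, not the actual WAVe module:

```python
import torch
import torch.nn as nn

class WordAlignmentHead(nn.Module):
    """Illustrative word-alignment block: text attends over audio, then a GLU gate."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.glu_proj = nn.Linear(dim, 2 * dim)  # GLU halves the channels back to dim

    def forward(self, text_states, audio_states):
        # Each word/token queries the audio frames it should align to.
        aligned, weights = self.attn(text_states, audio_states, audio_states)
        return nn.functional.glu(self.glu_proj(aligned), dim=-1), weights

head = WordAlignmentHead()
txt = torch.randn(2, 12, 1024)   # (batch, words, dim)
aud = torch.randn(2, 200, 1024)  # (batch, audio frames, dim)
out, attn_w = head(txt, aud)
print(out.shape, attn_w.shape)   # torch.Size([2, 12, 1024]) torch.Size([2, 12, 200])
```

The per-word attention weights are what make word-level error localization possible, in contrast to a single pooled sentence embedding.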
```python
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)

# Assess speech-transcript alignment
inputs = processor(text="Olá, como está?", audio=audio_array, sampling_rate=16000, return_tensors="pt")
quality = model(**inputs).quality_score.item()
```

Perfect for filtering synthetic speech datasets before ASR training.
Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code to build WAVe: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment