Upload README.md with huggingface_hub
README.md CHANGED
@@ -121,6 +121,28 @@ similarities = torch.einsum("btd,bd->bt", audio_embeds, text_embeds)
# similarities shape: [batch_size, num_frames]
```

### Usage with 🤗 Transformers

```python
from transformers import PeAudioFrameLevelModel, PeAudioProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PeAudioFrameLevelModel.from_pretrained("facebook/pe-a-frame-large").to(device)
processor = PeAudioProcessor.from_pretrained("facebook/pe-a-frame-large")

audio_file = "path/to/audio.wav"  # placeholder: a local audio file
descriptions = ["a dog barking"]  # one description per audio clip (paired by the einsum below)

inputs = processor(audio=[audio_file], text=descriptions, return_tensors="pt").to(device)

with torch.inference_mode():
    outputs = model(**inputs)

# Access embeddings
audio_embeds = outputs.audio_embeds  # Shape: [batch_size, num_frames, embed_dim]
text_embeds = outputs.text_audio_embeds  # Shape: [batch_size, embed_dim]

# Compute similarity between audio frames and text
# audio_embeds is frame-level, so you can see which frames match the description
similarities = torch.einsum("btd,bd->bt", audio_embeds, text_embeds)
# similarities shape: [batch_size, num_frames]
```
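Because the scores are per frame, a natural follow-up is to localize where in the clip a description matches best. The sketch below is a minimal illustration building on the `similarities` and `descriptions` variables from the snippet above; the frame hop duration is an assumed placeholder, not a documented property of this model.

```python
# Minimal sketch: localize the best-matching frame for each description.
# Builds on `similarities` ([batch_size, num_frames]) and `descriptions` above.
best_frames = similarities.argmax(dim=-1)  # index of the highest-scoring frame

# Convert frame indices to rough timestamps. 0.04 s per frame is an assumed
# placeholder; substitute the model's actual frame hop.
frame_hop_seconds = 0.04
best_times = best_frames.float() * frame_hop_seconds

for desc, t in zip(descriptions, best_times.tolist()):
    print(f"{desc!r} matches best around {t:.2f}s")
```

Note that the einsum above produces raw dot products; if the embeddings are not already unit-normalized, applying `torch.nn.functional.normalize` to both tensors first yields cosine similarities instead.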
## Citation
```bibtex