Figure: SAGE teaser.

System Capabilities

SAGE-MM operates as the core decision-maker within the SAGE system. It functions in two distinct stages (a control-flow sketch follows the list):

  1. Stage-1 (Context VLM): The model analyzes initial sampled frames and metadata to determine if the query can be answered immediately ("single-turn") or if it requires tool usage ("multi-turn").
  2. Stage-2 (Iterative Reasoner): If tools are needed, the model enters a loop where it calls tools, analyzes their output, and updates its context until a final answer is derived.
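
The following is a minimal Python sketch of this two-stage control flow. The callables (`context_vlm`, `iterative_reasoner`, `execute_tool`) and the decision/action schema are illustrative assumptions for this sketch, not the actual SAGE interface.

```python
import json

def answer_query(context_vlm, iterative_reasoner, execute_tool,
                 sampled_frames, metadata, query):
    """Illustrative two-stage loop; helpers and field names are hypothetical."""
    # Stage-1: the Context VLM inspects the sampled frames and metadata and
    # decides whether the query can be answered immediately.
    decision = context_vlm(sampled_frames, metadata, query)
    if decision["mode"] == "single-turn":
        return decision["answer"]

    # Stage-2: iterate tool calls until the model produces a final answer.
    context = decision["context"]
    while True:
        raw_action = iterative_reasoner(context, query)   # JSON action string
        action = json.loads(raw_action)
        if action["tool"] == "final-answer":              # assumed stop action
            return action["answer"]
        observation = execute_tool(action)                # run the requested tool
        context = context + [observation]                 # fold the result back in
```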

Supported Tools

The model is trained to generate JSON-formatted actions to invoke the following tools (example action strings are shown after the list):

  • web-search: Search the internet for external knowledge (e.g., sports standings, cast lists).
  • transcribe-speech: Perform ASR on specific timestamped segments of the video.
  • ground-event: Locate start/end timestamps for specific visual events.
  • extract-video-parts: Extract high-resolution frames or subclips from specific timestamps.
  • analyze: Perform detailed visual analysis on extracted media.
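
The snippet below only illustrates the general shape of such JSON actions; the exact field names, argument structure, and timestamp units are assumptions, and the authoritative schema is defined by the SAGE runtime.

```python
import json

# Illustrative action strings only; the real schema is defined by the SAGE runtime.
examples = [
    '{"tool": "web-search", "arguments": {"query": "Premier League standings 2023-24"}}',
    '{"tool": "transcribe-speech", "arguments": {"start": 12.0, "end": 34.5}}',
    '{"tool": "extract-video-parts", "arguments": {"timestamps": [45.0, 47.5]}}',
]

for raw in examples:
    action = json.loads(raw)
    print(f'{action["tool"]}: {action["arguments"]}')
```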

Usage

Note: SAGE-MM outputs JSON action strings. It requires a runtime environment (provided in our GitHub repo) to parse these strings, execute the tools, and feed the observation back to the model.
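
A minimal sketch of such a runtime loop is shown below, assuming a hypothetical `generate_action` helper that wraps the model call and a registry of Python callables implementing each tool; the actual runtime in the GitHub repo may differ.

```python
import json

def run_episode(generate_action, tools, messages, max_turns=10):
    """Hypothetical driver loop: parse the model's JSON action, execute the
    named tool, and feed the observation back until an answer is returned."""
    for _ in range(max_turns):
        raw = generate_action(messages)           # model emits a JSON action string
        action = json.loads(raw)
        if "answer" in action:                    # assumed terminal form
            return action["answer"]
        tool_fn = tools[action["tool"]]           # e.g. tools["web-search"]
        observation = tool_fn(**action["arguments"])
        # Feed the tool output back to the model as a new turn.
        messages.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("No final answer within the turn budget")
```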

License

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
