Figure: SAGE teaser.

System Capabilities

SAGE-MM operates as the core decision-maker within the SAGE system. It functions in two distinct stages (a control-flow sketch follows the list):

  1. Stage-1 (Context VLM): The model analyzes initial sampled frames and metadata to determine if the query can be answered immediately ("single-turn") or if it requires tool usage ("multi-turn").
  2. Stage-2 (Iterative Reasoner): If tools are needed, the model enters a loop where it calls tools, analyzes their output, and updates its context until a final answer is derived.
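
The following is a minimal Python sketch of this two-stage control flow. The callables (`context_vlm`, `iterative_reasoner`, `execute_tool`) and the decision/action schema are illustrative assumptions for this sketch, not the actual SAGE interface.

```python
import json

def answer_query(context_vlm, iterative_reasoner, execute_tool,
                 sampled_frames, metadata, query):
    """Illustrative two-stage loop; helpers and field names are hypothetical."""
    # Stage-1: the Context VLM inspects the sampled frames and metadata and
    # decides whether the query can be answered immediately.
    decision = context_vlm(sampled_frames, metadata, query)
    if decision["mode"] == "single-turn":
        return decision["answer"]

    # Stage-2: iterate tool calls until the model produces a final answer.
    context = decision["context"]
    while True:
        raw_action = iterative_reasoner(context, query)   # JSON action string
        action = json.loads(raw_action)
        if action["tool"] == "final-answer":              # assumed stop action
            return action["answer"]
        observation = execute_tool(action)                # run the requested tool
        context = context + [observation]                 # fold the result back in
```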

Supported Tools

The model is trained to generate JSON-formatted actions to invoke the following tools (example action strings are shown after the list):

  • web-search: Search the internet for external knowledge (e.g., sports standings, cast lists).
  • transcribe-speech: Perform ASR on specific timestamped segments of the video.
  • ground-event: Locate start/end timestamps for specific visual events.
  • extract-video-parts: Extract high-resolution frames or subclips from specific timestamps.
  • analyze: Perform detailed visual analysis on extracted media.
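
The snippet below only illustrates the general shape of such JSON actions; the exact field names, argument structure, and timestamp units are assumptions, and the authoritative schema is defined by the SAGE runtime.

```python
import json

# Illustrative action strings only; the real schema is defined by the SAGE runtime.
examples = [
    '{"tool": "web-search", "arguments": {"query": "Premier League standings 2023-24"}}',
    '{"tool": "transcribe-speech", "arguments": {"start": 12.0, "end": 34.5}}',
    '{"tool": "extract-video-parts", "arguments": {"timestamps": [45.0, 47.5]}}',
]

for raw in examples:
    action = json.loads(raw)
    print(f'{action["tool"]}: {action["arguments"]}')
```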

Usage

Note: SAGE-MM outputs JSON action strings. It requires a runtime environment (provided in our GitHub repo) to parse these strings, execute the tools, and feed the observation back to the model.
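
A minimal sketch of such a runtime loop is shown below, assuming a hypothetical `generate_action` helper that wraps the model call and a registry of Python callables implementing each tool; the actual runtime in the GitHub repo may differ.

```python
import json

def run_episode(generate_action, tools, messages, max_turns=10):
    """Hypothetical driver loop: parse the model's JSON action, execute the
    named tool, and feed the observation back until an answer is returned."""
    for _ in range(max_turns):
        raw = generate_action(messages)           # model emits a JSON action string
        action = json.loads(raw)
        if "answer" in action:                    # assumed terminal form
            return action["answer"]
        tool_fn = tools[action["tool"]]           # e.g. tools["web-search"]
        observation = tool_fn(**action["arguments"])
        # Feed the tool output back to the model as a new turn.
        messages.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("No final answer within the turn budget")
```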

License

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
