CPPO: Contrastive Perception for Vision Language Policy Optimization
Abstract
We introduce CPPO, a Contrastive Perception Policy Optimization method for fine-tuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both perception and reasoning. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult: it requires extra LLMs, ground-truth data, forcing the policy model to separate perception from reasoning, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.
Highlights
- Contrastive Perception Policy Optimization (CPPO) – A framework for improving vision-language policy reinforcement learning via contrastive perception training.
- Stronger Empirical Performance – Demonstrates consistent gains on complex multimodal reasoning tasks.
- Entropy-Based Perception Token Detection – Automatically locates informative visual tokens through perturbation sensitivity.
- Contrastive Perception Loss (CPL) – Encourages the policy to gain discriminative perception (see the sketch after this list).
- No External Supervision – Perception improvement is gained purely from information-removing and information-preserving augmentations, without the use of ground-truth visual information.
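The sketch below illustrates, under simplifying assumptions, how the two mechanisms above could fit together: per-token entropy shifts between a clean and a perturbed image locate perception tokens, and a contrastive term rewards consistency under an information-preserving view and separation under an information-removing view. The threshold `tau`, margin `margin`, weight `beta`, and the hinge form of the sensitivity term are illustrative choices, not the released training code.

import torch
import torch.nn.functional as F

def token_entropy(logits):
    # Per-token entropy of the policy's next-token distribution; logits: [seq_len, vocab].
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

def detect_perception_tokens(logits_clean, logits_perturbed, tau=0.5):
    # Flag tokens whose output entropy shifts strongly when the input image is perturbed.
    shift = (token_entropy(logits_perturbed) - token_entropy(logits_clean)).abs()
    return (shift > tau).float()

def contrastive_perception_loss(logp_clean, logp_preserve, logp_remove, mask, margin=1.0, beta=1.0):
    # Averaged over detected perception tokens:
    #   consistency: token log-probs should stay close under an information-preserving view
    #   sensitivity: token log-probs should differ by at least `margin` under an information-removing view
    consistency = (logp_clean - logp_preserve).pow(2)
    sensitivity = F.relu(margin - (logp_clean - logp_remove).abs())
    return ((consistency + beta * sensitivity) * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random tensors standing in for per-token policy outputs.
seq_len, vocab = 8, 32
labels = torch.randint(vocab, (seq_len,))
logits_clean = torch.randn(seq_len, vocab)
logits_preserve = logits_clean + 0.1 * torch.randn(seq_len, vocab)  # information-preserving view
logits_remove = torch.randn(seq_len, vocab)                         # information-removing view

logp = lambda lg: F.log_softmax(lg, dim=-1).gather(-1, labels[:, None]).squeeze(-1)
mask = detect_perception_tokens(logits_clean, logits_remove)
cpl = contrastive_perception_loss(logp(logits_clean), logp(logits_preserve), logp(logits_remove), mask)
print(cpl)

In full training, a term of this kind would be added to the RL objective with a weighting coefficient, as described in the abstract above.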
Inference
CPPO models are based on the HuggingFace Qwen2.5-VL model. When running inference, format your prompts with the following instruction template to ensure outputs include reasoning within <think> </think> tags and final answers in \boxed{} notation:
from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# Load model and processor
model_name = "path/to/cppo-7B"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
# Instruction template
instruction_following = (
    r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. "
    r"The reasoning process MUST BE enclosed within <think> </think> tags. "
    r"The final answer MUST BE put in \boxed{}."
)
# Prepare prompt with instruction following
prompt = "Your question here. " + instruction_following
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": Image.open("path/to/image.jpg"),
            },
            {"type": "text", "text": prompt},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
# Generate output
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=4096)
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, outputs)
]
response = processor.decode(generated_ids[0], skip_special_tokens=True)
print(response)
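Because the template requests the reasoning inside <think> </think> tags and the answer in \boxed{}, a small post-processing step can separate the two fields. The helper below is a regex sketch (not part of the released code) and only handles answers without nested braces:

import re

def parse_cppo_output(response: str):
    # Split a CPPO response into its <think> reasoning and \boxed{} answer.
    think = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    answer = re.search(r"\\boxed\{([^{}]*)\}", response)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = parse_cppo_output(response)
print("Reasoning:", reasoning)
print("Answer:", answer)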