Diffusers documentation

Cosmos

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.36.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Cosmos

Cosmos World Foundation Model Platform for Physical AI by NVIDIA.

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Loading original format checkpoints

Original format checkpoints that have not been converted to diffusers-expected format can be loaded using the from_single_file method.

import torch
from diffusers import Cosmos2TextToImagePipeline, CosmosTransformer3DModel

model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
transformer = CosmosTransformer3DModel.from_single_file(
    "https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

output = pipe(
    prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).images[0]
output.save("output.png")

CosmosTextToWorldPipeline

class diffusers.CosmosTextToWorldPipeline

< >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )

Parameters

  • text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
  • tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
  • transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.

Pipeline for text-to-world generation using Cosmos Predict1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~CosmosPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • height (int, defaults to 720) — The height in pixels of the generated image.
  • width (int, defaults to 1280) — The width in pixels of the generated image.
  • num_frames (int, defaults to 121) — The number of frames in the generated video.
  • num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1.
  • fps (int, defaults to 30) — The frames per second of the generated video.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during the inference. with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import CosmosTextToWorldPipeline
>>> from diffusers.utils import export_to_video

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
>>> pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

>>> output = pipe(prompt=prompt).frames[0]
>>> export_to_video(output, "output.mp4", fps=30)

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device — (torch.device, optional): torch device
  • dtype — (torch.dtype, optional): torch dtype

Encodes the prompt into text encoder hidden states.

CosmosVideoToWorldPipeline

class diffusers.CosmosVideoToWorldPipeline

< >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )

Parameters

  • text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
  • tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
  • transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.

Pipeline for image-to-world and video-to-world generation using Cosmos Predict-1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 input_frames_guidance: bool = False augment_sigma: float = 0.001 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~CosmosPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • height (int, defaults to 720) — The height in pixels of the generated image.
  • width (int, defaults to 1280) — The width in pixels of the generated image.
  • num_frames (int, defaults to 121) — The number of frames in the generated video.
  • num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1.
  • fps (int, defaults to 30) — The frames per second of the generated video.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during the inference. with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

Image conditioning:

>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day."
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg"
... )

>>> video = pipe(image=image, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

Video conditioning:

>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_video

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.transformer = torch.compile(pipe.transformer)
>>> pipe.to("cuda")

>>> prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
>>> video = load_video(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
... )[
...     :21
... ]  # This example uses only the first 21 frames

>>> video = pipe(video=video, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device — (torch.device, optional): torch device
  • dtype — (torch.dtype, optional): torch dtype

Encodes the prompt into text encoder hidden states.

Cosmos2TextToImagePipeline

class diffusers.Cosmos2TextToImagePipeline

< >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler safety_checker: CosmosSafetyChecker = None )

Parameters

  • text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
  • tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
  • transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.

Pipeline for text-to-image generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 768 width: int = 1360 num_inference_steps: int = 35 guidance_scale: float = 7.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~CosmosImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • height (int, defaults to 768) — The height in pixels of the generated image.
  • width (int, defaults to 1360) — The width in pixels of the generated image.
  • num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosImagePipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during the inference. with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

~CosmosImagePipelineOutput or tuple

If return_dict is True, CosmosImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import Cosmos2TextToImagePipeline

>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Text2Image, nvidia/Cosmos-Predict2-14B-Text2Image
>>> model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
>>> pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

>>> output = pipe(
...     prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).images[0]
>>> output.save("output.png")

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_images_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device — (torch.device, optional): torch device
  • dtype — (torch.dtype, optional): torch dtype

Encodes the prompt into text encoder hidden states.

Cosmos2VideoToWorldPipeline

class diffusers.Cosmos2VideoToWorldPipeline

< >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler safety_checker: CosmosSafetyChecker = None )

Parameters

  • text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
  • tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
  • transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.

Pipeline for video-to-world generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 93 num_inference_steps: int = 35 guidance_scale: float = 7.0 fps: int = 16 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 sigma_conditioning: float = 0.0001 ) ~CosmosPipelineOutput or tuple

Parameters

  • image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — The image to be used as a conditioning input for the video generation.
  • video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — The video to be used as a conditioning input for the video generation.
  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
  • height (int, defaults to 704) — The height in pixels of the generated image.
  • width (int, defaults to 1280) — The width in pixels of the generated image.
  • num_frames (int, defaults to 93) — The number of frames in the generated video.
  • num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1.
  • fps (int, defaults to 16) — The frames per second of the generated video.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during the inference. with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated. If the prompt is shorter than this length, it will be padded.
  • sigma_conditioning (float, defaults to 0.0001) — The sigma value used for scaling conditioning latents. Ideally, it should not be changed or should be set to a small value close to zero.

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import Cosmos2VideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Video2World, nvidia/Cosmos-Predict2-14B-Video2World
>>> model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
>>> pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yellow-scrubber.png"
... )

>>> video = pipe(
...     image=image, prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=16)

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device — (torch.device, optional): torch device
  • dtype — (torch.dtype, optional): torch dtype

Encodes the prompt into text encoder hidden states.

Cosmos2_5_PredictBasePipeline

class diffusers.Cosmos2_5_PredictBasePipeline

< >

( text_encoder: Qwen2_5_VLForConditionalGeneration tokenizer: AutoTokenizer transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: UniPCMultistepScheduler safety_checker: CosmosSafetyChecker = None )

Parameters

  • text_encoder (Qwen2_5_VLForConditionalGeneration) — Frozen text-encoder. Cosmos Predict2.5 uses the Qwen2.5 VL encoder.
  • tokenizer (AutoTokenizer) — Tokenizer associated with the Qwen2.5 VL encoder.
  • transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
  • scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.

Pipeline for Cosmos Predict2.5 base model.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None video: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 93 num_inference_steps: int = 36 guidance_scale: float = 7.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 conditional_frame_timestep: float = 0.1 ) ~CosmosPipelineOutput or tuple

Parameters

  • image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — Optional single image for Image2World conditioning. Must be None when video is provided.
  • video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — Optional input video for Video2World conditioning. Must be None when image is provided.
  • prompt (str or List[str], optional) — The prompt or prompts to guide generation. Required unless prompt_embeds is supplied.
  • height (int, defaults to 704) — The height in pixels of the generated image.
  • width (int, defaults to 1280) — The width in pixels of the generated image.
  • num_frames (int, defaults to 93) — Number of output frames. Use 93 for world (video) generation; set to 1 to return a single frame.
  • num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2. of Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during the inference. with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated. If the prompt is shorter than this length, it will be padded.

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation. Supports three modes:

  • Text2World: image=None, video=None, prompt provided. Generates a world clip.
  • Image2World: image provided, video=None, prompt provided. Conditions on a single frame.
  • Video2World: video provided, image=None, prompt provided. Conditions on an input clip.

Set num_frames=93 (default) to produce a world video, or num_frames=1 to produce a single image frame (the above in “*2Image mode”).

Outputs follow output_type (e.g., "pil" returns a list of num_frames PIL images per prompt).

Examples:

>>> import torch
>>> from diffusers import Cosmos2_5_PredictBasePipeline
>>> from diffusers.utils import export_to_video, load_image, load_video

>>> model_id = "nvidia/Cosmos-Predict2.5-2B"
>>> pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
...     model_id, revision="diffusers/base/pre-trianed", torch_dtype=torch.bfloat16
... )
>>> pipe = pipe.to("cuda")

>>> # Common negative prompt reused across modes.
>>> negative_prompt = (
...     "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, "
...     "over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, "
...     "underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky "
...     "movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, "
...     "fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. "
...     "Overall, the video is of poor quality."
... )

>>> # Text2World: generate a 93-frame world video from text only.
>>> prompt = (
...     "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights "
...     "cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh "
...     "lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet "
...     "reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. "
...     "The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow "
...     "advance of traffic through the frosty city corridor."
... )
>>> video = pipe(
...     image=None,
...     video=None,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_frames=93,
...     generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "text2world.mp4", fps=16)

>>> # Image2World: condition on a single image and generate a 93-frame world video.
>>> prompt = (
...     "A high-definition video captures the precision of robotic welding in an industrial setting. "
...     "The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. "
...     "The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid "
...     "display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring "
...     "the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a "
...     "ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video "
...     "progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. "
...     "The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. "
...     "The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with "
...     "the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."
... )
>>> image = load_image(
...     "https://media.githubusercontent.com/media/nvidia-cosmos/cosmos-predict2.5/refs/heads/main/assets/base/robot_welding.jpg"
... )
>>> video = pipe(
...     image=image,
...     video=None,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_frames=93,
...     generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> # export_to_video(video, "image2world.mp4", fps=16)

>>> # Video2World: condition on an input clip and predict a 93-frame world video.
>>> prompt = (
...     "The video opens with an aerial view of a large-scale sand mining construction operation, showcasing extensive piles "
...     "of brown sand meticulously arranged in parallel rows. A central water channel, fed by a water pipe, flows through the "
...     "middle of these sand heaps, creating ripples and movement as it cascades down. The surrounding area features dense green "
...     "vegetation on the left, contrasting with the sandy terrain, while a body of water is visible in the background on the right. "
...     "As the video progresses, a piece of heavy machinery, likely a bulldozer, enters the frame from the right, moving slowly along "
...     "the edge of the sand piles. This machinery's presence indicates ongoing construction work in the operation. The final frame "
...     "captures the same scene, with the water continuing its flow and the bulldozer still in motion, maintaining the dynamic yet "
...     "steady pace of the construction activity."
... )
>>> input_video = load_video(
...     "https://github.com/nvidia-cosmos/cosmos-predict2.5/raw/refs/heads/main/assets/base/sand_mining.mp4"
... )
>>> video = pipe(
...     image=None,
...     video=input_video,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_frames=93,
...     generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "video2world.mp4", fps=16)

>>> # To produce an image instead of a world (video) clip, set num_frames=1 and
>>> # save the first frame: pipe(..., num_frames=1).frames[0][0].

encode_prompt

< >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier free guidance or not.
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • device — (torch.device, optional): torch device
  • dtype — (torch.dtype, optional): torch dtype

Encodes the prompt into text encoder hidden states.

CosmosPipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosPipelineOutput

< >

( frames: Tensor )

Parameters

  • frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs - It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Cosmos any-to-world/video pipelines.

CosmosImagePipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput

< >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or numpy array of shape (batch_size, height, width, num_channels). PIL images or numpy array present the denoised images of the diffusion pipeline.

Output class for Cosmos any-to-image pipelines.

Update on GitHub