A few questions about models

Hello,
I have a few questions and I would appreciate it if you could answer them in order.

1- In a text-to-video model, can I define the character’s face myself? For example, a man is walking down the street, but I want to use a photo for the man’s face.

2- I have a picture of a man standing on the street. If I use an image-to-video model and write a script that makes the man walk from the street into a library, is this possible?

3- If I want to use online AI services to produce ten one-minute text-to-video and photo-to-video videos per month, which service is best?

Thank you.


The full version is here.


1) In text-to-video, can you define the character’s face yourself using a photo?

Yes, on many services. But the reliable method is usually not “pure text-only T2V.” It is text + a face reference image (or a few references), or image-seeded video (you start from a “hero still” that shows the correct face).

Why “text-only face control” is unreliable

Text prompts like “a man walking down the street” do not uniquely define a real person’s facial identity. Video makes this harder because the system must keep the same face across many frames while the head turns, lighting changes, and the camera moves. So most practical tools use reference conditioning.

The 3 common ways platforms implement “use my face photo”

A) Reference images inside the video generator (fastest)

  • You upload 1–3 reference images.
  • You prompt the action and scene.
  • The system tries to keep the same character.

Runway is very explicit about this workflow:

  • Gen-4 References supports “drag in up to 3 reference images.”
  • You can save references and recall them via @ in your prompt. (Runway Academy)

B) “T2V” that is actually “image + text → video”
Some tools branded “text-to-video” still require a starting image for their best results.

Runway’s Gen-4 Video page states:

  • Gen-4 requires an input image
  • Turbo costs 5 credits/second
  • Gen-4 costs 12 credits/second (Runway)

So in practice you do:

  1. Make one good “hero still” of your man (face correct).
  2. Use it as the input image.
  3. Use text to specify “walking down the street,” camera, style, etc.
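
Here is a minimal sketch of what that “hero still + text” call can look like programmatically. The endpoint URL, field names, and model name are placeholders for illustration only (they are not Runway’s actual API); check your provider’s current API reference before reusing anything here.

```python
# Minimal "image + text -> video" sketch. The endpoint, field names, and model
# name below are PLACEHOLDERS, not any provider's real API -- substitute the
# values from your provider's documentation.
import base64
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.example-video-provider.com/v1/image-to-video"  # placeholder URL

# 1. The "hero still": a frame that already shows the correct face.
with open("hero_still.png", "rb") as f:
    hero_still = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "example-model",                               # placeholder model name
    "input_image": f"data:image/png;base64,{hero_still}",   # identity comes from the image
    # 2. Text specifies action, camera, and style -- not the identity.
    "prompt": "The man walks down the street, natural gait, steady tracking shot, daylight",
    "duration_seconds": 8,                                  # short clip, per the services' limits
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # most services return a job/generation id to poll for the finished clip
```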

C) Subject-reference models (one photo → consistent subject)
These are specifically marketed for identity consistency from one reference.

MiniMax’s Hailuo S2V-01 announcement claims:

  • “Generate character-consistent videos from just one reference image.” (MiniMax)

Policy constraint you must plan around

If your face photo depicts a real person, some services restrict uploads.

OpenAI’s Sora documentation states:

  • “Uploading images with depictions of real people is blocked.”
  • “Characters are the only way to use one’s likeness in Sora” and it requires explicit permission. (OpenAI Help Center)
  • Its Characters article also emphasizes that Characters are not for real people and that depicting real people requires explicit consent via the appropriate flow. (OpenAI Help Center)

Bottom line for (1):
Yes, you can usually define the face using your photo, but you should expect to use reference images or an image-seeded workflow, not pure text-only T2V. Runway References and subject-reference models are directly built for this. (Runway Academy)


2) If you have a street photo, can image-to-video make the man walk into a library?

Yes, it is possible. The reliable version is multi-shot plus a keyframed doorway transition, not one continuous “single clip” with a long script.

Why your specific scene is hard

“Street → into a library” contains a doorway transition. Doorways are difficult because the model must change geometry, lighting, and environment semantics while keeping the person stable. The risk is warped doors, melting walls, or identity drift.

The practical limitation: clip length

Most services generate short clips well.

  • Luma’s Ray2 FAQ states Ray2 can generate videos up to 10 seconds, and you can use “Extend” to make longer videos, but there is a cap at 30 seconds and quality may drop with each extend. (Luma AI)
  • Runway Gen-4 Video also frames generation around short clips and requires an input image. (Runway)

So a “walk into a library” sequence is normally built as several clips.

The reliable production pattern

Do it as 3–5 shots, with the doorway as its own shot:

  1. Outside walking (5–10s)
    Input image = your street photo. Prompt = “begins walking forward, natural gait, stable camera.”

  2. Approach the door (5–10s)
    Use the last good frame of shot 1 as the next input image. This reduces drift.

  3. Doorway transition (3–6s) using keyframes
    Luma Keyframes is designed for exactly this:

  • Set a start frame and an end frame
  • Luma generates the transition between them (Luma AI)

So you set (see the API sketch after this shot list):

  • Start frame: man outside at the door
  • End frame: man just inside the library

  4. Inside library continuation (5–10s)
    Again, chain from the last good frame.
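
As promised, here is the keyframed doorway transition as a direct API request. The endpoint and the frame0/frame1 keyframes structure follow Luma’s publicly documented Dream Machine API as I understand it, but treat the exact URL, field names, and response shape as assumptions and verify them against the current reference.

```python
# Keyframed doorway transition sketch: start frame outside the door, end frame
# just inside the library. Endpoint and field names follow Luma's documented
# Dream Machine API as understood at the time of writing -- verify before use.
import requests

LUMA_API_KEY = "YOUR_LUMA_API_KEY"

payload = {
    "prompt": "The man opens the door and steps from the street into a quiet library",
    "keyframes": {
        "frame0": {"type": "image", "url": "https://example.com/man_outside_door.png"},
        "frame1": {"type": "image", "url": "https://example.com/man_inside_library.png"},
    },
}

resp = requests.post(
    "https://api.lumalabs.ai/dream-machine/v1/generations",
    headers={"Authorization": f"Bearer {LUMA_API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
generation = resp.json()
print(generation.get("id"), generation.get("state"))  # poll the id until the transition clip is ready
```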

What “write a script” usually means in these tools

Most SaaS tools do not execute a long screenplay-like script deterministically. You get better results by writing a shot list (a short prompt per clip) and controlling continuity with:

  • reference images
  • keyframes (start/end frames)
  • frame-chaining (use last frame as next input)
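
Frame-chaining is easy to script. Below is a minimal sketch using OpenCV (my choice of tool, an assumption; any frame-extraction utility works) that grabs the final frame of one clip so it can be fed in as the next clip’s input image or start keyframe.

```python
# Frame-chaining sketch: grab the last frame of the previous clip so it can be
# used as the input image (or start keyframe) of the next clip.
# Assumes OpenCV is installed (pip install opencv-python).
import cv2

def extract_last_frame(video_path: str, output_image_path: str) -> None:
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open {video_path}")

    last_frame = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break                # end of clip: last_frame now holds the final decoded frame
        last_frame = frame
    cap.release()

    if last_frame is None:
        raise RuntimeError(f"No frames decoded from {video_path}")
    # In practice, pick an earlier frame if the final one has artifacts.
    cv2.imwrite(output_image_path, last_frame)

# Use the last good frame of shot 1 as the input image for shot 2.
extract_last_frame("shot_01_outside_walking.mp4", "shot_02_input.png")
```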

Bottom line for (2):
Yes, it is possible. It is most reliable as multiple short I2V clips plus Keyframes for the doorway transition, rather than one long clip. (Luma AI)


3) For 10 × 1-minute videos per month, which online service is better?

For your stated production goal, the best answer is usually not one service. It is a 2-tool workflow:

  • Runway for identity-focused character shots (face consistency via references). (Runway Academy)
  • Luma for doorway / scene transitions using Keyframes. (Luma AI)

If you must pick one, choose based on what hurts more: identity drift or transition failures.

Background: why “10 × 60s” needs a production mindset

You are not buying “10 generations.” You are buying 60–120 short clip generations plus retries and stitching.

Runway explicitly supports assembling longer films via its in-app editor:

  • “How to create longer videos and films” points you to Video Editor Projects. (Runway)

Runway: best when face/character continuity is the priority

Key operational facts from Runway’s docs:

  • Turbo = 5 credits/second
  • Gen-4 = 12 credits/second
  • Gen-4 requires an input image
  • Unlimited plans can generate infinite Gen-4 videos in Explore Mode
  • Explore Mode does not use credits (Runway)

Runway’s pricing page also indicates that the Unlimited plan includes 2,250 credits monthly (for the faster Credits Mode and other tools) in addition to the “unlimited video generations” framing. (Runway)
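
To see why the unlimited/relaxed modes matter for a 10 × 60s target, here is the raw credit arithmetic using the per-second rates above (retries excluded):

```python
# Rough credit budget for 10 x 60-second finished videos, using the
# per-second rates quoted above (Turbo 5 credits/s, Gen-4 12 credits/s).
# Retries are NOT included, and real projects usually need many of them.
finished_seconds = 10 * 60             # 600 seconds of final footage per month

turbo_credits = finished_seconds * 5   # 3,000 credits
gen4_credits  = finished_seconds * 12  # 7,200 credits

monthly_credit_allowance = 2250        # Unlimited plan's fast-mode credits

print(f"Turbo:     {turbo_credits} credits")
print(f"Gen-4:     {gen4_credits} credits")
print(f"Allowance: {monthly_credit_allowance} credits")
# Both totals exceed 2,250 before a single retry, which is why Explore Mode
# (no credits) or Luma's Relaxed Mode has to carry most of the iteration load.
```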

What Runway is “best at” for your use case

  • Keeping one person consistent across multiple short clips, especially when you use References. (Runway Academy)
  • Letting you assemble the minute inside the same ecosystem (fewer moving parts). (Runway)

Luma: best when transitions and directed changes are the priority

Key operational facts from Luma’s docs:

  • Keyframes = start frame + end frame transitions. (Luma AI)
  • Unlimited plan includes 10,000 monthly credits and Unlimited use in Relaxed Mode. (Luma AI)
  • Ray2 generation up to 10 seconds, extend cap 30 seconds (quality may drop). (Luma AI)

What Luma is “best at” for your use case

  • Forcing a believable “street → library” doorway transition using Keyframes.
  • Handling heavy iteration without blowing up costs, if you rely on Relaxed Mode. (Luma AI)

Pika: useful, often cheaper per “operation,” but usually not the backbone for your exact goal

Pika’s official pricing shows “cost per video” by feature and plan credits. (Pika)
Its FAQ states generation length depends on model/feature and that “Pikaframes generations can be up to 25 seconds” (for a specific model). (Pika)

Pika is often best as:

  • B-roll
  • stylized cutaways
  • fast variants

rather than as the single backbone for “consistent identity + doorway transition + 10×60s/month.”

My concrete recommendation for your monthly target

Best overall reliability for your case

  • Runway Unlimited for identity shots, using References and Explore Mode for auditions. (Runway Academy)
  • Luma Unlimited for Keyframes transitions and high-retry transition shots. (Luma AI)
  • Stitch either in Runway’s editor or your normal NLE. (Runway)
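
If you stitch outside Runway’s editor, here is a minimal sketch of the concat step with ffmpeg (assumes ffmpeg is installed and that all clips share the same codec, resolution, and frame rate; otherwise re-encode instead of stream-copying):

```python
# Stitch the per-shot clips into one 60-second video with ffmpeg's concat demuxer.
# Assumes ffmpeg is on PATH and all clips share codec, resolution, and frame rate;
# if they differ, re-encode rather than using "-c copy".
import subprocess

clips = [
    "shot_01_outside_walking.mp4",
    "shot_02_approach_door.mp4",
    "shot_03_doorway_transition.mp4",
    "shot_04_inside_library.mp4",
]

# The concat demuxer reads the input list from a text file.
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", "clips.txt", "-c", "copy", "final_minute.mp4"],
    check=True,
)
```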

If you must choose one service

  • Choose Runway if face consistency is your biggest pain. (Runway Academy)
  • Choose Luma if transitions and cinematic control are your biggest pain. (Luma AI)

Summary

  1. Yes, you can define a face using your photo, but it is usually done via reference images or image-seeded video, not pure text-only T2V. Runway References and subject-reference models are built for this. (Runway Academy)
  2. Yes, “street photo → walk into a library” is possible, but it is most reliable as multi-shot plus Keyframes for the doorway transition. (Luma AI)
  3. For 10×60s/month, the most reliable setup is Runway for identity + Luma for transitions, using their relaxed/unlimited modes for retries and stitching clips into minutes. (Runway)

Hello,
Thank you so much for your great answer.
I have a few more questions:

1- Suppose I want to produce a movie scene from a photo of a person sitting in a classroom at a university. I can think of two methods:

Method-1: I can cut out a scene like this from a movie and then replace the face of one of the students with the photo I want.

Method-2: I upload the desired photo and then AI will visualize the classroom for me.

Right?

2- Can I run Runway and Luma AI locally using ComfyUI?


Of course, you can’t run Runway or Luma locally, but if you’re online, it seems you can use them via their APIs from ComfyUI.


1) Two methods for “a person sitting in a university classroom” from a photo

Yes. Your understanding is basically correct. The two methods are different categories of work.

Method-1: Cut an existing movie scene, then replace a student’s face

What this is:

  • Traditional VFX-style face replacement on pre-existing footage.
  • You are editing a real clip (often copyrighted), then changing one person’s identity.

What it’s good at:

  • You inherit the movie’s cinematography, lighting, acting, camera movement, set design.
  • If the base footage already matches what you want, it can look very “real” because it literally is real footage.

What makes it risky (practically and legally):

  • Copyright/IP risk: You are using movie footage you likely do not own.
  • Likeness/consent risk: You are creating a depiction that can look like a real person in a real filmed scene. Many platforms treat this as sensitive and require consent for real-person depictions.
  • Platform-policy risk: Some services restrict generating or editing depictions of real people without explicit consent. OpenAI’s “Characters” guidance is explicit that depicting a real person requires consent and prohibits certain misuse. (OpenAI Help Center)

High-level takeaway:

  • Method-1 is “edit existing copyrighted footage + identity manipulation.” It is usually the highest risk path.

Method-2: Upload the photo, then AI generates a new classroom scene around them

What this is:

  • Generative video (or image-to-video) where your photo is used as an identity reference (a conditioning input).
  • The classroom, camera, lighting, other students are generated.

What it’s good at:

  • Much safer on copyright, because you are not using an existing movie clip.
  • You can specify: “university classroom,” “late afternoon,” “rows of students,” “professor at whiteboard,” etc.
  • You can iterate shots and stitch them into a scene (your “shot factory” pipeline).
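
One lightweight way to run that “shot factory” is a plain shot list in code: one identity reference plus one short prompt per clip. The structure below is purely illustrative (it is not any provider’s schema); each entry becomes a separate reference-conditioned generation, and the clips are stitched afterwards.

```python
# Illustrative shot list for the classroom scene (Method-2).
# One identity reference photo + one short prompt per clip; each entry becomes
# a separate reference-conditioned video generation, stitched afterwards.
# This structure is an example only, not any provider's schema.
IDENTITY_REFERENCE = "person_photo.png"

shot_list = [
    {"id": "01", "prompt": "Wide shot: university lecture hall, late afternoon light, "
                           "rows of students, the referenced person seated mid-row"},
    {"id": "02", "prompt": "Medium shot: the referenced person takes notes, "
                           "professor at the whiteboard in the background"},
    {"id": "03", "prompt": "Close-up: the referenced person looks up from the notebook"},
]

for shot in shot_list:
    # Here you would call the provider's reference-conditioned video API per clip.
    print(f"Generate clip {shot['id']} with reference {IDENTITY_REFERENCE}: {shot['prompt']}")
```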

What still matters:

  • Consent still matters if the uploaded photo is of a real person. Platform rules vary, but “real-person depictions need consent” is a common requirement. (OpenAI Help Center)

  • Quality is not guaranteed. You may need retries for:

    • identity drift
    • inconsistent clothing
    • weird hands or artifacts
    • continuity between shots

High-level takeaway:

  • Method-2 is “generate an original scene with an identity reference.” It is usually the lower risk, more scalable production path.

Simple framing

  • Method-1: Edit a real film (high IP + deepfake risk).
  • Method-2: Generate original footage (lower IP risk, still consent-sensitive).

2) Can you run Runway and Luma locally using ComfyUI?

You can run ComfyUI locally and use Runway/Luma nodes inside it, but the actual Runway and Luma generation is not local inference. It runs on their cloud through APIs.

Runway + ComfyUI

  • ComfyUI has integrated Runway API nodes for video workflows. This is explicitly described as using the Runway API from inside ComfyUI. (docs.comfy.org)
  • Many Runway-ComfyUI node docs and troubleshooting notes describe it as a cloud-based service that needs internet and an API key. (comfyai.run)
  • Runway’s own developer docs are API-based and require versioned headers, again reinforcing “cloud API,” not “run locally.” (Runway API)

Luma + ComfyUI

  • Luma provides a ComfyUI custom node repo that explicitly says it lets you use the Luma AI API directly in ComfyUI. That is a cloud call. (GitHub)

The key distinction

  • Local ComfyUI = your workflow UI, graph orchestration, file handling runs on your machine.
  • Runway/Luma via ComfyUI = ComfyUI sends requests to their servers, then downloads results.

If your goal is “no external API calls, fully offline local generation,” then you need open weights video models inside ComfyUI (your OSS lane). Runway and Luma are not “run locally” models in the usual sense.
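
To make the distinction concrete, here is roughly what a Runway/Luma node in a local ComfyUI graph does under the hood: submit a job to the vendor’s cloud, poll it, download the result. The endpoint and response fields below mirror Luma’s public Dream Machine API as an example and should be treated as assumptions; nothing in this flow runs on your own GPU.

```python
# What a Runway/Luma node in local ComfyUI effectively does: an HTTPS job
# submission to the vendor's cloud, polling, then a download of the finished
# clip. No local inference. Endpoint and response fields follow Luma's public
# Dream Machine API as an example -- treat them as assumptions and verify.
import os
import time
import requests

API = "https://api.lumalabs.ai/dream-machine/v1/generations"
HEADERS = {"Authorization": f"Bearer {os.environ['LUMA_API_KEY']}"}

# 1. Submit the job to the cloud (network required).
job = requests.post(
    API, headers=HEADERS,
    json={"prompt": "A man walks from a street into a library"},
    timeout=60,
).json()

# 2. Poll the cloud until the remote render finishes.
while True:
    status = requests.get(f"{API}/{job['id']}", headers=HEADERS, timeout=60).json()
    if status.get("state") in ("completed", "failed"):
        break
    time.sleep(10)

# 3. Download the finished clip -- it was rendered on their servers, not yours.
if status.get("state") == "completed":
    video_url = status["assets"]["video"]   # field name assumed from Luma's docs
    with open("clip.mp4", "wb") as f:
        f.write(requests.get(video_url, timeout=120).content)
```

If this machine goes offline, the node fails, which is the whole point: ComfyUI is local orchestration, while the generation itself remains a cloud service.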


Practical recommendation for your classroom use case

  • If you want the safest, most repeatable pipeline: prefer Method-2 (generate original classroom shots using the person photo as a reference), then stitch multiple short clips.
  • Use Method-1 only if you have clear rights to the base footage and clear consent for the identity manipulation. Otherwise it is the path most likely to get blocked by policy or create legal risk.

Sources worth opening first

  • Runway video generation nodes in ComfyUI (Runway API integration). (docs.comfy.org)
  • Luma’s ComfyUI node repo (Luma API from ComfyUI). (GitHub)
  • Runway developer API reference (versioned API headers). (Runway API)
  • Real-person consent requirement in OpenAI “Characters” guidance (policy example). (OpenAI Help Center)

Summary

  • Your two methods are correctly understood: face replacement on existing footage vs generate original scene using a photo as identity reference.
  • Method-1 is usually the higher-risk route (copyright + consent + platform policy).
  • You can use Runway and Luma from local ComfyUI, but you cannot run them as local inference; they are API-driven cloud services.