Rilo’s VideoGenerationTool uses Google’s Veo 3.1 and Kling AI models to generate and edit videos from text prompts, images, and existing videos. Perfect for creating marketing videos, product demos, or editing existing content.

Overview

VideoGenerationTool supports two main workflows:
  • Video Generation: Create new videos from text/images using Veo 3.1
  • Video Editing: Edit existing videos using Kling AI (face swap, style transfer, motion control)
Video generation uses Google’s Veo 3.1 models via Vertex AI. Video editing uses Kling AI models. Videos are generated in high quality and saved to your workflow output.
Video generation requires a Pro plan or higher. Upgrade to Pro to unlock this feature.

Video Generation (generate_video)

Create new videos from text descriptions and reference images using Veo 3.1.

Models

Veo 3.1 (Quality)

  • Aspect Ratios: 16:9 (landscape), 9:16 (portrait)
  • Resolutions: 720p, 1080p
  • Durations: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64 seconds
  • Audio: Native audio generation supported
  • Image-to-Video: Supported
  • Extension: Auto-extension for videos >8s (up to 64s)
  • Use Case: High-quality video generation

Veo 3.1 Fast

  • Aspect Ratios: 16:9 (landscape), 9:16 (portrait)
  • Resolutions: 720p, 1080p
  • Durations: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64 seconds
  • Audio: Native audio generation supported
  • Image-to-Video: Supported
  • Extension: Auto-extension for videos >8s (up to 64s)
  • Use Case: Speed-optimized video generation

Configuration

Video generation uses video_generation_config with these fields:

Prompt Guide (HOW the video is made) - All Optional

  • cinematography: Camera movement, shot types, angles
  • lighting: Light sources, quality, direction
  • sound_style: Ambient sounds, music description
  • visual_style: Aesthetic, mood, style references

Technical Settings

  • model: Required - "veo-3.1-generate-preview" or "veo-3.1-fast-generate-preview"
  • aspect_ratio: Required - "16:9" or "9:16"
  • resolution: Required - "720p" or "1080p" (1080p limited to 8s max)
  • duration_seconds: Required - One of: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64
  • enable_audio: Required - true/false for native audio generation
  • negative_prompt: Optional - Elements to exclude from generation
  • first_frame: Optional - Path to image for first frame (interpolation)
  • last_frame: Optional - Path to image for last frame (requires first_frame)
  • reference_images: Optional - List of reference images (max 3) with type designation

Input Structure

Scenes are passed separately from config via input parameter:
  • visual (required): Description of what is seen
  • dialogue (optional): Spoken words for that scene
  • sound_effects (optional): Per-scene sound effects
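To make the input shape concrete, here is a minimal sketch of a multi-scene input. The real classes live in library.video_generation_tool; the stand-in dataclasses below only mirror the documented fields for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the documented scene fields (the real
# classes are imported from library.video_generation_tool).
@dataclass
class VideoGenerationScene:
    visual: str                          # required: what is seen
    dialogue: Optional[str] = None       # optional: spoken words
    sound_effects: Optional[str] = None  # optional: per-scene sounds

@dataclass
class VideoGenerationInput:
    scenes: list

# Two scenes, each describing a segment of the final video
video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(visual="A drone shot over a coastline at dawn"),
        VideoGenerationScene(
            visual="Close-up of footprints in wet sand",
            dialogue="Every journey starts with a single step",
        ),
    ]
)
```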

Reference Images

  • path: Path to the image file
  • reference_type: "asset" (object/character/product) or "style" (aesthetics)
  • Portrait mode (9:16) does not support reference_images
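The constraints above can be checked before submitting a job. The helper below is a hypothetical pre-flight sketch (not part of the library) that validates the documented rules: durations from the allowed set, the 8-second cap at 1080p, and no reference images in portrait mode.

```python
# Hypothetical pre-flight check for the documented generation constraints.
VALID_DURATIONS = {4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64}

def validate_config(resolution, aspect_ratio, duration_seconds,
                    reference_images=None):
    """Return a list of constraint violations (empty if the config is valid)."""
    errors = []
    if duration_seconds not in VALID_DURATIONS:
        errors.append(f"duration_seconds must be one of {sorted(VALID_DURATIONS)}")
    if resolution == "1080p" and duration_seconds > 8:
        errors.append("1080p is limited to 8 seconds")
    if aspect_ratio == "9:16" and reference_images:
        errors.append("portrait mode does not support reference_images")
    return errors
```

Running such a check locally avoids spending credits on a request that the service would reject.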

Example: Text-to-Video

from library.video_generation_tool import (
    VideoGenerationTool,
    VideoGenerationInput,
    VideoGenerationConfig,
    VideoGenerationScene
)

video_tool = VideoGenerationTool()

# Build input from scenes (WHAT the video shows)
video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(
            visual="A sunset over mountains with a lake in the foreground",
            dialogue="Welcome to our beautiful destination",
            sound_effects="Gentle water lapping, birds chirping"
        )
    ]
)

# Build config (HOW the video is made)
video_config = VideoGenerationConfig(
    model="veo-3.1-generate-preview",
    aspect_ratio="16:9",
    resolution="720p",
    duration_seconds=8,
    enable_audio=True,
    cinematography="Slow panning shot, wide angle",
    lighting="Golden hour, warm tones",
    visual_style="Cinematic, professional"
)

result = await video_tool.generate_video(input=video_input, config=video_config)

Example: Image-to-Video with Reference

from library.video_generation_tool import ReferenceImage

# Use product image from previous block
product_image_path = inputs["product_photo_block"]["image_path"]

video_config = VideoGenerationConfig(
    model="veo-3.1-generate-preview",
    aspect_ratio="16:9",
    resolution="720p",
    duration_seconds=8,
    enable_audio=True,
    reference_images=[
        ReferenceImage(path=product_image_path, reference_type="asset")
    ]
)

video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(
            visual="The product rotating on a marble surface"
        )
    ]
)

result = await video_tool.generate_video(input=video_input, config=video_config)

Video Editing (edit_video)

Edit existing videos or generate with character/motion control using Kling AI.

Models

Kling VIDEO O1

  • Workflow: Video editing (face swap, style transfer)
  • Input Video: Required (3-10 seconds, max 32MB)
  • Reference Images: Up to 4 images
  • Aspect Ratios: 16:9, 1:1, 9:16
  • Use Case: Edit existing videos with face swap or style transfer

Kling VIDEO 2.6 Pro

  • Workflow: Motion-control (follow motion from reference video)
  • Reference Video: Required (3-30s, max 100MB)
  • Reference Images: Required (1-7 images) - character/subject for video
  • Output Duration: 5 or 10 seconds
  • Aspect Ratios: 16:9, 1:1, 9:16
  • Voice Cloning: Supported (5-30s audio files)
  • Use Case: Generate video following motion from reference video

Configuration

Video editing uses video_editing_config with these fields:

Common Settings

  • model: Required - "kling-video-o1" or "kling-video-2.6-pro"
  • aspect_ratio: Required - "16:9", "1:1", or "9:16"
  • keep_original_sound: Optional - Preserve audio from input (default: true)

Video Inputs

  • input_video: Required for O1 - Source video for editing (3-10s, max 32MB)
  • reference_video: Required for 2.6 Pro - Motion reference (3-30s, max 100MB)
  • reference_images: Character/object references - O1: up to 4 (optional); 2.6 Pro: 1-7 (required)

2.6 Pro-Specific Settings

  • cfg_scale: Default 0.5 - Prompt adherence (0-1)
  • duration_seconds: Default 5 - Output duration (5 or 10)
  • sound: Default true - Enable audio generation
  • negative_prompt: Optional - Elements to exclude (2-2500 chars)
  • character_orientation: Optional - "image" or "video" - prioritize ref images vs ref video
  • reference_voices: Optional - Voice cloning audio files

Image Reference Syntax

Prompts can reference images using placeholders. Reference images are indexed by their position in the reference_images array, starting from 0, and can be referenced in your prompts by that index.

Example: Face Swap (O1)

from library.video_generation_tool import VideoGenerationTool, VideoEditingConfig

video_tool = VideoGenerationTool()

# Get input video and face reference
input_video = video_editing_config.get("input_video")
reference_images = [video_editing_config.get("reference_images")[0]]

config = VideoEditingConfig(
    model="kling-video-o1",
    aspect_ratio="16:9",
    keep_original_sound=True,
    input_video=input_video,
    reference_images=reference_images
)

result = await video_tool.edit_video(
    prompt="Replace the person's face with the reference image, keeping motion natural",
    config=config
)

Example: Motion Control (2.6 Pro)

reference_video = video_editing_config.get("reference_video")
reference_images = video_editing_config.get("reference_images")  # Required, 1-7 images

config = VideoEditingConfig(
    model="kling-video-2.6-pro",
    aspect_ratio="16:9",
    reference_video=reference_video,
    reference_images=reference_images,
    cfg_scale=0.5,
    duration_seconds=5,
    sound=True
)

result = await video_tool.edit_video(
    prompt="A dancer performing the same moves as the reference",
    config=config
)

Credit Costs

Video generation and editing are credit-intensive operations:

Video Generation (Veo 3.1)

  • Starting cost: 1000 credits per generation
  • Cost may vary by duration and model in the tool configuration

Video Editing (Kling AI)

  • Starting cost: 600 credits per edit
  • Cost may vary by duration and model in the tool configuration
Video generation and editing are credit-intensive. Ensure you have sufficient credits (1000+ for generation, 600+ for editing) before running video workflows.
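A small helper can sanity-check your balance before launching a workflow. This is an illustrative sketch using the starting costs listed above; actual costs may vary by duration and model, so treat the result as a lower bound.

```python
# Starting costs from the table above; actual cost may vary by
# duration and model, so these are lower-bound estimates.
GENERATION_COST = 1000  # credits per Veo 3.1 generation
EDIT_COST = 600         # credits per Kling AI edit

def estimated_cost(n_generations: int, n_edits: int) -> int:
    """Minimum credits needed for a workflow."""
    return n_generations * GENERATION_COST + n_edits * EDIT_COST

def has_enough_credits(balance: int, n_generations: int, n_edits: int) -> bool:
    return balance >= estimated_cost(n_generations, n_edits)
```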

Performance

  • Video Generation: 11 seconds to 6 minutes per segment
  • Video Editing: 1-5 minutes per operation
  • Videos >8s: Use auto-extension for generation
  • 1080p resolution: Limited to 8s max for generation

Limitations

Video Generation

  • Portrait mode (9:16) does not support reference_images
  • 1080p resolution limited to 8s max
  • Maximum duration: 64 seconds (with extension)

Video Editing

  • O1 requires input_video (3-10s, max 32MB)
  • 2.6 Pro requires reference_video (3-30s, max 100MB)
  • Prompt must be 2-2500 characters
  • Reference images required for 2.6 Pro (1-7 images)

Best Practices

When generating videos with products, use reference_images with reference_type="asset" to ensure the product appears correctly.
Don’t hardcode specific values from previous blocks. Use generic descriptions and reference images for specific objects.
Start with shorter videos (4-8s) for faster generation and lower costs. Extend if needed.
For multiple videos, consider generating them in parallel workflows to save time.
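Since generate_video is awaitable, one way to run multiple generations concurrently is asyncio.gather. The helper below is a hypothetical sketch (generate_all is not part of the library); it assumes each job is a (video_input, video_config) pair.

```python
import asyncio

async def generate_all(video_tool, jobs):
    """Run several generate_video calls concurrently.

    jobs: list of (video_input, video_config) pairs.
    Returns results in the same order as the input jobs.
    """
    tasks = [
        video_tool.generate_video(input=video_input, config=video_config)
        for video_input, video_config in jobs
    ]
    return await asyncio.gather(*tasks)
```

Note that each concurrent generation still consumes credits independently, so check your balance against the total before launching.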

Use Cases

Marketing Videos

Create product demos, promotional videos, and social media content.

Content Creation

Generate video content for blogs, tutorials, and presentations.

Video Editing

Face swap, style transfer, and motion replication for existing videos.

Character Consistency

Generate videos with consistent characters using reference images.

Video generation is a powerful but credit-intensive feature. Plan your video workflows carefully to manage credit consumption effectively.