Rilo’s VideoGenerationTool uses Google’s Veo 3.1 and Kling AI models to generate and edit videos from text prompts, images, and existing videos. Perfect for creating marketing videos, product demos, or editing existing content.

Overview

VideoGenerationTool supports two main workflows:
  • Video Generation: Create new videos from text/images using Veo 3.1
  • Video Editing: Edit existing videos using Kling AI (face swap, style transfer, motion control)
Video generation uses Google’s Veo 3.1 models via Vertex AI. Video editing uses Kling AI models. Videos are generated in high quality and saved to your workflow output.
Video generation requires a Pro plan or higher. Upgrade to Pro to unlock this feature.

Video Generation (generate_video)

Create new videos from text descriptions and reference images using Veo 3.1.

Models

Veo 3.1 (Quality)

  • Aspect Ratios: 16:9 (landscape), 9:16 (portrait)
  • Resolutions: 720p, 1080p
  • Durations: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64 seconds
  • Audio: Native audio generation supported
  • Image-to-Video: Supported
  • Extension: Auto-extension for videos >8s (up to 64s)
  • Use Case: High-quality video generation

Veo 3.1 Fast

  • Aspect Ratios: 16:9 (landscape), 9:16 (portrait)
  • Resolutions: 720p, 1080p
  • Durations: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64 seconds
  • Audio: Native audio generation supported
  • Image-to-Video: Supported
  • Extension: Auto-extension for videos >8s (up to 64s)
  • Use Case: Speed-optimized video generation

Configuration

Video generation uses video_generation_config with these fields:

Prompt Guide (HOW the video is made) - All Optional

  • cinematography: Camera movement, shot types, angles
  • lighting: Light sources, quality, direction
  • sound_style: Ambient sounds, music description
  • visual_style: Aesthetic, mood, style references

Technical Settings

  • model: Required - "veo-3.1-generate-preview" or "veo-3.1-fast-generate-preview"
  • aspect_ratio: Required - "16:9" or "9:16"
  • resolution: Required - "720p" or "1080p" (1080p limited to 8s max)
  • duration_seconds: Required - One of: 4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64
  • enable_audio: Required - true/false for native audio generation
  • negative_prompt: Optional - Elements to exclude from generation
  • first_frame: Optional - Path to image for first frame (interpolation)
  • last_frame: Optional - Path to image for last frame (requires first_frame)
  • reference_images: Optional - List of reference images (max 3) with type designation

Input Structure

Scenes are passed separately from config via input parameter:
  • visual (required): Description of what is seen
  • dialogue (optional): Spoken words for that scene
  • sound_effects (optional): Per-scene sound effects
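To make the input shape concrete, here is a minimal sketch of a multi-scene input. The real classes live in library.video_generation_tool; the stand-in dataclasses below only mirror the documented fields for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the documented scene fields (the real
# classes are imported from library.video_generation_tool).
@dataclass
class VideoGenerationScene:
    visual: str                          # required: what is seen
    dialogue: Optional[str] = None       # optional: spoken words
    sound_effects: Optional[str] = None  # optional: per-scene sounds

@dataclass
class VideoGenerationInput:
    scenes: list

# Two scenes, each describing a segment of the final video
video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(visual="A drone shot over a coastline at dawn"),
        VideoGenerationScene(
            visual="Close-up of footprints in wet sand",
            dialogue="Every journey starts with a single step",
        ),
    ]
)
```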

Reference Images

  • path: Path to the image file
  • reference_type: "asset" (object/character/product) or "style" (aesthetics)
  • Portrait mode (9:16) does not support reference_images
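The constraints above can be checked before submitting a job. The helper below is a hypothetical pre-flight sketch (not part of the library) that validates the documented rules: durations from the allowed set, the 8-second cap at 1080p, and no reference images in portrait mode.

```python
# Hypothetical pre-flight check for the documented generation constraints.
VALID_DURATIONS = {4, 6, 8, 15, 22, 29, 36, 43, 50, 57, 64}

def validate_config(resolution, aspect_ratio, duration_seconds,
                    reference_images=None):
    """Return a list of constraint violations (empty if the config is valid)."""
    errors = []
    if duration_seconds not in VALID_DURATIONS:
        errors.append(f"duration_seconds must be one of {sorted(VALID_DURATIONS)}")
    if resolution == "1080p" and duration_seconds > 8:
        errors.append("1080p is limited to 8 seconds")
    if aspect_ratio == "9:16" and reference_images:
        errors.append("portrait mode does not support reference_images")
    return errors
```

Running such a check locally avoids spending credits on a request that the service would reject.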

Example: Text-to-Video

from library.video_generation_tool import (
    VideoGenerationTool,
    VideoGenerationInput,
    VideoGenerationConfig,
    VideoGenerationScene
)

video_tool = VideoGenerationTool()

# Build input from scenes (WHAT the video shows)
video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(
            visual="A sunset over mountains with a lake in the foreground",
            dialogue="Welcome to our beautiful destination",
            sound_effects="Gentle water lapping, birds chirping"
        )
    ]
)

# Build config (HOW the video is made)
video_config = VideoGenerationConfig(
    model="veo-3.1-generate-preview",
    aspect_ratio="16:9",
    resolution="720p",
    duration_seconds=8,
    enable_audio=True,
    cinematography="Slow panning shot, wide angle",
    lighting="Golden hour, warm tones",
    visual_style="Cinematic, professional"
)

result = await video_tool.generate_video(input=video_input, config=video_config)

Example: Image-to-Video with Reference

from library.video_generation_tool import ReferenceImage

# Use product image from previous block
product_image_path = inputs["product_photo_block"]["image_path"]

video_config = VideoGenerationConfig(
    model="veo-3.1-generate-preview",
    aspect_ratio="16:9",
    resolution="720p",
    duration_seconds=8,
    enable_audio=True,
    reference_images=[
        ReferenceImage(path=product_image_path, reference_type="asset")
    ]
)

video_input = VideoGenerationInput(
    scenes=[
        VideoGenerationScene(
            visual="The product rotating on a marble surface"
        )
    ]
)

result = await video_tool.generate_video(input=video_input, config=video_config)

Video Editing (edit_video)

Edit existing videos or generate with character/motion control using Kling AI.

Models

Kling VIDEO O1

  • Workflow: Video editing (face swap, style transfer)
  • Input Video: Required (3-10 seconds, max 32MB)
  • Reference Images: Up to 4 images
  • Aspect Ratios: 16:9, 1:1, 9:16
  • Use Case: Edit existing videos with face swap or style transfer

Kling VIDEO 2.6 Pro

  • Workflow: Motion-control (follow motion from reference video)
  • Reference Video: Required (3-30s, max 100MB)
  • Reference Images: Required (1-7 images) - character/subject for video
  • Output Duration: 5 or 10 seconds
  • Aspect Ratios: 16:9, 1:1, 9:16
  • Voice Cloning: Supported (5-30s audio files)
  • Use Case: Generate video following motion from reference video

Configuration

Video editing uses video_editing_config with these fields:

Common Settings

  • model: Required - "kling-video-o1" or "kling-video-2.6-pro"
  • aspect_ratio: Required - "16:9", "1:1", or "9:16"
  • keep_original_sound: Optional - Preserve audio from input (default: true)

Video Inputs

  • input_video: Required for O1 - Source video for editing (3-10s, max 32MB)
  • reference_video: Required for 2.6 Pro - Motion reference (3-30s, max 100MB)
  • reference_images: Character/object references - O1: up to 4 (optional); 2.6 Pro: 1-7 (required)

2.6 Pro-Specific Settings

  • cfg_scale: Default 0.5 - Prompt adherence (0-1)
  • duration_seconds: Default 5 - Output duration (5 or 10)
  • sound: Default true - Enable audio generation
  • negative_prompt: Optional - Elements to exclude (2-2500 chars)
  • character_orientation: Optional - "image" or "video" - prioritize ref images vs ref video
  • reference_voices: Optional - Voice cloning audio files

Image Reference Syntax

Prompts can reference images using placeholders. Reference images are indexed by their position in the reference_images array, starting from 0, and can be referenced in your prompts by that index.

Example: Face Swap (O1)

from library.video_generation_tool import VideoGenerationTool, VideoEditingConfig

video_tool = VideoGenerationTool()

# Get input video and face reference
input_video = video_editing_config.get("input_video")
reference_images = [video_editing_config.get("reference_images")[0]]

config = VideoEditingConfig(
    model="kling-video-o1",
    aspect_ratio="16:9",
    keep_original_sound=True,
    input_video=input_video,
    reference_images=reference_images
)

result = await video_tool.edit_video(
    prompt="Replace the person's face with the reference image, keeping motion natural",
    config=config
)

Example: Motion Control (2.6 Pro)

reference_video = video_editing_config.get("reference_video")
reference_images = video_editing_config.get("reference_images")  # Required, 1-7 images

config = VideoEditingConfig(
    model="kling-video-2.6-pro",
    aspect_ratio="16:9",
    reference_video=reference_video,
    reference_images=reference_images,
    cfg_scale=0.5,
    duration_seconds=5,
    sound=True
)

result = await video_tool.edit_video(
    prompt="A dancer performing the same moves as the reference",
    config=config
)

Credit Costs

Video generation and editing are credit-intensive operations:

Video Generation (Veo 3.1)

  • Starting cost: 1000 credits per generation
  • Cost may vary by duration and model in the tool configuration

Video Editing (Kling AI)

  • Starting cost: 600 credits per edit
  • Cost may vary by duration and model in the tool configuration
Video generation and editing are credit-intensive. Ensure you have sufficient credits (1000+ for generation, 600+ for editing) before running video workflows.
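A small helper can sanity-check your balance before launching a workflow. This is an illustrative sketch using the starting costs listed above; actual costs may vary by duration and model, so treat the result as a lower bound.

```python
# Starting costs from the table above; actual cost may vary by
# duration and model, so these are lower-bound estimates.
GENERATION_COST = 1000  # credits per Veo 3.1 generation
EDIT_COST = 600         # credits per Kling AI edit

def estimated_cost(n_generations: int, n_edits: int) -> int:
    """Minimum credits needed for a workflow."""
    return n_generations * GENERATION_COST + n_edits * EDIT_COST

def has_enough_credits(balance: int, n_generations: int, n_edits: int) -> bool:
    return balance >= estimated_cost(n_generations, n_edits)
```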

Performance

  • Video Generation: 11 seconds to 6 minutes per segment
  • Video Editing: 1-5 minutes per operation
  • Videos >8s: Use auto-extension for generation
  • 1080p resolution: Limited to 8s max for generation

Limitations

Video Generation

  • Portrait mode (9:16) does not support reference_images
  • 1080p resolution limited to 8s max
  • Maximum duration: 64 seconds (with extension)

Video Editing

  • O1 requires input_video (3-10s, max 32MB)
  • 2.6 Pro requires reference_video (3-30s, max 100MB)
  • Prompt must be 2-2500 characters
  • Reference images required for 2.6 Pro (1-7 images)

Best Practices

When generating videos with products, use reference_images with reference_type="asset" to ensure the product appears correctly.
Don’t hardcode specific values from previous blocks. Use generic descriptions and reference images for specific objects.
Start with shorter videos (4-8s) for faster generation and lower costs. Extend if needed.
For multiple videos, consider generating them in parallel workflows to save time.
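Since generate_video is awaitable, one way to run multiple generations concurrently is asyncio.gather. The helper below is a hypothetical sketch (generate_all is not part of the library); it assumes each job is a (video_input, video_config) pair.

```python
import asyncio

async def generate_all(video_tool, jobs):
    """Run several generate_video calls concurrently.

    jobs: list of (video_input, video_config) pairs.
    Returns results in the same order as the input jobs.
    """
    tasks = [
        video_tool.generate_video(input=video_input, config=video_config)
        for video_input, video_config in jobs
    ]
    return await asyncio.gather(*tasks)
```

Note that each concurrent generation still consumes credits independently, so check your balance against the total before launching.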

Use Cases

Marketing Videos

Create product demos, promotional videos, and social media content.

Content Creation

Generate video content for blogs, tutorials, and presentations.

Video Editing

Face swap, style transfer, and motion replication for existing videos.

Character Consistency

Generate videos with consistent characters using reference images.

Video generation is a powerful but credit-intensive feature. Plan your video workflows carefully to manage credit consumption effectively.