Rilo’s VisionTool uses Google’s Gemini models to analyze images and videos, extract information, and answer questions about visual content. Perfect for content moderation, OCR, object detection, and video summarization.

Overview

VisionTool supports:
  • Image Analysis: Describe, classify, extract text (OCR), detect objects, answer questions
  • Video Analysis: Summarize, transcribe, identify key moments, answer questions about content
  • Structured Output: Extract specific data using Pydantic schemas
  • Multiple Sources: Local files, URLs, YouTube URLs, base64 strings
Vision analysis uses Google’s Gemini models via Vertex AI and supports both images and videos.

Models

Gemini 3 Flash Preview

  • Image Support: Yes
  • Video Support: Yes
  • YouTube Support: Yes
  • Use Case: Fast analysis with good quality

Gemini 3 Pro Preview

  • Image Support: Yes
  • Video Support: Yes
  • YouTube Support: Yes
  • Use Case: High-quality analysis for complex tasks

Image Analysis

Analyze images from files, URLs, or base64 strings.

Basic Image Analysis

from library.vision_tool import VisionTool, ImageAnalysisConfig

vision_tool = VisionTool()

config = ImageAnalysisConfig(
    model="gemini-3-flash-preview",
    analysis_prompt="Describe this image in detail"
)

# From local file
result = await vision_tool.analyze_image(
    image_source="path/to/image.jpg",
    config=config
)

# From URL
result = await vision_tool.analyze_image(
    image_source="https://example.com/image.jpg",
    config=config
)

# From base64
result = await vision_tool.analyze_image(
    image_source="data:image/jpeg;base64,/9j/4AAQ...",
    config=config
)
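
The base64 form above is a standard data: URI. It can be produced from a local file with a small helper; the function below is illustrative and not part of VisionTool’s API.

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data: URI.

    Illustrative helper, not part of VisionTool; the exact accepted
    MIME types are an assumption."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can then be passed as image_source in the base64 example above.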

Structured Output (Pydantic Schema)

Extract specific data from images using structured schemas:

from pydantic import BaseModel
from typing import List, Optional
from library.vision_tool import VisionTool, ImageAnalysisConfig

class ProductInfo(BaseModel):
    name: str
    price: Optional[float]
    description: str
    tags: List[str]

config = ImageAnalysisConfig(
    model="gemini-3-flash-preview",
    analysis_prompt="Extract product information from this image",
    output_type=ProductInfo
)

result = await vision_tool.analyze_image(image_source, config)
# result.analysis is a ProductInfo instance
product = result.analysis
print(f"Product: {product.name}, Price: {product.price}")
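
When output_type is set, the model’s raw response is validated against the schema. The validation step can be seen in isolation with plain Pydantic, independent of VisionTool; the sample data below is made up for illustration.

```python
from typing import List, Optional
from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: Optional[float] = None
    description: str
    tags: List[str]

# Simulated raw extraction from the model (illustrative data only)
raw = {
    "name": "Trail Runner 2",
    "price": "89.99",          # numeric strings are coerced to float
    "description": "Lightweight running shoe",
    "tags": ["footwear", "running"],
}

product = ProductInfo(**raw)
print(product.name, product.price)
```

Validation errors raised here are the same kind you would debug when a schema does not match what the model extracted.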

Use Cases for Image Analysis

  • OCR: Extract text from images, documents, screenshots
  • Object Detection: Identify objects, people, products in images
  • Content Moderation: Detect inappropriate content
  • Product Information: Extract product details from photos
  • Document Analysis: Extract data from forms, receipts, invoices
  • Image Classification: Categorize images by content

Video Analysis

Analyze videos from local files, URLs, or YouTube URLs.

Basic Video Analysis

from library.vision_tool import VisionTool, VideoAnalysisConfig

vision_tool = VisionTool()

config = VideoAnalysisConfig(
    model="gemini-3-flash-preview",
    analysis_prompt="Summarize the key points discussed in this video"
)

# From local file
result = await vision_tool.analyze_video(
    video_source="path/to/video.mp4",
    config=config
)

# From URL
result = await vision_tool.analyze_video(
    video_source="https://example.com/video.mp4",
    config=config
)

# From YouTube URL
result = await vision_tool.analyze_video(
    video_source="https://youtube.com/watch?v=...",
    config=config
)
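
The three calls above differ only in how the source string is interpreted. A sketch of that routing logic (an illustrative helper, not part of VisionTool; the hostname list is an assumption):

```python
from urllib.parse import urlparse

def classify_video_source(source: str) -> str:
    """Classify a video source string as "youtube", "url", or "local_file".

    Illustrative helper mirroring the three cases above; not part of
    VisionTool's API."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower().removeprefix("www.")
        if host in ("youtube.com", "youtu.be"):
            return "youtube"
        return "url"
    return "local_file"
```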

Analyzing Video Segments

Analyze specific time segments of a video:

config = VideoAnalysisConfig(
    model="gemini-3-flash-preview",
    analysis_prompt="Summarize the key points discussed",
    start_offset="1m30s",  # Start at 1 minute 30 seconds
    end_offset="5m",       # End at 5 minutes
    fps=1                  # Analyze 1 frame per second
)

result = await vision_tool.analyze_video(video_source, config)
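
The offset strings combine minute and second components ("30s", "5m", "1m30s"). A small parser for that format, inferred from the examples (VisionTool may accept additional forms, such as hours):

```python
import re

def parse_offset(offset: str) -> int:
    """Convert an offset like "30s", "5m", or "1m30s" to total seconds.

    Format inferred from the documented examples; illustrative only."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", offset)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized offset: {offset!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds
```

This makes it easy to sanity-check that end_offset is later than start_offset before submitting an analysis.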

Use Cases for Video Analysis

  • Video Summarization: Generate summaries of video content
  • Transcription: Extract spoken text from videos
  • Key Moment Identification: Find important scenes or events
  • Content Analysis: Understand video themes and topics
  • YouTube Analysis: Analyze YouTube videos directly from URLs
  • Meeting Analysis: Summarize meeting recordings

Configuration

ImageAnalysisConfig

  • model: Required - “gemini-3-flash-preview” or “gemini-3-pro-preview”
  • analysis_prompt: Required - Detailed prompt for what to analyze
  • output_type: Optional - Pydantic schema for structured output

VideoAnalysisConfig

  • model: Required - “gemini-3-flash-preview” or “gemini-3-pro-preview”
  • analysis_prompt: Required - Detailed prompt for what to analyze
  • start_offset: Optional - Start time (e.g., “30s”, “1m30s”)
  • end_offset: Optional - End time (e.g., “2m”, “120s”)
  • fps: Optional - Frames per second to analyze (default: 1)
  • output_type: Optional - Pydantic schema for structured output

Using Images/Videos from Previous Blocks

Get visual content from previous workflow blocks:

# Get image path from previous block output
image_path = inputs["previous_block"]["image_path"]

config = ImageAnalysisConfig(**image_analysis_config)
result = await vision_tool.analyze_image(image_path, config)

# Get video path from previous block output
video_path = inputs["video_block"]["video_path"]

config = VideoAnalysisConfig(**video_analysis_config)
result = await vision_tool.analyze_video(video_path, config)

Output Structure

ImageAnalysisResult

  • analysis: The analysis text, or the structured output when output_type is set
  • model: Model used for analysis
  • tokens_used: Number of tokens consumed

VideoAnalysisResult

  • analysis: The analysis text, or the structured output when output_type is set
  • model: Model used for analysis
  • duration_analyzed: Duration of video analyzed
  • tokens_used: Number of tokens consumed

Credit Costs

Vision analysis consumes credits based on model and content complexity:
  • Flash model: ~1-3 credits per analysis
  • Pro model: ~2-5 credits per analysis
  • Video analysis: Additional credits based on duration and fps
Credit costs vary based on image/video size, analysis complexity, and model used. Simple analyses cost fewer credits than complex structured extractions.

Limitations

  • Direct file upload: Files must be under 20MB (use a URL for larger files)
  • Video analysis: Limited to ~1 hour via URL
  • Social media content: Images/videos downloaded first (may take extra time)
  • Base64 input: Only supported for images, not videos
  • Video segment analysis: Requires valid start/end offsets
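
The 20MB limit suggests a simple pre-flight check before choosing between direct upload and a URL. A minimal sketch, assuming the limit applies to the file’s on-disk size (the exact limit semantics are an assumption):

```python
import os

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # 20MB direct-upload limit from the docs

def upload_strategy(path: str) -> str:
    """Suggest "direct" upload for files under the 20MB limit, "url" otherwise.

    Illustrative helper; not part of VisionTool's API."""
    size = os.path.getsize(path)
    return "direct" if size < MAX_UPLOAD_BYTES else "url"
```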

Best Practices

When extracting specific data, use Pydantic schemas for reliable structured output.
Use start_offset and end_offset to analyze only relevant segments, reducing processing time and costs.
Use Flash for speed, Pro for complex analysis requiring higher accuracy.
For files larger than 20MB, use URLs instead of direct uploads.

Use Cases

OCR & Text Extraction

Extract text from images, documents, and screenshots.

Content Moderation

Detect inappropriate content in images and videos.

Product Analysis

Extract product information from photos.

Video Summarization

Generate summaries and transcripts from videos.

Object Detection

Identify objects, people, and products in images.

YouTube Analysis

Analyze YouTube videos directly from URLs.

Vision analysis is a powerful tool for extracting information from visual content. Use structured output schemas for reliable data extraction.