Overview
VisionTool supports:
- Image Analysis: Describe, classify, extract text (OCR), detect objects, answer questions
- Video Analysis: Summarize, transcribe, identify key moments, answer questions about content
- Structured Output: Extract specific data using Pydantic schemas
- Multiple Sources: Local files, URLs, YouTube URLs, base64 strings
Vision analysis uses Google’s Gemini models via Vertex AI and supports both images and videos.
Models
Gemini 3 Flash Preview
- Image Support: Yes
- Video Support: Yes
- YouTube Support: Yes
- Use Case: Fast analysis with good quality
Gemini 3 Pro Preview
- Image Support: Yes
- Video Support: Yes
- YouTube Support: Yes
- Use Case: High-quality analysis for complex tasks
Image Analysis
Analyze images from files, URLs, or base64 strings.
Basic Image Analysis
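A base64 string is one of the accepted image sources. The following is a minimal sketch of producing one with Python's standard library. The function names are illustrative only, and how the resulting string is passed to VisionTool is not shown because that call is not covered in this section.

```python
import base64
from pathlib import Path


def encode_bytes_base64(data: bytes) -> str:
    """Encode in-memory image bytes as a base64 ASCII string."""
    return base64.b64encode(data).decode("ascii")


def encode_image_base64(path: str) -> str:
    """Read an image file from disk and return its contents as base64 text."""
    return encode_bytes_base64(Path(path).read_bytes())
```

Base64 text is roughly a third larger than the raw bytes, so for large images a file path or URL is likely the more practical source.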
Structured Output (Pydantic Schema)
Extract specific data from images using structured schemas.
Use Cases for Image Analysis
- OCR: Extract text from images, documents, screenshots
- Object Detection: Identify objects, people, products in images
- Content Moderation: Detect inappropriate content
- Product Information: Extract product details from photos
- Document Analysis: Extract data from forms, receipts, invoices
- Image Classification: Categorize images by content
Video Analysis
Analyze videos from local files, URLs, or YouTube URLs.
Basic Video Analysis
Analyzing Video Segments
Analyze specific time segments of a video.
Use Cases for Video Analysis
- Video Summarization: Generate summaries of video content
- Transcription: Extract spoken text from videos
- Key Moment Identification: Find important scenes or events
- Content Analysis: Understand video themes and topics
- YouTube Analysis: Analyze YouTube videos directly from URLs
- Meeting Analysis: Summarize meeting recordings
Configuration
ImageAnalysisConfig
- model: Required - “gemini-3-flash-preview” or “gemini-3-pro-preview”
- analysis_prompt: Required - Detailed prompt for what to analyze
- output_type: Optional - Pydantic schema for structured output
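The required and optional fields above can be checked before a request is made. The sketch below works on a plain dict that uses the documented field names. It is a hypothetical pre-flight helper, not part of VisionTool, and the real ImageAnalysisConfig class may validate differently.

```python
from typing import List

# Model names taken from the field description above.
ALLOWED_MODELS = {"gemini-3-flash-preview", "gemini-3-pro-preview"}


def check_image_config(config: dict) -> List[str]:
    """Return a list of problems found in a config-like dict (empty if none)."""
    problems = []
    model = config.get("model")
    if model not in ALLOWED_MODELS:
        problems.append(f"model must be one of {sorted(ALLOWED_MODELS)}, got {model!r}")
    prompt = config.get("analysis_prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("analysis_prompt is required and must be non-empty text")
    # output_type is optional, so its absence is not reported.
    return problems
```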
VideoAnalysisConfig
- model: Required - “gemini-3-flash-preview” or “gemini-3-pro-preview”
- analysis_prompt: Required - Detailed prompt for what to analyze
- start_offset: Optional - Start time (e.g., “30s”, “1m30s”)
- end_offset: Optional - End time (e.g., “2m”, “120s”)
- fps: Optional - Frames per second to analyze (default: 1)
- output_type: Optional - Pydantic schema for structured output
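The offset strings shown for start_offset and end_offset can be converted to seconds and sanity-checked before use. This is a rough sketch that assumes only the minute/second forms from the examples above ("30s", "1m30s", "2m", "120s"). The tool may accept other forms, and the helper names are hypothetical.

```python
import re
from typing import Tuple

# Optional minutes part followed by optional seconds part, e.g. "1m30s".
_OFFSET_RE = re.compile(r"^(?:(\d+)m)?(?:(\d+)s)?$")


def offset_to_seconds(offset: str) -> int:
    """Convert an offset such as "30s", "1m30s", or "2m" to whole seconds."""
    match = _OFFSET_RE.match(offset.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized offset: {offset!r}")
    minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return minutes * 60 + seconds


def check_segment(start_offset: str, end_offset: str) -> Tuple[int, int]:
    """Return (start, end) in seconds, rejecting empty or reversed segments."""
    start = offset_to_seconds(start_offset)
    end = offset_to_seconds(end_offset)
    if end <= start:
        raise ValueError("end_offset must be later than start_offset")
    return start, end
```

For example, `check_segment("30s", "2m")` describes a 90-second segment running from second 30 to second 120.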
Using Images/Videos from Previous Blocks
Get visual content from previous workflow blocks.
Output Structure
ImageAnalysisResult
- analysis: Text or Pydantic model dict
- model: Model used for analysis
- tokens_used: Number of tokens consumed
VideoAnalysisResult
- analysis: Text or Pydantic model dict
- model: Model used for analysis
- duration_analyzed: Duration of video analyzed
- tokens_used: Number of tokens consumed
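Because analysis can be either text or a dict, downstream code usually needs to handle both shapes. The sketch below assumes the result is available as a plain mapping keyed by the documented field names. The actual result object may expose attributes instead, and the helper name is illustrative.

```python
import json
from typing import Any, Mapping


def analysis_as_text(result: Mapping[str, Any]) -> str:
    """Return the analysis as display text, whichever shape it arrived in."""
    analysis = result["analysis"]
    if isinstance(analysis, str):
        # Free-text analysis: pass through unchanged.
        return analysis
    # Structured analysis: serialize the dict in a stable, readable form.
    return json.dumps(analysis, indent=2, sort_keys=True)
```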
Credit Costs
Vision analysis consumes credits based on model and content complexity:
- Flash model: ~1-3 credits per analysis
- Pro model: ~2-5 credits per analysis
- Video analysis: Additional credits based on duration and fps
Credit costs vary based on image/video size, analysis complexity, and model used. Simple analyses cost fewer credits than complex structured extractions.
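For rough budgeting, the approximate per-analysis ranges above can be multiplied out for a batch. This is only a back-of-the-envelope sketch: the ranges are the approximate figures quoted here, the additional video credits are not quantified in this section and are therefore left out, and the function name is hypothetical.

```python
from typing import Tuple

# Approximate (low, high) credits per analysis, as quoted above.
CREDIT_RANGES = {"flash": (1, 3), "pro": (2, 5)}


def estimate_base_credits(model_tier: str, analyses: int) -> Tuple[int, int]:
    """Return a (low, high) credit estimate for a batch, excluding video extras."""
    if analyses < 0:
        raise ValueError("analyses must be non-negative")
    low, high = CREDIT_RANGES[model_tier]
    return low * analyses, high * analyses
```

Under these figures, 100 Flash analyses would fall somewhere between 100 and 300 credits.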
Limitations
- Direct file upload: Limited to less than 20MB (use URL for larger files)
- Video analysis: Limited to ~1 hour via URL
- Social media content: Images/videos downloaded first (may take extra time)
- Base64 input: Only supported for images, not videos
- Video segment analysis: Requires valid start/end offsets
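Two of the limitations above are easy to check before submitting content. The sketch below assumes 20MB means 20 * 1024 * 1024 bytes and uses hypothetical labels ("file", "url", "base64") for the source kind. It is a pre-flight helper for illustration, not something VisionTool provides.

```python
from typing import List

# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_DIRECT_UPLOAD_BYTES = 20 * 1024 * 1024


def check_source(kind: str, media: str, size_bytes: int = 0) -> List[str]:
    """Check a source against the documented limits.

    kind is "file", "url", or "base64"; media is "image" or "video".
    Returns a list of problems (empty if none were found).
    """
    problems = []
    if kind == "file" and size_bytes >= MAX_DIRECT_UPLOAD_BYTES:
        problems.append("file is 20MB or larger; pass a URL instead of uploading directly")
    if kind == "base64" and media == "video":
        problems.append("base64 input is only supported for images")
    return problems
```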
Best Practices
Use Structured Output for Data Extraction
When extracting specific data, use Pydantic schemas for reliable structured output.
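As an illustration of what such a schema might look like, here is a small Pydantic model for receipt data. The class and field names are invented for this example. A class like `Receipt` is the kind of object the output_type option refers to, but how it is passed to VisionTool is not shown because that call is not covered in this section. The sample dict is a stand-in for a returned analysis, not real output.

```python
from typing import List, Optional

from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    amount: float


class Receipt(BaseModel):
    merchant: str
    total: float
    currency: Optional[str] = None  # left empty when not visible on the receipt
    items: List[LineItem] = []


# Hypothetical stand-in for the dict a structured analysis might return.
sample_analysis = {
    "merchant": "Corner Cafe",
    "total": 8.25,
    "items": [
        {"description": "Coffee", "amount": 3.5},
        {"description": "Bagel", "amount": 4.75},
    ],
}

# Validating the dict turns it into typed objects and rejects missing fields.
receipt = Receipt(**sample_analysis)
```

Keeping schemas small and marking uncertain fields as optional tends to make extraction more forgiving when part of the image is unreadable.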
Optimize Video Analysis
Use start_offset and end_offset to analyze only relevant segments, reducing processing time and costs.
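To see why trimming helps, consider how many frames are sampled. The sketch below assumes the number of frames analyzed is roughly the segment duration multiplied by fps. The actual relationship between frames, processing time, and credits is not specified in this section, and the function name is illustrative.

```python
def frames_sampled(duration_seconds: float, fps: float = 1.0) -> int:
    """Approximate number of frames analyzed for a segment of the given length."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps must be positive")
    return int(duration_seconds * fps)


# A full 10-minute video at the default 1 fps.
full_video = frames_sampled(600)

# Only the segment from 30s to 2m (90 seconds) at the same rate.
trimmed_segment = frames_sampled(120 - 30)
```

Under this assumption the trimmed segment samples 90 frames instead of 600.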
Choose Appropriate Model
Use Flash for speed, Pro for complex analysis requiring higher accuracy.
Handle Large Files
For files of 20MB or more, use URLs instead of direct uploads.
Use Cases
OCR & Text Extraction
Extract text from images, documents, and screenshots.
Content Moderation
Detect inappropriate content in images and videos.
Product Analysis
Extract product information from photos.
Video Summarization
Generate summaries and transcripts from videos.
Object Detection
Identify objects, people, and products in images.
YouTube Analysis
Analyze YouTube videos directly from URLs.
Related Features
- Image Generation - Generate images for analysis
- Video Generation - Generate videos that can be analyzed
- Configs - Configure vision analysis settings
Vision analysis is a powerful tool for extracting information from visual content. Use structured output schemas for reliable data extraction.