Veo 3.1 Review for Client Work: Honest Wins, Real Fails, Hybrid Workflow
40+ client clips later, here is where Google's Veo 3.1 crushes agency work, where it falls flat, and the hybrid pipeline we ship every brief with.
As of May 2026, Veo 3.1 is Google's most capable text-to-video model. We've used it for 40+ client clips across UGC, product reels, kinetic typography, and brand films. Here's an honest breakdown of where it crushes and where it falls flat for agency-grade work.
What Veo 3.1 is good at
The first time you generate an 8-second clip and the model returns physics that actually look right, you understand why the hype is not just hype. Veo 3.1 is the first text-to-video model we have used at the agency that delivers shot-grade output on the first or second try instead of the tenth.
Five things Veo 3.1 does exceptionally well for client work right now:
- Physical realism. Hair, fabric, water, smoke, glass, and skin all behave like the physical world. We have shot product splash scenes that would have cost €3,000 in studio time and gotten believable liquid behavior on the second prompt.
- Motion coherence. Subjects do not warp between frames. A dog running through a park stays the same dog. A coffee cup tilting pours liquid in the correct direction.
- Native audio sync. Veo 3.1 generates synchronized ambient sound, foley, and SFX with the visuals. Footsteps land on the right frames. Glass clinks when glass touches glass. This alone removes 40 minutes of post per clip.
- 8-second clips. Long enough for a hero shot, a product reveal, an establishing shot, or a UGC beat. Short enough to keep the model coherent end-to-end.
- Prompt adherence. When you write a clear, structured prompt with camera, subject, action, environment, and lighting specified, Veo 3.1 listens. This was the biggest leap from Veo 2.
Lighting deserves its own line. Veo 3.1 handles golden hour, harsh midday, soft overcast, neon nightlife, and studio softbox setups with cinematographer-level accuracy. We have stopped writing "cinematic lighting" in prompts because it now defaults there. We write specific lighting setups instead.
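The five-part prompt structure above (camera, subject, action, environment, lighting) can be kept consistent with a small template. This is a minimal sketch, not Google's API or our production code: the `ShotPrompt` class and its field names are hypothetical, and the sentence order is just one reasonable convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotPrompt:
    """Hypothetical container for the five prompt fields named above."""
    camera: str
    subject: str
    action: str
    environment: str
    lighting: str

    def render(self) -> str:
        # Refuse to render if any field is blank, so no element is silently skipped.
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"empty prompt fields: {', '.join(missing)}")
        # Join the fields in a fixed order so every prompt has the same shape.
        parts = [self.camera, self.subject, self.action, self.environment, self.lighting]
        return ". ".join(p.strip().rstrip(".") for p in parts) + "."


prompt = ShotPrompt(
    camera="Static medium close-up, 50mm lens",
    subject="A ceramic coffee cup on a wooden table",
    action="Steam rises as coffee is poured in",
    environment="Small cafe with a blurred window in the background",
    lighting="Soft overcast daylight from camera left",
)
print(prompt.render())
```

Note that the lighting field holds a specific setup rather than a generic "cinematic lighting" phrase, in line with the point above.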
For context on the underlying model, Google publishes its capability spec at deepmind.google/technologies/veo.
Where Veo 3.1 fails
The fails are the part most reviews skip. If you run an agency, these will cost you billable hours if you don't plan around them.
- Character consistency across clips. Generate the same character twice with the same prompt and you get two cousins, not the same person. For multi-shot narratives with a recurring face, Veo 3.1 alone is not the tool. Reference image conditioning helps but does not solve it fully.
- Exact-text rendering. Signs, packaging copy, neon text, screen UI, kinetic typography spelled out in the prompt. Veo writes letters that look like letters but spell nothing real. Brand names come out scrambled. We never let Veo render brand-critical text.
- Specific brand colors. Veo gets close to a hex value but never lands on it exactly. If your brand orange is #FF7A3B, Veo will give you a family of warm oranges that drift between shots. Useless for hard brand guidelines.
- Complex multi-character interactions. Two people talking, looking at each other, passing an object. Veo handles one focal subject beautifully and degrades fast as you add bodies and interaction logic.
- Precise camera moves. "Slow dolly in, then 90-degree pan right at second 3" is wishful thinking. Veo interprets camera language loosely. You get the vibe, not the move.
These limits define where the rest of our pipeline picks up the slack.
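The brand-color drift mentioned above can be made measurable during QC by comparing a sampled color against the brand hex. This is a minimal sketch under stated assumptions: it assumes you have already sampled an average RGB value from a frame (sampling is out of scope here), the function names and the tolerance of 10 are hypothetical, and plain RGB distance is a crude stand-in for a perceptual metric.

```python
import math


def hex_to_rgb(hex_color: str) -> tuple:
    """Convert a '#RRGGBB' string into an (r, g, b) tuple of ints."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_distance(a, b) -> float:
    """Euclidean distance between two RGB triples (0 means identical)."""
    return math.dist(a, b)


def within_brand_tolerance(sampled_rgb, brand_hex: str, tolerance: float = 10.0) -> bool:
    # The tolerance is an illustrative threshold, not a brand-guideline standard.
    return rgb_distance(sampled_rgb, hex_to_rgb(brand_hex)) <= tolerance


brand = "#FF7A3B"
print(hex_to_rgb(brand))
print(within_brand_tolerance((255, 122, 59), brand))
print(within_brand_tolerance((240, 130, 70), brand))
```

A check like this only flags drift. It does not fix it, which is why exact colors are better handled in compositing.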
Our hybrid workflow
We do not ship a client deliverable as raw Veo output. We ship a Veo + Remotion + ffmpeg + ElevenLabs composite. Here is the split that actually works:
- Veo 3.1 for hero shots (3 to 8 seconds). Product establishing shots, lifestyle moments, atmospheric B-roll, UGC-style talking moments with non-branded subjects. The shots where physical realism and lighting do the heavy lifting.
- Remotion for kinetic type, lower thirds, transitions, and brand overlays. Anything that requires exact text, exact color, or exact timing gets composited in React via Remotion. Brand-safe, programmatic, version-controlled.
- ffmpeg for the cut. Concatenation, trimming, format conversion, codec normalization. Veo outputs H.264 MP4 by default, which plays nicely on every platform, but client deliverables also need ProRes proxies and 9:16 / 1:1 / 16:9 exports built in one pass.
- ElevenLabs for brand VO. Veo's native audio is excellent for ambient and SFX. For voiceover that has to match a brand voice profile or a returning narrator across multiple deliverables, we route VO through ElevenLabs and lay it under the Veo visual track in Remotion.
The whole pipeline runs from a single brief into a single render queue. A 30-second campaign cut typically lands in under 4 hours of operator time including revisions. Read more about how we structure the production stack in our approach.
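The "one pass" export step can be sketched as a single ffmpeg invocation with one output per aspect ratio. This is a minimal illustration, not our production script: the preset table, file names, and target resolutions are assumptions, and it assumes a 16:9 landscape source that is center-cropped for the square and vertical versions. The code only builds and prints the command, it does not run ffmpeg.

```python
import shlex

# Hypothetical presets: name -> (crop filter, target width, target height).
# The crop expressions use ffmpeg's `iw` / `ih` input-size variables; crop is centered by default.
PRESETS = {
    "16x9": ("crop=iw:ih", 1920, 1080),
    "1x1": ("crop=ih:ih", 1080, 1080),
    "9x16": ("crop=ih*9/16:ih", 1080, 1920),
}


def build_export_command(source: str, stem: str) -> list:
    """Build one ffmpeg command that writes every preset from a single input."""
    cmd = ["ffmpeg", "-y", "-i", source]
    for name, (crop, width, height) in PRESETS.items():
        # Output options apply to the next output file, so each export
        # gets its own filter chain and codec settings.
        cmd += [
            "-vf", f"{crop},scale={width}:{height}",
            "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
            "-c:a", "aac",
            f"{stem}_{name}.mp4",
        ]
    return cmd


command = build_export_command("hero_cut.mp4", "client_campaign")
print(shlex.join(command))
```

For ProRes proxies, the same structure should work with ffmpeg's `prores_ks` encoder and a `.mov` output in place of `libx264` and `.mp4`. Treat the exact settings as something to verify against your delivery spec.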
Cost reality
Veo 3.1 via the Gemini API costs approximately $0.50 per 8-second clip as of May 2026. That price applies to standard quality generations. Higher-fidelity modes cost more, but for client deliverables the standard tier is enough 90% of the time.
What does that mean in agency math?
- 100 client clips = roughly $50 in raw generation cost
- A typical client campaign (10 hero clips + 20 B-roll cuts) = $15
- An entire month of UGC variants across 5 client brands = under $100
Compare that to a traditional production agency quoting €5,000 to €15,000 for a comparable shot volume with location, crew, talent, and post included. The cost delta is not 2x or 5x. On raw generation cost alone it is 100x or more. The strategic question is no longer "can we afford to test this concept" but "which of the 15 concepts on our shortlist do we want to ship this week."
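The agency math above is simple enough to script. This is a minimal sketch using the per-clip figure quoted in this article. The `takes_per_clip` parameter is our own addition to account for regenerations, and the comparison treats the €5,000 quote as roughly equal to dollars, which is a simplification.

```python
PRICE_PER_CLIP_USD = 0.50  # figure quoted in this article; verify against current pricing


def generation_cost(clips: int, takes_per_clip: int = 1,
                    price_per_clip: float = PRICE_PER_CLIP_USD) -> float:
    """Raw generation cost only; excludes operator time, VO, and compositing."""
    return clips * takes_per_clip * price_per_clip


def cost_ratio(traditional_quote: float, clips: int) -> float:
    """How many times larger a traditional quote is than raw generation cost."""
    return traditional_quote / generation_cost(clips)


print(generation_cost(100))        # 100 client clips
print(generation_cost(10 + 20))    # 10 hero clips + 20 B-roll cuts
print(generation_cost(30, 2))      # same campaign, two takes per clip
print(cost_ratio(5000, 100))       # low-end traditional quote vs 100 clips
```

Even at two takes per clip, the typical campaign stays at $30 in raw generation, so the ratio barely moves.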
For details on pricing structure and quotas, the official Gemini API docs are the source of truth.
Commercial rights and watermarking
Google permits commercial use of Veo 3.1 outputs generated through the Gemini API and Vertex AI under its standard terms. You own the right to use the generated footage in client work, advertising, and broadcast.
Every Veo 3.1 output carries an invisible SynthID watermark embedded into the pixel data. The watermark is robust to common transformations (resize, recompress, crop) and survives most editing pipelines. It does not affect visible quality but allows AI-generated content to be identified by Google's detector.
For client SOWs we recommend three disclosure lines:
- A clause stating that AI-assisted generation is part of the production process
- A clause confirming commercial use rights flow through to the client
- A clause acknowledging that SynthID watermarking is present on the deliverables
Most clients in 2026 do not care about the AI-assisted part. They care about turnaround speed and brand safety. The disclosure protects both sides.
When NOT to use Veo
Three categories where we still book real production:
- Talking-head testimonials. A real founder, a real customer, on camera. The credibility of the human face matters more than the polish. Use real footage.
- Exact product close-ups. Logo on packaging, label macro, screen UI, anything where the product must be photographically accurate to specification. Use product photography or 3D renders.
- Continuous narratives over 30 seconds. The 8-second cap means stitching, and stitching means character and lighting drift. For long-form stories we still rely on traditional production or hybrid live-action + Veo B-roll.
For everything else (UGC variants, product reels, atmospheric brand films, social cutdowns, paid social creative) Veo 3.1 plus our Remotion pipeline is now our default. See the full list of what we ship in our services.
If you want a parallel case study where AI tools replaced a full production cycle, read how we built an Etsy shop with AI agents in 48 hours.
FAQ
Can I use Veo 3.1 for commercial client work?
Yes. Google permits commercial use of Veo 3.1 outputs generated through the Gemini API and Vertex AI under its standard terms. Every output carries an invisible SynthID watermark for AI-content traceability. We disclose AI-generated footage in client SOWs and recommend you do the same.
How much does Veo 3.1 cost per video?
Approximately $0.50 per 8-second clip via the Gemini API as of May 2026. A full 100-clip client campaign costs around $50 in raw generation. Compare that to a traditional agency budget of €5,000 or more for equivalent shot volume.
What is the maximum length of a Veo 3.1 video?
Each generation maxes out at 8 seconds of continuous footage. Longer narratives require stitching multiple clips together, which introduces character and lighting consistency challenges. For sequences over 8 seconds we build the timeline in Remotion and use Veo clips as hero shots.
Does Veo 3.1 generate audio with the video?
Yes. Veo 3.1 generates native synchronized audio including ambient sound, sound effects, and dialogue. Audio quality is impressive for ambient and SFX, less reliable for brand-specific voice work, which is why we route brand voiceovers through ElevenLabs.
Want Veo-powered creative for your brand?
We turn a brief into a Veo + Remotion pipeline and ship the cut in days. Free first call, no pitch deck.
Book a Free Call →