We had 48 hardcoded Remotion templates that made every client's video look the same. Here's how we replaced them with an AI-driven scene description language that treats layout as data — and never wrote another template again.
The Template Trap
When we first built programmatic video generation into our marketing platform, we did what everyone does: we wrote templates. Four Remotion compositions — Feature Announcement, Product Demo, Social Teaser, Release Notes — each in three durations and four visual styles.
That's 48 variants. Every one of them was a monolithic React component with hardcoded layout positions, bespoke animation timing, and pixel-perfect element placement that only worked at one aspect ratio. Adding a new element meant touching every file. Supporting square format for social meant forking everything. Again.
The worst part? A fintech startup and a food delivery app got the exact same layout. Different hex colors, same structure. The videos looked generated because they were.
Treating Layout as Data, Not Code
We kept running into the same wall: we were encoding design decisions in React components when we should have been encoding them in data. Video layouts are a spatial problem — where does the title go, how do the bullet points reveal, when does the scene crossfade. That's not a code problem. That's a direction problem.
So we tried something different. Instead of writing more templates, we designed a structured format that could describe any marketing video scene — and handed the actual composition work to an LLM.
We'll be honest, we weren't sure this would work. Spatial reasoning felt like a stretch for a language model. But it turns out that if you give the AI tight enough constraints — a fixed vocabulary of element types, bounded numeric ranges, validated fields — it's surprisingly good at arranging things on screen. The trick was never asking it to be creative in an open-ended way. We gave it a box to work in, and it filled the box well.
The Scene Description Language
This was the hard part. Not the AI integration — that was relatively straightforward once the schema existed. Designing a vocabulary that was expressive enough for professional-looking motion graphics but rigid enough to never produce garbage output took us longer than we'd like to admit.
We ended up with seven element types: text, badge, button, list, image, shape, and container. Each one maps to a specific visual component — badges are pill-shaped labels, containers can render as browser windows or phone mockups or terminals, lists have per-item stagger timing.
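The vocabulary above can be sketched as a discriminated union. This is an illustrative shape, not the production schema — every field name here is a guess:

```typescript
// Hypothetical sketch of the seven-element vocabulary; field names
// are illustrative, not the production schema.
const ELEMENT_TYPES = [
  "text", "badge", "button", "list", "image", "shape", "container",
] as const;

type ElementType = (typeof ELEMENT_TYPES)[number];

// Each type narrows to its own props, so the renderer can switch on
// `type` and get the right fields for free.
type SceneElement =
  | { type: "text"; content: string; fontSize: number }
  | { type: "badge"; label: string }                      // pill-shaped label
  | { type: "button"; label: string }
  | { type: "list"; items: string[]; staggerSeconds: number } // per-item stagger
  | { type: "image"; src: string }
  | { type: "shape"; shape: "rect" | "circle" }
  | {
      type: "container";                                  // browser / phone / terminal frame
      variant: "browser" | "phone" | "terminal";
      children: SceneElement[];
    };

// Guard used when checking AI output against the fixed vocabulary.
function isElementType(value: string): value is ElementType {
  return (ELEMENT_TYPES as readonly string[]).includes(value);
}
```

A fixed, closed vocabulary like this is what lets validation reject anything outside the box.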
Everything is positioned with percentages. {x: 50, y: 40} means center-horizontal, slightly above middle. We added an anchor point system — nine positions like center, topLeft, bottomRight — so you can say "put the bottom-left corner of this element at coordinates 10, 90" without doing offset math.
This sounds like a small thing. It wasn't. Percentage-based positioning is the reason we can render the same scene description at 1920×1080 and 1080×1080 without changing anything. That one decision saved us weeks of work we would have otherwise spent maintaining parallel layouts.
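The anchor math is small enough to sketch in full. Assuming a nine-point anchor grid as described, resolving a placement to a pixel position looks roughly like this (names are illustrative):

```typescript
// Percentage + anchor positioning sketch. x/y are percentages of the
// canvas; the anchor names which point of the element lands there.
type Anchor =
  | "topLeft" | "top" | "topRight"
  | "left" | "center" | "right"
  | "bottomLeft" | "bottom" | "bottomRight";

interface Placement { x: number; y: number; anchor: Anchor }

// Fraction of the element's own size to subtract on each axis.
const ANCHOR_OFFSETS: Record<Anchor, { fx: number; fy: number }> = {
  topLeft: { fx: 0, fy: 0 },     top: { fx: 0.5, fy: 0 },      topRight: { fx: 1, fy: 0 },
  left: { fx: 0, fy: 0.5 },      center: { fx: 0.5, fy: 0.5 }, right: { fx: 1, fy: 0.5 },
  bottomLeft: { fx: 0, fy: 1 },  bottom: { fx: 0.5, fy: 1 },   bottomRight: { fx: 1, fy: 1 },
};

// Resolve to the element's top-left corner in pixels. Because inputs
// are percentages, the same placement works at any canvas size.
function resolveTopLeft(
  p: Placement,
  canvasW: number, canvasH: number,
  elemW: number, elemH: number,
): { left: number; top: number } {
  const { fx, fy } = ANCHOR_OFFSETS[p.anchor];
  return {
    left: (p.x / 100) * canvasW - fx * elemW,
    top: (p.y / 100) * canvasH - fy * elemH,
  };
}
```

So "put the bottom-left corner at 10, 90" is `{x: 10, y: 90, anchor: "bottomLeft"}` — no offset math in the scene description itself.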
Each element gets two animation properties: an entrance (how it appears — slide up, spring pop, fade in, typewriter effect) and an optional continuous animation (subtle ongoing motion — a floating drift, a pulse, a Ken Burns zoom). The AI picks both per element. We validate the whole thing server-side with a schema before it hits the renderer. If the AI returns something malformed, we catch it there and fall back to a simpler deterministic composition.
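A minimal sketch of that server-side gate, assuming bounded percentage positions, a positive font size, and a fixed animation vocabulary (the real system validates a full schema; names and ranges here are illustrative):

```typescript
// Validation sketch: bounded ranges plus a closed animation vocabulary
// mean malformed AI output is rejected before it reaches the renderer.
const ENTRANCES = ["slideUp", "springPop", "fadeIn", "typewriter"] as const;
const CONTINUOUS = ["float", "pulse", "kenBurns"] as const;

interface RawElement {
  x?: unknown; y?: unknown; fontSize?: unknown;
  entrance?: unknown; continuous?: unknown;
}

function isPercent(v: unknown): v is number {
  return typeof v === "number" && v >= 0 && v <= 100;
}

// Returns true only when every field is inside its bounds. A single
// failure discards the scene and triggers the deterministic fallback.
function validateElement(el: RawElement): boolean {
  if (!isPercent(el.x) || !isPercent(el.y)) return false;
  if (typeof el.fontSize !== "number" || el.fontSize <= 0) return false;
  if (!(ENTRANCES as readonly unknown[]).includes(el.entrance)) return false;
  if (el.continuous !== undefined &&
      !(CONTINUOUS as readonly unknown[]).includes(el.continuous)) return false;
  return true;
}
```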
Why Frame-Based Animation Matters
Our videos render on AWS Lambda. There's no browser animation runtime there, no requestAnimationFrame, no CSS transition engine. Every frame gets rendered independently in parallel across dozens of workers. So every animation has to be a pure function of the frame number — give it frame 247 and it produces the exact same output every time, no matter which Lambda instance runs it.
We built a small animation engine on top of Remotion's interpolate() and spring() primitives. Entrance animations combine opacity, translation, and scale. A five-item list can stagger its reveals 0.2 seconds apart just by incrementing a delay value per item. Continuous animations use sine waves for floating motion, subtle scale oscillation for pulses, and slow pan-and-zoom for Ken Burns effects on images. They're tiny details, but they make the output feel like motion graphics instead of a slideshow.
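The frame-purity constraint and the stagger trick can be shown with plain math. The production engine uses Remotion's interpolate() and spring(); this sketch replaces them with hand-rolled equivalents, and the frame rate and amplitudes are assumed values:

```typescript
// Every animation is a pure function of the frame number: frame 247
// produces identical output on any Lambda worker. Constants here are
// illustrative (production uses Remotion's interpolate()/spring()).
const FPS = 30;

// Linear 0..1 entrance progress, clamped at both ends.
function entranceProgress(frame: number, delayFrames: number, durationFrames: number): number {
  const t = (frame - delayFrames) / durationFrames;
  return Math.min(1, Math.max(0, t));
}

// A five-item list staggers reveals 0.2s apart just by scaling the
// delay with the item index.
function itemDelayFrames(index: number, staggerSeconds = 0.2): number {
  return Math.round(index * staggerSeconds * FPS);
}

// Continuous floating drift: a sine wave of the frame number, a few
// pixels of vertical motion.
function floatOffsetPx(frame: number, amplitudePx = 6, periodSeconds = 3): number {
  return amplitudePx * Math.sin((2 * Math.PI * frame) / (periodSeconds * FPS));
}
```

Note there is no hidden state anywhere — no elapsed-time accumulator, no animation clock — which is exactly what lets dozens of workers each render an arbitrary slice of frames.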
We spent a while trying to get the spring physics right for the "pop" entrance animation — the damping and stiffness values went through maybe twenty iterations before the bounce felt natural without being distracting. That kind of tuning doesn't show up in architecture diagrams but it's where the output quality actually lives.
Making It Brand-Aware
Generic videos with different color swatches were the whole problem we were trying to solve, so brand awareness had to go deeper than palette swapping.

When someone generates a video, we pull their full brand context from the database: colors (primary, secondary, accent, plus any extended palette we extracted from their codebase), font names and actual font files, their logo, and what we call "brand intelligence" — an elevator pitch, key features, selling points, and tone of voice that we extract automatically from their connected GitHub repository.
All of this gets fed to the AI when it composes scenes. It writes real marketing copy for headlines and CTAs based on what the product actually does. It adapts layout density and visual weight to match the brand's personality. A developer tools company gets clean, minimal compositions. A consumer app gets bolder, more energetic layouts.
The font loading was a fun challenge. We inject @font-face declarations dynamically at render time from remote font file URLs. It means the brand's actual typeface shows up in the video — their heading font on titles, their body font on descriptions. We're consistently surprised by how much this one detail moves the needle on whether output feels "on-brand" or "obviously generated."
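The injection itself is a small amount of string-building. A sketch of the idea, with hypothetical helper and field names, assuming woff2 font files:

```typescript
// Sketch of injecting a brand's remote typeface at render time.
// Helper and field names are hypothetical.
function buildFontFaceCss(family: string, url: string, weight = 400): string {
  return [
    "@font-face {",
    `  font-family: '${family}';`,
    `  src: url('${url}') format('woff2');`,
    `  font-weight: ${weight};`,
    "  font-display: block;", // block until the real font loads; frames render headless
    "}",
  ].join("\n");
}

// In the composition, the CSS would land in a <style> tag before the
// first frame renders, e.g.:
//   <style>{buildFontFaceCss(brand.headingFont, brand.headingFontUrl)}</style>
```

The key point is that the declaration is built per render from the brand record, so the same composition code serves every client's typeface.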
The Full Pipeline
The request flow is pretty linear: assemble brand context from the database, feed it to the AI scene generator, validate the returned JSON against our schema, pick the right Remotion composition for the duration and aspect ratio, ship it to Lambda for parallel rendering, then poll until it's done and store the result.
A 30-second video renders in under a minute. Lambda parallelizes across frame chunks, so render time scales with compute rather than video length.
The fallback system deserves a mention because it's boring and that's the point. If the AI call fails for any reason — timeout, bad output, rate limit — we generate deterministic scenes that still use the brand's real colors, name, and description. They're simpler layouts, but they're never broken. In practice the fallback fires on a small percentage of requests. But it means our effective render success rate is 100%, and we sleep better knowing that nobody's brand is getting a white screen.
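The try-AI-then-fallback decision is simple enough to sketch. Function and field names here are hypothetical, and the AI call is abstracted to a plain function:

```typescript
// Sketch of the try-AI-then-fallback flow. The fallback path never
// touches the AI, so a timeout, rate limit, or malformed response
// still yields a valid, on-brand (if simpler) composition.
interface Brand { name: string; primaryColor: string; description: string }
interface Scene { title: string; accentColor: string; body: string }

// Returns a validated scene, or null when the output failed schema checks.
type SceneGenerator = (brand: Brand) => Scene | null;

// Deterministic fallback: a simple layout built from the brand's real
// name, color, and description -- never a white screen.
function fallbackScene(brand: Brand): Scene {
  return { title: brand.name, accentColor: brand.primaryColor, body: brand.description };
}

function composeScene(brand: Brand, generate: SceneGenerator): Scene {
  try {
    const scene = generate(brand); // AI call + schema validation
    if (scene !== null) return scene;
  } catch {
    // timeout, rate limit, malformed JSON -- fall through to fallback
  }
  return fallbackScene(brand);
}
```

Because both branches return the same Scene type, everything downstream — composition selection, Lambda rendering, polling — is identical whether the AI succeeded or not.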
What We'd Tell Our Past Selves
Designing the constraint schema was more important than any prompt engineering we did. The schema is what makes the output reliable — not clever instructions, not few-shot examples, not temperature tuning. The AI can't place text at -500% or set a font size to zero because the schema literally won't accept it. We wish we'd internalized that earlier instead of spending time on prompt iteration.
Commit to percentages for positioning from day one. We briefly considered pixel-based coordinates and we're glad we didn't go down that road. Every time we add a new aspect ratio, it just works. No layout code changes.
The fallback system was one of those things that felt like over-engineering when we built it. It's not. When you're generating content that represents someone's brand, you can't ship failures. The boring reliability code is what makes the flashy AI code viable in production.
And the best signal that the architecture is right: every major feature we've added since — voiceover support, logo placement, brand font loading — plugged in without rethinking anything. Voiceover was adding an audio component. Logos were a new image source type. Fonts were a loader function. We haven't written a new template in months, and we don't think we ever will again.
