My journey into the top 14 contributors to Remotion


I'm now among the top 14 (out of 296) contributors to Remotion.

For those who have never had the pleasure of working with it, Remotion is a popular programmatic video creation framework with 25,000+ GitHub stars, used by companies like FAL AI (337,000,000 USD raised), Creatify AI (18,500,000 USD raised), Extreme Reach (over 1,000 employees) and many, many others. To put it in perspective, that places it in the top 0.000535% of all repositories on GitHub by stars (popularity), or the top 0.0048% of repositories that have at least one star.

Here's how I got there - starting with giving the Python SDK an uplift and ending with shipping a full AI-powered video generation template.

How it all started

I started using Remotion over three years ago as part of my job. It seemed like a unique take on video creation that I hadn't really seen before - very intuitive to use, yet powerful.

Quickly enough, I noticed that creating templates with many moving parts takes a long time (and I had to create a bunch of those). Having many years of experience with the Unity game engine, I wrote a plugin that exported Unity's animation format to a custom timeline format, which I then consumed in a Remotion project using a relatively simple hook.

This allowed me to create all the required animations in a tenth of the time - using Unity's GUI.

Even better - it allowed non-tech folks on the team to do this without my involvement.

Python SDK improvements

Funny enough, on my next job I happened to work with Remotion again. But this time I was the one who introduced it into our tech stack. We had a pretty strict requirement on video rendering for our users - it had to be very fast, under 10 seconds for a 30-second clip. Remotion allowed us to achieve this as it supports distributed rendering on AWS Lambda (getting AWS Lambda limits increased was a painful story worth a separate blog post).

I was using Remotion's Python SDK for Lambda and noticed that type support had some issues. So I decided this was a perfect opportunity to give back to the project I used so often.

I did a cleanup and light refactoring of their Python SDK: fixed typing, added mypy to catch type errors and the Black formatter to keep the code consistent, and added more tests (#5533, #5379, #5639, #5654).

My PRs were swiftly accepted - Jonny Burger, the maker of Remotion, does an amazing job with the project, kudos to him.

Along the way I also found a few smaller issues in templates which I also fixed (#5620, #5839).

Contributing a prompt-to-video template

Once I got comfortable with the project through the Python SDK work, I decided it was time to add something of more significant value. This resulted in a sizeable #5867.

Nowadays, short-form video is dominating the world - it is the most consumed video format as of the time of writing. Combined with AI, it works great for creating short animated stories about history, philosophy, math, and many other subjects.

So the idea was to add a prompt-to-video template - it should increase Remotion adoption and hopefully help the project.

The requirements I set for it were the following:

  1. Super smooth developer experience - and we are talking as smooth as it gets: the user should only perform actions that directly contribute to the output. So no setup, no messing with config files, etc. Just run it and enjoy the result.
  2. The quality of videos must be on the level of commercial tools.
  3. Proper separation of concerns - the renderer must be part of the Remotion bundle (as it usually lives on a worker that renders it), and story/timeline generation must be separate - so developers can easily integrate it into their apps without having to modify the project.

To achieve this, I decided to write a dedicated CLI that can be used to generate content and a timeline in a custom format, which will be used by the Remotion template to render a video from it.

Architecture Overview

The complete solution consists of a 4-stage pipeline:

Input (title + topic)
          │
          ▼
┌───────────────────┐
│ Story Generation  │  ← OpenAI with structured output
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ Asset Generation  │  ← OpenAI images + ElevenLabs voice
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ Timeline Creation │  ← Frame-accurate synchronization of images, voice and subtitles
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ Video Rendering   │  ← Remotion composition
└───────────────────┘
          │
          ▼
    Output (MP4)

Each stage uses defined data structures and can be replaced with the end user app's custom business logic (e.g., you can skip the content generation step and use your existing content matched to the required schema).
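To make that replaceability concrete, here is a rough sketch of the stage boundaries. The type and function names below are illustrative only, not the template's actual API:

type StoryInput = { title: string; topic: string };
type Slide = { text: string; imageDescription: string };
type Story = { script: string; slides: Slide[] };
type Assets = { imageUrls: string[]; voiceoverUrl: string };
type Timeline = {
  shortTitle: string;
  elements: unknown[];
  text: unknown[];
  audio: unknown[];
};

// Each stage is a function over typed data, so any stage can be swapped out
// for your own business logic (e.g. feed in existing content that already
// matches the schema and skip story generation entirely).
declare function generateStory(input: StoryInput): Promise<Story>;
declare function generateAssets(story: Story): Promise<Assets>;
declare function buildTimeline(story: Story, assets: Assets): Timeline;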

Stage 1: Story Generation

To achieve consistent output, we use LLMs with structured outputs. The temperature is set to 1 so we get creative results that make the videos more diverse and fun. Zod schemas are used across the project to ensure proper typing.
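For illustration, a story schema in this style could look roughly like the following (a simplified sketch - the template's actual schemas carry more detail, such as character descriptions):

import { z } from "zod";

// Simplified sketch of a structured-output schema.
const SlideSchema = z.object({
  text: z.string(),
  imageDescription: z.string(),
});

const StorySchema = z.object({
  shortTitle: z.string(),
  slides: z.array(SlideSchema),
});

type StoryOutput = z.infer<typeof StorySchema>;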

But as soon as I added the OpenAI SDK, I ran into issues with Zod. OpenAI uses a newer version of Zod, while Remotion uses an older one that is pinned across all packages and can't be changed. After a quick search I found zod-to-json-schema, which allows you to convert Zod schemas to JSON schemas and use those in requests to OpenAI.

Since we can't use the OpenAI SDK because of the Zod version mismatch, we make good old-fashioned requests to their API using fetch.

The request ended up looking like this:

import { zodToJsonSchema } from "zod-to-json-schema";

// Convert the Zod schema into a JSON schema that OpenAI's structured outputs accept.
const jsonSchema = zodToJsonSchema(schema) as any;
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4.1",
    messages: [{ role: "user", content: prompt }],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "response",
        schema: {
          type: jsonSchema.type || "object",
          properties: jsonSchema.properties,
          required: jsonSchema.required,
          additionalProperties: jsonSchema.additionalProperties ?? false,
        },
        strict: true,
      },
    },
  }),
});
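The response is then parsed and validated against the same Zod schema, along these lines (a sketch, continuing the snippet above):

// Pull the structured JSON out of the chat completion and validate it
// against the original Zod schema.
const data = await res.json();
const content = data.choices[0].message.content;
const story = schema.parse(JSON.parse(content));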

Actual story generation happens in two stages:

  1. We generate the full story script based on the story title and topic. The prompt instructs the LLM to follow storytelling best practices, as you would expect.

  2. The second pass splits the story into slides and adds image descriptions to each one. Each slide's image description includes both a description of the image itself and a detailed character description with everything the LLM needs to keep characters consistent across slides.

The separation exists in part because developers may want to allow users to use their own story as input, skipping step one.
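Put together, the two passes reduce to something like this (the function names are hypothetical and only illustrate the flow):

// Hypothetical two-pass orchestration.
declare function generateScript(input: { title: string; topic: string }): Promise<string>;
declare function generateSlides(script: string): Promise<
  { text: string; imageDescription: string; characterDescription: string }[]
>;

async function generateStoryboard(title: string, topic: string) {
  // Pass 1: full story script (skipped when the user supplies their own story).
  const script = await generateScript({ title, topic });
  // Pass 2: split into slides with image and character descriptions.
  return generateSlides(script);
}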

Stage 2: Asset Generation

Next, we use detailed slide image and character descriptions from the previous step to generate images using OpenAI.

Most TTS APIs unfortunately return pure audio, but we need to synchronise subtitles with the audio to get polished results. These small details are what differentiate the videos you enjoy watching from those that always feel a bit off. To achieve this we need precise timestamps for each word.

I used ElevenLabs, as their API returns character-level timestamps, which are later converted to word-level timestamps during the timeline generation phase. You could use any other provider and then run the audio through timestamped Whisper (or any other model that returns timestamps) for speech-to-text. I tried this approach when testing and it worked really well in the vast majority of cases. To make it fully precise, you'd need to pass the result through an LLM one more time (or even use a simple loop with word matching) to fix words that Whisper may have recognised incorrectly. That gets you correct timestamps in nearly all cases and lets you use any TTS provider out there.
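As an illustration, collapsing character-level timestamps into word-level ones is roughly the following (a simplified sketch, assuming ElevenLabs-style parallel arrays with one entry per character):

type WordTiming = { word: string; startSec: number; endSec: number };

// Split on spaces, taking the first character's start time and the last
// character's end time for each word.
function charsToWords(
  chars: string[],
  startTimes: number[],
  endTimes: number[],
): WordTiming[] {
  const words: WordTiming[] = [];
  let current = "";
  let wordStart = 0;

  chars.forEach((ch, i) => {
    if (ch === " ") {
      if (current) {
        words.push({ word: current, startSec: wordStart, endSec: endTimes[i - 1] });
        current = "";
      }
    } else {
      if (!current) wordStart = startTimes[i];
      current += ch;
    }
  });

  if (current) {
    words.push({
      word: current,
      startSec: wordStart,
      endSec: endTimes[endTimes.length - 1],
    });
  }
  return words;
}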

Stage 3: Timeline Creation

This custom timeline is the bridge between AI-generated assets and video rendering. It's a JSON structure containing four parallel tracks:

Timeline Structure:
├── shortTitle    → Display title for the intro slide (shown only for a few frames)
├── elements[]    → Background images with transitions and animations
├── text[]        → Synchronized text segments
└── audio[]       → Voice audio references

As this is a template, it is not intended to cover a wide range of transitions, but I wanted to include all the important basic ones so users get smooth, polished-looking videos. Each background has enter and exit transitions of the following types:

const BackgroundTransitionTypeSchema = z.union([
  z.literal("fade"),
  z.literal("blur"),
  z.literal("none"),
]);

Apart from transitions, backgrounds and other elements can have animations.

At the time of writing, only one animation type is supported - scale - but the set can easily be extended by defining new types and adding their handlers in the renderer.

const ElementAnimationSchema = TimelineElementSchema.extend({
  type: z.literal("scale"),
  from: z.number(),
  to: z.number(),
});
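On the renderer side, handling such an animation boils down to interpolating over the element's frame range - roughly like this sketch (not the template's actual handler):

import { interpolate } from "remotion";

// Map the current frame within the element's duration onto a scale value.
// `from` and `to` come from ElementAnimationSchema above; the rest is illustrative.
function getScale(
  frame: number,
  elementStartFrame: number,
  elementDurationInFrames: number,
  from: number,
  to: number,
): number {
  return interpolate(
    frame - elementStartFrame,
    [0, elementDurationInFrames],
    [from, to],
    { extrapolateLeft: "clamp", extrapolateRight: "clamp" },
  );
}

The resulting value is then applied as a CSS transform on the element.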

The final background interface looks like this:

const BackgroundElementSchema = TimelineElementSchema.extend({
  imageUrl: z.string(),
  enterTransition: BackgroundTransitionTypeSchema.optional(),
  exitTransition: BackgroundTransitionTypeSchema.optional(),
  animations: z.array(ElementAnimationSchema).optional(),
});

For text elements, I added multiple vertical alignment options so the LLM (or user) can pick different positioning based on the slide's content.

const TextElementSchema = TimelineElementSchema.extend({
  text: z.string(),
  position: z.union([
    z.literal("top"),
    z.literal("bottom"),
    z.literal("center"),
  ]),
  animations: z.array(ElementAnimationSchema).optional(),
});

The last missing piece is the word chunking algorithm. We want to break the sentences on each slide into easily digestible groups, so users can read them quickly and don't feel bombarded with massive walls of text (and tiny fonts). After a few tries, a segment size of around 20 characters proved to work best for vertical video with a comfortable font size.

As we iterate through words when generating the timeline, we track the character index to map word boundaries back to audio timestamps. Each text segment gets a start time from its first character and an end time from its last character. This approach creates the karaoke-style effect where text appears word-by-word in sync with the voiceover.
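A minimal version of that chunking and timestamp mapping could look like this (a sketch that works on word-level timings, using the ~20-character target mentioned above):

type WordTiming = { word: string; startSec: number; endSec: number };
type TextSegment = { text: string; startSec: number; endSec: number };

// Group words into ~20-character segments; each segment takes its start time
// from its first word and its end time from its last word.
function chunkWords(words: WordTiming[], maxChars = 20): TextSegment[] {
  const toSegment = (group: WordTiming[]): TextSegment => ({
    text: group.map((w) => w.word).join(" "),
    startSec: group[0].startSec,
    endSec: group[group.length - 1].endSec,
  });

  const segments: TextSegment[] = [];
  let current: WordTiming[] = [];
  let length = 0;

  for (const w of words) {
    if (current.length > 0 && length + 1 + w.word.length > maxChars) {
      segments.push(toSegment(current));
      current = [];
      length = 0;
    }
    length += (current.length > 0 ? 1 : 0) + w.word.length;
    current.push(w);
  }
  if (current.length > 0) segments.push(toSegment(current));
  return segments;
}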

Animation Randomization

Static slideshows are rather boring. The template tries to make them more fun through randomized animations:

Ken Burns Effect: Background images alternate between zooming in and zooming out. This classic technique adds perceived motion to still images. The alternation is deterministic (odd/even slides).

Rotation: Each slide receives a subtle random rotation animation (up to 10 degrees). Combined with scaling, this creates a gentle floating effect that keeps the visuals engaging but not distracting.

Text Animations: Text segments get spring-based entrance animations with randomized scale and rotation. Some segments bounce, others slide in - the variety prevents the template from feeling too mechanical.

These randomizations happen at timeline generation time, not render time.
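A sketch of what that per-slide randomization might look like (the exact ranges here are illustrative):

// Decide animations for a slide at timeline generation time, not render time,
// so the result is baked into the timeline JSON.
function pickSlideAnimations(slideIndex: number) {
  // Ken Burns: deterministic alternation between zoom-in and zoom-out.
  const zoomIn = slideIndex % 2 === 0;
  const scale = zoomIn ? { from: 1, to: 1.15 } : { from: 1.15, to: 1 };

  // Subtle random rotation of up to 10 degrees in either direction.
  const rotationDegrees = (Math.random() * 2 - 1) * 10;

  return { scale, rotationDegrees };
}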

Stage 4: Video Rendering with Remotion

At this point, what's left is to render the custom timeline with Remotion. The timeline renderer consists of multiple components, as you would expect - one for each type of element in the timeline:

AIVideo Component: The orchestrator that loads the timeline JSON and creates a Remotion Sequence timeline from our timeline format.

Background Component: Renders full-screen images with Ken Burns animations. It calculates the current animation state based on the frame number and applies CSS transforms for scaling, rotation, and blur transitions.

Subtitle Component: Displays text with a stroke effect (black outline with white fill) for readability against any background. Uses Remotion's spring physics for smooth entrance animations.

The timeline's timestamps, expressed in seconds, are converted to frame numbers at 30 FPS, with a few extra frames for the intro slide. Each element type (background, text, audio) gets its own sequence track.
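A rough sketch of that mapping - assuming a simple list of timeline entries rather than the template's actual component - could look like this:

import React from "react";
import { Audio, Img, Sequence } from "remotion";

const FPS = 30;
const toFrames = (seconds: number) => Math.round(seconds * FPS);

type Entry = { startSec: number; endSec: number; imageUrl?: string; audioUrl?: string };

// Lay out each timeline entry as its own Remotion sequence, shifted by the
// intro slide's few frames.
const TimelineTrack: React.FC<{ entries: Entry[]; introFrames: number }> = ({
  entries,
  introFrames,
}) => (
  <>
    {entries.map((entry, i) => (
      <Sequence
        key={i}
        from={introFrames + toFrames(entry.startSec)}
        durationInFrames={toFrames(entry.endSec - entry.startSec)}
      >
        {entry.imageUrl ? <Img src={entry.imageUrl} /> : null}
        {entry.audioUrl ? <Audio src={entry.audioUrl} /> : null}
      </Sequence>
    ))}
  </>
);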

Conclusion

This template shows how modern AI APIs can be used to build end-to-end content creation pipelines. The key insight is to use proper structures and typing everywhere we can, and to use AI only for the content generation part.

However, this template can be used without AI - developers can easily use their own voiceovers, slide texts, and images and render a timeline based on those.

This template can be a great fit for various content formats - from educational explainers to social media stories to automated news summaries.

The full source code is available in the Remotion monorepo.