Architecture
Overview
The pipeline has two stages, joined by the filesystem:
build.py ──► scenes/NN_<id>.html ──┐
         ──► audio/NN_<id>.mp3   ──┤
         ──► index.html          ──┘
                                   └──► hyperframes render ──► final.mp4
Stage 1 — build.py (content + composition)
build.py owns everything before rendering:
- Source of truth: the SCENES list at the top of the file defines all content — kicker, title, bullets, narration text, and implicit references to image and audio assets (a sketch of one entry follows this list).
- TTS: for each scene, calls ElevenLabs via the REST API and writes audio/NN_<id>.mp3. Files are cached — they are not re-fetched if they already exist.
- Timing: probes each audio file with ffprobe, then computes per-scene durations and start offsets with configurable lead/trail padding and crossfade overlap.
- Scene blocks: emits scenes/NN_<id>.html — one self-contained HyperFrames sub-composition per scene. Each block has its own <style>, composition root, GSAP timeline, and Ken Burns image.
- Host: emits index.html — a 44-line host that references the blocks via data-composition-src and contains all <audio> elements with absolute timestamps.
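A minimal sketch of what this implies in code, assuming a requests-based client. Only the id field, the NN_<id> output path, and the caching behavior are stated above; the other SCENES keys, the environment variable name, and VOICE_ID are illustrative:

import os
from pathlib import Path

import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]    # illustrative env var name
VOICE_ID = "..."                              # placeholder ElevenLabs voice id

SCENES = [
    {
        "id": "intro",                        # confirmed: drives the NN_<id> prefix
        "kicker": "...",                      # illustrative field names
        "title": "...",
        "bullets": ["...", "..."],
        "narration": "Text sent to ElevenLabs for TTS.",
    },
    # ... six more scenes ...
]

def tts(i: int, scene: dict) -> Path:
    """Fetch narration audio once; later runs reuse the cached MP3."""
    out = Path("audio") / f"{i:02d}_{scene['id']}.mp3"
    if out.exists():                          # cache hit: never re-fetch
        return out
    r = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={"text": scene["narration"]},
    )
    r.raise_for_status()
    out.write_bytes(r.content)                # response body is the MP3
    return out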
Stage 2 — hyperframes render (frame capture + encoding)
The HyperFrames CLI renders the composition deterministically:
- Compile: parses index.html, resolves sub-compositions and media, and builds a scene graph. Reports audioCount, videoCount, and total duration.
- Extract video frames: any <video> elements (none in this project) are decoded.
- Process audio: all <audio> elements in the host are mixed into a single track.
- Frame capture: Chrome/Puppeteer seeks the GSAP timeline to each frame time and screenshots the canvas. Four workers run in parallel by default (see the scheduling sketch after this list).
- Encode video: H.264 yuv420p @ 30fps via libx264.
- Assemble: mux video + audio into the final MP4.
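A rough model of the capture schedule, as a sketch (the function names are illustrative, not the HyperFrames API):

FPS = 30
WORKERS = 4

def frame_times(total_duration: float) -> list[float]:
    """Every timestamp the renderer must screenshot, at 30fps."""
    return [i / FPS for i in range(round(total_duration * FPS))]

def worker_chunks(times: list[float]) -> list[list[float]]:
    """Contiguous frame ranges, one per parallel capture worker."""
    size = -(-len(times) // WORKERS)          # ceiling division
    return [times[i:i + size] for i in range(0, len(times), size)]

# For this project's 182.572s composition: 5477 frames, at most 1370 per
# worker. Each worker seeks the GSAP timeline to t and screenshots; libx264
# then encodes the numbered frames as H.264 yuv420p.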
File tree
video/hf/
├── build.py # Stage 1 entrypoint (tracked in git)
├── index.html # Generated host composition
├── final.mp4 # Rendered output
│
├── scenes/ # Generated HyperFrames blocks (one per scene)
│ ├── 00_intro.html
│ ├── 01_llm.html
│ ├── 02_ai.html
│ ├── 03_mcpgw.html
│ ├── 04_mcpreg.html
│ ├── 05_skill.html
│ └── 06_outro.html
│
├── audio/ # ElevenLabs MP3s (generated, cached)
│ ├── 00_intro.mp3
│ └── ...
│
└── img/ # PNG image assets (tracked in git)
├── 00_intro.png
├── 01_llm.png
├── 02_ai.png
├── 03_mcpgw.png
├── 04_mcpreg.png
├── 05_skill.png
└── 06_outro.png
Naming convention
All assets share the same NN_<id> prefix, where NN is the zero-based scene index and <id> is the scene's id field from SCENES. This keeps audio, images, and scene blocks in sync without any mapping tables.
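In build.py terms the prefix is a single f-string (a sketch; the helper name is illustrative):

def asset_prefix(i: int, scene: dict) -> str:
    """00_intro, 01_llm, ...: shared by scenes/, audio/, and img/."""
    return f"{i:02d}_{scene['id']}"

# scenes/{prefix}.html, audio/{prefix}.mp3, and img/{prefix}.png all derive
# from the same value, so no mapping table is needed.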
HyperFrames composition structure
Host (index.html)
The host is intentionally minimal. It contains:
- One <div data-composition-src="..."> per scene block (the visual sub-compositions)
- Seven <audio> elements with absolute data-start times (scene start + lead pad)
- A #fade-out overlay clip for the closing fade
- A single GSAP timeline that only drives the final fade
<div id="root" data-composition-id="root" data-width="1920" data-height="1080"
data-start="0" data-duration="182.572">
<!-- Scene blocks -->
<div data-composition-id="scene-intro"
data-composition-src="scenes/00_intro.html"
data-start="0.000" data-duration="15.946"
data-track-index="1" data-width="1920" data-height="1080"></div>
<!-- ... more scenes ... -->
<!-- Audio — must live in the host, not inside sub-compositions -->
<audio id="aud-0" class="clip"
data-start="0.600" data-duration="14.446"
data-track-index="3" data-volume="1"
src="audio/00_intro.mp3"></audio>
<!-- ... -->
</div>
Audio must be in the host
HyperFrames only assembles audio declared directly in the root index.html. Audio elements inside sub-composition blocks (data-composition-src files) are rendered visually but their audio tracks are not extracted into the final MP4. Always declare <audio> in the host with absolute start times.
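In build.py this rule means the host emitter computes absolute starts itself. A sketch, assuming a helper like the following (emit_audio_tag is an illustrative name; the attribute values mirror the host example above):

PAD_LEAD = 0.6

def emit_audio_tag(i: int, prefix: str, scene_start: float,
                   audio_duration: float) -> str:
    """One host <audio> clip with an absolute timeline start."""
    start = scene_start + PAD_LEAD            # absolute, never block-relative
    return (
        f'<audio id="aud-{i}" class="clip" '
        f'data-start="{start:.3f}" data-duration="{audio_duration:.3f}" '
        f'data-track-index="3" data-volume="1" '
        f'src="audio/{prefix}.mp3"></audio>'
    )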
Scene blocks (scenes/NN_<id>.html)
Each scene block is a full standalone HTML file containing:
- A <div data-composition-id="scene-<id>"> root scoped to data-start="0" (time is relative within the block)
- All visual structure: accent bar, background fill, split-panel content, footer
- A dedicated GSAP timeline registered as window.__timelines["scene-<id>"]
- The Ken Burns effect on the image (#img-kb)
The HyperFrames runtime loads each block into an isolated context, finds its window.__timelines entry, and seeks it in sync with the host timeline, offset by data-start.
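A trimmed sketch of the block build.py could emit, as a str.format template. The composition root, the window.__timelines registration, and the #img-kb Ken Burns tween are confirmed above; the GSAP script path, tween values, and markup details are illustrative:

SCENE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><style>/* per-scene styles: accent bar, panels, footer */</style></head>
<body>
<div data-composition-id="scene-{id}" data-start="0" data-duration="{dur:.3f}"
     data-width="1920" data-height="1080">
  <img id="img-kb" src="img/{prefix}.png">
  <!-- accent bar, background fill, split-panel content, footer -->
</div>
<script src="gsap.min.js"></script>
<script>
  // Register the block's timeline under its composition id so the
  // HyperFrames runtime can find it and seek it with the host.
  window.__timelines = window.__timelines || {{}};
  window.__timelines["scene-{id}"] = gsap.timeline({{ paused: true }})
    .fromTo("#img-kb", {{ scale: 1.0 }},
            {{ scale: 1.08, duration: {dur:.3f}, ease: "none" }});  // Ken Burns
</script>
</body>
</html>"""

# build.py would write SCENE_TEMPLATE.format(id=scene["id"], dur=duration,
# prefix=prefix) to scenes/{prefix}.html.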
Timing model
scene_duration = PAD_LEAD + audio_duration + PAD_TRAIL
               = 0.6s     + X seconds      + 0.9s

scene_start[0] = 0
scene_start[i] = scene_start[i-1] + scene_duration[i-1] - OVERLAP
                                                          └─ 0.4s crossfade

total_duration = scene_start[-1] + scene_duration[-1]
Within each scene block, audio starts at data-start="0.6" (= PAD_LEAD). In the host, the audio data-start is scene_start[i] + PAD_LEAD — the same moment expressed as an absolute timeline position.
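The same model in code, taken directly from the formulas above (the layout name is illustrative; the constants are the documented values):

PAD_LEAD, PAD_TRAIL, OVERLAP = 0.6, 0.9, 0.4

def layout(audio_durations: list[float]) -> tuple[list[float], list[float], float]:
    """Per-scene durations, absolute start offsets, and total length."""
    durations = [PAD_LEAD + a + PAD_TRAIL for a in audio_durations]
    starts = [0.0]
    for d in durations[:-1]:
        starts.append(starts[-1] + d - OVERLAP)   # each crossfade reclaims 0.4s
    total = starts[-1] + durations[-1]
    return durations, starts, total

# Host audio element i gets data-start = starts[i] + PAD_LEAD; inside the
# scene block the same moment is the relative data-start="0.6".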