Project Blueprint

MetaHuman AI Avatar

A fully autonomous conversational AI avatar powered by Unreal Engine, real-time voice synthesis, and large language models. Live stream ready.

End-to-end latency: 1-3 seconds
User Speaks -> STT (Whisper) -> LLM (GPT/Claude/Grok) -> TTS (ElevenLabs/Piper) -> Lip Sync -> Avatar Talks
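
As a rough sketch of how these stages chain together, here is a minimal Python loop with placeholder functions for each stage (function names are illustrative, not the Blueprint implementation):

# Minimal conversational-loop sketch; each placeholder stands in for the
# STT, LLM, TTS, and lip-sync stages listed above.
import time

def transcribe(audio_path: str) -> str:        # STT (e.g. Whisper)
    raise NotImplementedError

def ask_llm(prompt: str) -> str:               # LLM (GPT/Claude/Grok)
    raise NotImplementedError

def synthesize(text: str) -> bytes:            # TTS (ElevenLabs/Piper)
    raise NotImplementedError

def play_with_lipsync(audio: bytes) -> None:   # handed to the UE lip-sync plugin
    raise NotImplementedError

def conversation_turn(audio_path: str) -> None:
    t0 = time.time()
    text = transcribe(audio_path)      # user speech -> text
    reply = ask_llm(text)              # text -> AI response
    audio = synthesize(reply)          # response -> voice audio
    play_with_lipsync(audio)           # audio drives the MetaHuman face
    print(f"end-to-end latency: {time.time() - t0:.1f}s")  # target: 1-3 s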

Hardware

What we have vs what we need

We Have

NVIDIA RTX 3090 (24GB)

Primary MetaHuman GPU. 24GB VRAM handles full Nanite, Lumen, hair. 10,496 CUDA cores. 328 tensor cores for AI inference.

90% - excellent for MetaHuman
We Have

NVIDIA RTX 5060 (8GB)

Secondary GPU. Blackwell architecture, great for dev/testing. 8GB VRAM limits high-fidelity rendering.

60% - good for dev, tight for production
We Have

16GB RAM

Sufficient for UE5 + MetaHuman + single NPC pipeline. Use Low VRAM mode.

85% usage expected
We Have

Hotbot.fun Server

DigitalOcean droplet. Runs Bolt TTS, Piper voices, API routing. Can offload LLM calls.

We Have

Android Phone + Termux

Ryan (Claude) running on mobile. HeyBro for physical automation. Remote control hub.

Recommended

Microphone

USB condenser mic for clean STT input. Blue Yeti or similar. $30-50.

Cloud Option

RunPod GPU Pods (API)

RTX 4090 ($0.34/hr) or A100 80GB ($1.99/hr). Expand VRAM on demand. Pixel Stream to OBS. No hardware purchase needed.

Software Stack

Every component needed to build the avatar

🎮

Unreal Engine 5.4+

Core rendering engine. MetaHuman plugin for photorealistic avatars. Target UE 5.6 for native local TTS/lip-sync.

Free
🎤

Speech-to-Text

Whisper plugin for local GPU-accelerated STT. Or built-in Audio Capture. Converts user speech to text.

Free / Local
🧠

LLM / AI Brain

OpenAI GPT-4o, Anthropic Claude, xAI Grok, or local Ollama (Llama3). VaRest plugin for HTTP API calls.

APIs Ready
🔊

Text-to-Speech

UE5.6 native local TTS (offline). Or ElevenLabs for ultra-realistic voices. Piper TTS already on server.

Piper Ready
👄

Lip Sync

Runtime MetaHuman Lip Sync plugin (free) or NVIDIA Audio2Face. Either auto-generates visemes from audio.

Free
📡

Live Streaming

Offworld Live / Spout2 for transparent background output. OBS Spout2 Capture for streaming.

Free

Unreal Engine Plugins

Enable these in Edit > Plugins

🧬

MetaHuman

Photorealistic digital humans with full facial rigging and body animation.

Built-in
🔗

Live Link

Real-time data streaming for facial motion capture and animation.

Built-in
🌐

VaRest

HTTP/REST API calls from Blueprints. Connect to OpenAI, Anthropic, Grok, Ollama.

Free Marketplace
🎧

Audio Capture

Microphone input capture for real-time speech recognition.

Built-in

Build Steps

From zero to live streaming AI avatar

1

Create MetaHuman

Use MetaHuman Creator (metahuman.unrealengine.com) to design your avatar. Download to UE project via Quixel Bridge.

2

Enable Plugins

Edit > Plugins: Enable MetaHuman, Live Link, Audio Capture, VaRest. Restart UE.

Plugins: MetaHuman, LiveLink, AudioCapture, VaRest, HTTP
3

Setup STT Pipeline

Install Whisper plugin or use Audio Capture. Blueprint: Microphone input -> STT -> Text string. Trigger on voice activity detection.

Microphone -> Whisper (Local GPU) -> "What's the weather?"
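To sanity-check the local Whisper model outside UE before wiring up the plugin, here is a minimal sketch using the open-source openai-whisper Python package (the WAV filename is a placeholder):

# Transcribe a captured clip with local Whisper (uses the 3090's GPU if CUDA
# is available). "base" is a small model; swap for "medium" or "large".
import whisper

model = whisper.load_model("base")
result = model.transcribe("mic_capture.wav")
print(result["text"])   # e.g. "What's the weather?"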
4

Connect LLM Brain

Use VaRest to make an HTTP POST request to your AI provider. Send the STT text, receive the response text.

POST https://api.openai.com/v1/chat/completions
{"model": "gpt-4o", "messages": [{"role": "user", "content": stt_text}]}
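The same request the VaRest node sends can be tested from Python first; a minimal sketch assuming OPENAI_API_KEY is set in the environment:

# Send the STT text to GPT-4o and print the reply.
import os
import requests

stt_text = "What's the weather?"
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-4o",
          "messages": [{"role": "user", "content": stt_text}]},
    timeout=30,
)
reply = resp.json()["choices"][0]["message"]["content"]
print(reply)   # this string goes on to the TTS step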
5

Generate Voice

UE5.6: Window > Audio > Text to Speech (offline GPU). Or API call to ElevenLabs. Response text -> Audio clip.

LLM Response -> TTS Engine -> WAV Audio -> Play Sound Component
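If ElevenLabs is used instead of the native UE5.6 TTS, the request looks roughly like this (a sketch; the voice ID is a placeholder and the endpoint and fields should be checked against the current ElevenLabs API docs):

# Request speech audio from ElevenLabs and save it for the Play Sound step.
import os
import requests

voice_id = "YOUR_VOICE_ID"   # placeholder, taken from the ElevenLabs dashboard
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": "Hello, I'm your MetaHuman."},
    timeout=60,
)
with open("reply.mp3", "wb") as f:
    f.write(resp.content)    # audio bytes are returned directly in the body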
6

Lip Sync Animation

Runtime MetaHuman Lip Sync plugin: Audio -> Auto visemes -> Drive MetaHuman face. Add idle animations and head nods for realism.

TTS Audio -> Lip Sync Plugin -> MetaHuman ARKit Face -> Realistic Mouth Movement
7

Go Live

Offworld Live / Spout2 for transparent background. Package as .exe. OBS Spout2 Capture -> Stream to YouTube/Twitch.

UE5 Render -> Spout2 -> OBS -> YouTube Live (1080p/60fps)

Performance Targets

RTX 3090 (local) + RunPod cloud GPUs (on demand)

Local RTX 3090

24GB VRAM handles full MetaHuman with Nanite and Lumen enabled. Run STT + LLM inference alongside rendering.

🏠

Cloud Burst Mode

Need more power? RunPod API spins up 80GB A100 in seconds. Pixel Stream to OBS. Shut down when done. $0.34-1.99/hr.

🎯

Target Specs

1080p / 60fps render. 1-3 second end-to-end latency. Single MetaHuman NPC. Real-time voice interaction.

Cloud GPU Strategy

Scalable VRAM on demand via RunPod API - pay by the second, not the month

RunPod GPU Pod -> UE5 + MetaHuman Render -> Pixel Streaming (WebRTC) -> Local Browser -> OBS Capture -> YouTube / Twitch

Why Cloud GPU?

Our RTX 5060 has 8GB VRAM - tight for MetaHumans with Nanite, Lumen, hair physics. Cloud GPUs give us 24-80GB VRAM on demand without buying hardware. Spin up when streaming, shut down when done.

Expandable VRAM
🚀

RunPod REST API

Programmatic GPU management. Launch pods, scale, and monitor via API. Per-second billing. Deploy custom Docker containers with UE5 packaged builds. Auto-shutdown when idle.

curl -X POST https://api.runpod.ai/v2/pods \
  -H "Authorization: Bearer $RUNPOD_KEY" \
  -d '{"gpu": "NVIDIA RTX 4090", "image": "ue5-metahuman:latest"}'
API Driven
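For Ryan's automation scripts, a minimal Python wrapper around the same call (the endpoint and payload mirror the curl above; the stop route is an assumption and should be verified against RunPod's API docs):

# Launch and later terminate a RunPod GPU pod.
import os
import requests

API = "https://api.runpod.ai/v2/pods"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_KEY']}"}

def launch_pod(gpu: str = "NVIDIA RTX 4090",
               image: str = "ue5-metahuman:latest") -> dict:
    resp = requests.post(API, headers=HEADERS,
                         json={"gpu": gpu, "image": image}, timeout=30)
    resp.raise_for_status()
    return resp.json()     # expected to contain the pod id / Pixel Stream URL

def stop_pod(pod_id: str) -> None:
    # assumed DELETE-by-id pattern; confirm the real termination endpoint
    requests.delete(f"{API}/{pod_id}", headers=HEADERS, timeout=30).raise_for_status()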
🎥

Pixel Streaming + OBS

UE5 Pixel Streaming sends rendered frames via WebRTC to any browser. Capture in OBS as Browser Source or Window Capture. Start OBS Virtual Camera to pipe to Zoom/Discord/YouTube.

Built into UE5

GPU Pricing Tiers

GPU        | VRAM  | $/Hour | Best For                        | Daily (8hr)
RTX 4090   | 24 GB | $0.34  | MetaHuman 1080p, single NPC     | $2.72
A100 80GB  | 80 GB | $1.99  | 4K, multiple NPCs, Nanite+Lumen | $15.92
H100       | 80 GB | $1.99  | Max performance, AI + render    | $15.92
RTX 3090   | 24 GB | $0.22  | Budget MetaHuman rendering      | $1.76
Local 5060 | 8 GB  | FREE   | Development, low-res testing    | $0

Cost Comparison: Buy vs Rent

Local (Buy Hardware)

  • RTX 4090: $2,000-2,800 upfront
  • Zero ongoing GPU cost
  • 24GB VRAM always available
  • Break-even vs cloud at $0.34/hr: roughly 5,900-8,200 hours (see the quick calculation after this comparison)
  • No latency from streaming
  • Full offline capability

Cloud (RunPod Rental)

  • $0 upfront investment
  • RTX 4090: $0.34/hr (~$10/month casual)
  • Scale to 80GB VRAM (A100) instantly
  • Multi-GPU with NVLink available
  • Upgrade GPUs as new ones release
  • API-driven automation from Ryan
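
A quick sanity check on the break-even point, using the prices listed above:

# Hours of RunPod RTX 4090 rental that equal the cost of buying the card.
card_price_low, card_price_high = 2000, 2800   # USD, RTX 4090 range above
cloud_rate = 0.34                              # USD/hr, RunPod RTX 4090

print(card_price_low / cloud_rate)    # ~5,882 hours
print(card_price_high / cloud_rate)   # ~8,235 hours
# At 8 streaming hours per day, that is roughly 2-3 years of daily use.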

Integration Architecture

A

Ryan (Termux) Controls Everything

Ryan uses the RunPod API to spin up GPU pods, deploy MetaHuman builds, start Pixel Streaming, and tear everything down when done. Full automation from the phone. Bolt can trigger cloud sessions via voice command during a call.

# Ryan spins up MetaHuman cloud session
ssh forge@hotbot.fun 'curl -X POST https://api.runpod.ai/v2/pods \
  -H "Authorization: Bearer $RUNPOD_KEY" \
  -d "{\"gpu\":\"NVIDIA RTX 4090\",\"image\":\"ue5-meta:latest\"}"'
# Pixel stream URL returned -> OBS Browser Source
# Stream goes live on YouTube
B

Hybrid Mode: Local Dev + Cloud Stream

Develop and test on local 5060 (free). When ready to go live at full quality, push the packaged build to RunPod and stream at 1080p/60fps with 24-80GB VRAM. Best of both worlds.

C

OBS Virtual Camera Pipeline

Pixel Stream from RunPod renders in browser. OBS captures browser as source. OBS Virtual Camera outputs to Zoom, Discord, Google Meet. Your AI MetaHuman joins meetings as a "webcam."

RunPod (RTX 4090) -> Pixel Stream -> Browser -> OBS Capture -> Virtual Camera -> Zoom/Discord

How We Make Episodes

Our fully automated AI Zone YouTube episode pipeline — from idea to published video

Episode Script -> Grok Image Gen -> Kokoro TTS -> FFmpeg Concat -> FFmpeg Video -> YouTube Upload

Tools & APIs We Use

🐍

Python + build_all_episodes.py

Master script that orchestrates the entire pipeline. Contains all episode data: titles, 9-part scripts, slide prompts, outros, and subscribe text. Runs on the Hotbot.fun server at /home/forge/youtube/.

Running
🎨

xAI Grok Imagine API

Generates 9 cinematic slide images per episode using the grok-2-image-1212 model. Each prompt creates photorealistic AI art at 1920x1080 resolution. Replaced older Pollinations approach for better quality.

API
🎙

Kokoro TTS (10 Voices)

Local text-to-speech engine with 10 voice actors: Fenrir, Bella, Michael, Nova, Adam, Sarah, Eric, Emma, Onyx, and Heart. Each episode section uses a different voice for variety. 24kHz PCM output.

Free / Local
🎬

FFmpeg

Audio concatenation with 0.3-second gaps between sections, and final video assembly. Combines slideshow images with full audio track. H.264 encoding, AAC 128k audio, 1920x1080 at 30fps.

Free / Local
📺

YouTube Data API v3

Automated upload via our youtube_content.js service on port 8774. OAuth2 authentication, playlist management, thumbnail uploads, and SEO metadata generation.

Running
🧠

OpenAI GPT-4o-mini + DALL-E 3

GPT-4o-mini generates SEO-optimized titles, descriptions, and tags via the /generate endpoint. DALL-E 3 creates custom thumbnails via the /thumbnail endpoint. Integrated into youtube_content.js.

APIs Ready

Step-by-Step Episode Pipeline

1

Write the Episode Script

Each episode lives in the EPISODES dictionary inside build_all_episodes.py. Every episode has a title, 9 main content parts (each a self-contained paragraph), an intro using the iconic AI Zone opening, an outro, a subscribe call-to-action, and a tease for the next episode. The script is pure Python — no external content needed.

EPISODES = {
    7: {
        "title": "Can AI Feel Emotions?",
        "main_parts": ["What if the voice...", "Welcome to Episode 7...", ...],
        "slide_prompts": ["cinematic robot face with tear...", ...],
        "outro": "We now return control of your mind...",
        "subscribe": "If this made you think, hit subscribe..."
    }
}
2

Generate 9 Cinematic Slide Images

The grok_slides.py script sends each slide prompt to the xAI Grok Imagine API. The model grok-2-image-1212 generates photorealistic images which are downloaded, resized to 1920x1080 using PIL, and saved as slides/slide01.jpg through slides/slide09.jpg in each episode directory.

# Grok API call for each slide
POST https://api.x.ai/v1/images/generations
{"model": "grok-2-image-1212", "prompt": "cinematic robot face...", "n": 1}
# Response: base64 image -> decode -> resize 1920x1080 -> save
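A per-prompt sketch of what grok_slides.py does in Python (the endpoint and model come from the snippet above; the response field names and output path are assumptions to verify against the actual xAI response schema):

# Call the xAI image API, decode the base64 image, resize, and save a slide.
import base64
import io
import os
import requests
from PIL import Image

def make_slide(prompt: str, out_path: str) -> None:
    resp = requests.post(
        "https://api.x.ai/v1/images/generations",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        json={"model": "grok-2-image-1212", "prompt": prompt, "n": 1,
              "response_format": "b64_json"},
        timeout=120,
    )
    b64 = resp.json()["data"][0]["b64_json"]        # assumed response shape
    img = Image.open(io.BytesIO(base64.b64decode(b64)))
    img.convert("RGB").resize((1920, 1080)).save(out_path, "JPEG")

make_slide("cinematic robot face with tear...", "slides/slide01.jpg")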
3

Generate Voice Audio with Kokoro TTS

Each of the 9 main sections, plus the intro, outro, and subscribe text, gets its own voice file using a different Kokoro voice. The script writes text to files (intro_text.txt, main_text.txt, etc.) then runs Kokoro TTS to generate WAV audio at 24kHz. 10 voices rotate: am_fenrir, af_bella, am_michael, af_nova, am_adam, af_sarah, am_eric, bf_emma, am_onyx, af_heart.

# Kokoro generates each section with a different voice
kokoro --voice am_fenrir --output intro_fenrir.wav "There is a fifth dimension..."
kokoro --voice af_bella --output main1_bella.wav "What if the voice..."
# ... 9 more sections with rotating voices
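The voice rotation can be scripted like this (a sketch that shells out to the same kokoro CLI form shown above; section texts and output names are illustrative):

# Rotate the 10 Kokoro voices across the intro and main sections.
import subprocess

VOICES = ["am_fenrir", "af_bella", "am_michael", "af_nova", "am_adam",
          "af_sarah", "am_eric", "bf_emma", "am_onyx", "af_heart"]

sections = ["There is a fifth dimension...", "What if the voice..."]  # etc.
for i, text in enumerate(sections):
    voice = VOICES[i % len(VOICES)]
    out = f"section{i:02d}_{voice}.wav"
    subprocess.run(["kokoro", "--voice", voice, "--output", out, text], check=True)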
4

Concatenate Audio with Gaps

FFmpeg concat demuxer joins all voice WAV files in order, inserting a 0.3-second silence (gap.wav) between each section. This creates the full episode audio track: kokoro_full.wav. A concat list file tells FFmpeg the exact order and timing.

# voice_concat.txt
file 'intro_fenrir.wav'
file 'gap.wav'
file 'main1_bella.wav'
file 'gap.wav'
...
# FFmpeg concat
ffmpeg -f concat -safe 0 -i voice_concat.txt -c copy kokoro_full.wav
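Generating that concat list programmatically is a few lines of Python (a sketch; the WAV filenames are illustrative):

# Build voice_concat.txt with gap.wav between sections, then run FFmpeg.
import subprocess

wavs = ["intro_fenrir.wav", "main1_bella.wav", "main2_michael.wav"]  # etc.
with open("voice_concat.txt", "w") as f:
    for i, wav in enumerate(wavs):
        if i:
            f.write("file 'gap.wav'\n")   # 0.3 s silence between sections
        f.write(f"file '{wav}'\n")

subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", "voice_concat.txt", "-c", "copy", "kokoro_full.wav"],
               check=True)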
5

Build Final Video (Slideshow + Audio)

FFmpeg creates the final MP4 by combining the slide images with the full audio track. Each slide displays for the duration of its corresponding audio section. Output: 1920x1080, H.264 video codec, AAC 128k audio, 30fps. Result is a polished episodeN_kokoro.mp4.

# slideshow_kokoro.txt tells FFmpeg how long each slide displays
file 'slides/slide01.jpg'
duration 15.2
file 'slides/slide02.jpg'
duration 12.8
...
# Build video
ffmpeg -f concat -safe 0 -i slideshow_kokoro.txt \
  -i kokoro_full.wav -c:v libx264 -c:a aac -b:a 128k \
  -pix_fmt yuv420p -shortest episode7_kokoro.mp4
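The per-slide durations come from the length of each section's audio; a sketch of deriving them with the standard-library wave module (file names and the extra 0.3 s for the gap are illustrative):

# Write slideshow_kokoro.txt with one duration per slide, matched to the WAVs.
import wave

def wav_seconds(path: str) -> float:
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

sections = [("slides/slide01.jpg", "main1_bella.wav"),
            ("slides/slide02.jpg", "main2_michael.wav")]  # etc.
with open("slideshow_kokoro.txt", "w") as f:
    for slide, audio in sections:
        f.write(f"file '{slide}'\nduration {wav_seconds(audio) + 0.3:.1f}\n")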
6

SEO Optimization & YouTube Upload

Our youtube_content.js automation service (port 8774) handles the final mile. The /full-package endpoint uses GPT-4o-mini to generate an SEO-optimized title, description with keywords, and tags. DALL-E 3 generates a custom thumbnail. The video is uploaded via YouTube Data API v3 with OAuth2 authentication and added to the AI Zone playlist.

# Generate SEO + thumbnail + upload
curl -X POST localhost:8774/full-package \
  -d '{"topic":"Can AI Feel Emotions?"}'
# Returns: title, description, tags, thumbnail URL
# Upload to YouTube (automated via API)
# -> Video goes to AI Shorts playlist: PLjz4hAuc5RCoknWSDHlVK1ADWKzM64Udf
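In production the upload happens inside youtube_content.js; for reference, the equivalent upload step in Python with google-api-python-client looks roughly like this (creds is an already-authorized OAuth2 credentials object, the OAuth2 flow is not shown, and the function name is illustrative):

# Upload a finished episode via the YouTube Data API v3.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload(creds, video_path: str, title: str, description: str, tags: list) -> str:
    youtube = build("youtube", "v3", credentials=creds)
    request = youtube.videos().insert(
        part="snippet,status",
        body={
            "snippet": {"title": title, "description": description, "tags": tags},
            "status": {"privacyStatus": "public"},
        },
        media_body=MediaFileUpload(video_path, resumable=True),
    )
    return request.execute()["id"]   # YouTube video ID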

Season 1: The AI Zone (Episodes 7-14)

Episode | Title                          | Slides | Voices    | Status
7       | Can AI Feel Emotions?          | 9      | 10 actors | Built
8       | Will AI Take Your Job?         | 9      | 10 actors | Built
9       | Should AI Have Rights?         | 9      | 10 actors | Built
10      | Can AI Be Your Best Friend?    | 9      | 10 actors | Built
11      | Is AI Watching You Right Now?  | 9      | 10 actors | Built
12      | Can AI Predict the Future?     | 9      | 10 actors | Built
13      | Will AI Become Conscious?      | 9      | 10 actors | Built
14      | Can AI Fall in Love? (Finale)  | 9      | 10 actors | Built

File Locations

Server (Hotbot.fun)

  • Episode builder: /home/forge/youtube/build_all_episodes.py
  • Image generator: /home/forge/youtube/grok_slides.py
  • YouTube automation: /home/forge/hotbot-ivr/youtube_content.js
  • Episode files: /home/forge/youtube/episode{7-14}/
  • Each episode dir: slides/, audio WAVs, final MP4

Local (Ryan - Termux)

  • Built episodes: ~/youtube_episodes/episode{7-14}_kokoro.mp4
  • Episode builder copy: ~/build_all_episodes.py
  • AI Shorts skill: ~/.claude/commands/ai-short.md
  • Wingman upload: ~/wingman-upload.sh

Resources & Tutorials

Learn from the best

🎬

Voice AI NPCs Full Demo

YouTube by Georgy Dev - Complete speech-to-speech MetaHuman walkthrough with Blueprints.

Free Tutorial
📖

Local TTS Lip Sync

Unreal Engine Forums tutorial - Set up local TTS with automatic lip synchronization.

Free Tutorial
🎙

ElevenLabs Real-Time

Step-by-step integration guide for ultra-realistic AI voices with MetaHuman.

Free Tutorial
💰

Runtime AI Chatbot Integrator

UE Marketplace plugin. One-click AI chatbot integration for MetaHumans. $20-50.

Optional Paid

Our Existing Infrastructure

What we've already built at Hotbot.fun

🤖

Bolt - Phone AI

Voice AI on Twilio. Piper TTS (lessac voice). Handles calls, responds with AI. Already running 24/7.

Running
📱

Ryan - Mobile AI

Claude on Termux Android. Controls HeyBro for phone automation. Can send commands to any app.

Running
🎭

HeyBro - Phone Agent

Android accessibility AI agent. Taps, types, navigates. Controlled remotely via broadcast intents.

Running
🔊

Piper TTS Server

Multiple voice models ready. Amy, Lessac voices. Fast local TTS generation on server.

Running