A fully autonomous conversational AI avatar powered by Unreal Engine, real-time voice synthesis, and large language models. Live stream ready.
What we have vs what we need
RTX 3090 (primary MetaHuman GPU). 24GB VRAM handles full Nanite, Lumen, and hair. 10,496 CUDA cores, 328 tensor cores for AI inference. (90% - excellent for MetaHuman)
RTX 5060 (secondary GPU). Blackwell architecture, great for dev/testing. 8GB VRAM limits high-fidelity rendering. (60% - good for dev, tight for production)
Sufficient for UE5 + MetaHuman + a single-NPC pipeline; use Low VRAM mode. (85% usage expected)
DigitalOcean droplet. Runs Bolt TTS, Piper voices, and API routing. Can offload LLM calls.
Ryan (Claude) running on mobile. HeyBro for physical automation. Remote control hub.
USB condenser mic for clean STT input. Blue Yeti or similar, $30-50.
RunPod cloud GPUs: RTX 4090 ($0.34/hr) or A100 80GB ($1.99/hr). Expand VRAM on demand. Pixel Stream to OBS. No hardware purchase needed.
Every component needed to build the avatar
Unreal Engine 5: core rendering engine. MetaHuman plugin for photorealistic avatars. Target UE 5.6 for native local TTS/lip-sync. (Free)
Speech-to-text: Whisper plugin for local GPU-accelerated STT, or built-in Audio Capture. Converts user speech to text. (Free / Local)
LLM brain: OpenAI GPT-4o, Anthropic Claude, xAI Grok, or local Ollama (Llama 3). VaRest plugin for HTTP API calls. (APIs ready)
Text-to-speech: UE 5.6 native local TTS (offline), or ElevenLabs for ultra-realistic voices. Piper TTS already on server. (Piper ready)
Lip sync: Runtime MetaHuman Lip Sync plugin (free), or NVIDIA Audio2Face. Auto-generates visemes from audio. (Free)
Streaming output: Offworld Live / Spout2 for transparent background output. OBS Spout2 Capture for streaming. (Free)
Enable these in Edit > Plugins
MetaHuman: photorealistic digital humans with full facial rigging and body animation. (Built-in)
Live Link: real-time data streaming for facial motion capture and animation. (Built-in)
VaRest: HTTP/REST API calls from Blueprints. Connect to OpenAI, Anthropic, Grok, Ollama. (Free, Marketplace)
Audio Capture: microphone input capture for real-time speech recognition. (Built-in)
From zero to live streaming AI avatar
Use MetaHuman Creator (metahuman.unrealengine.com) to design your avatar. Download to UE project via Quixel Bridge.
Edit > Plugins: Enable MetaHuman, Live Link, Audio Capture, VaRest. Restart UE.
Install Whisper plugin or use Audio Capture. Blueprint: Microphone input -> STT -> Text string. Trigger on voice activity detection.
Use VaRest to make an HTTP POST to your AI provider. Send the STT text, receive the response text.
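For reference, here is roughly the HTTP round-trip that VaRest performs from Blueprints, as a minimal Python sketch against the OpenAI chat completions endpoint. The model name, system prompt, and environment variable are illustrative; swap in Anthropic, Grok, or a local Ollama URL as needed.

```python
# Minimal sketch of the LLM round-trip that VaRest performs from Blueprints.
# Endpoint/model/prompt are examples, not the project's exact configuration.
import os
import requests

def ask_llm(user_text: str) -> str:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "You are a friendly MetaHuman avatar."},
                {"role": "user", "content": user_text},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask_llm("Hello, who are you?"))
```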
UE5.6: Window > Audio > Text to Speech (offline GPU). Or API call to ElevenLabs. Response text -> Audio clip.
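Since Piper is already on the server, the reply text can also be rendered to audio there. A minimal sketch using the piper CLI via subprocess; the voice model filename is an assumption, use whichever .onnx voice is installed.

```python
# Sketch: render the LLM reply text to a WAV with the Piper CLI already on the server.
# The model path is an assumption -- point it at the installed .onnx voice.
import subprocess

def speak(text: str, out_path: str = "reply.wav") -> str:
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", out_path],
        input=text.encode("utf-8"),  # piper reads the text from stdin
        check=True,
    )
    return out_path

speak("Hi there, I'm your MetaHuman avatar.")
```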
Runtime MetaHuman Lip Sync plugin: Audio -> Auto visemes -> Drive MetaHuman face. Add idle animations and head nods for realism.
Offworld Live / Spout2 for transparent background. Package as .exe. OBS Spout2 Capture -> Stream to YouTube/Twitch.
RTX 3090 (local) + RunPod cloud GPUs (on demand)
24GB VRAM handles full MetaHuman with Nanite and Lumen enabled. Run STT + LLM inference alongside rendering.
Need more power? RunPod API spins up 80GB A100 in seconds. Pixel Stream to OBS. Shut down when done. $0.34-1.99/hr.
1080p / 60fps render. 1-3 second end-to-end latency. Single MetaHuman NPC. Real-time voice interaction.
Scalable VRAM on demand via RunPod API - pay by the second, not the month
Our RTX 5060 has 8GB VRAM - tight for MetaHumans with Nanite, Lumen, and hair physics. Cloud GPUs give us 24-80GB VRAM on demand without buying hardware. Spin up when streaming, shut down when done. (Expandable VRAM)
Programmatic GPU management. Launch pods, scale, and monitor via API. Per-second billing. Deploy custom Docker containers with UE5 packaged builds. Auto-shutdown when idle.
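A minimal sketch of that spin-up/tear-down cycle using the runpod Python SDK; the container image name and GPU type string are assumptions, check RunPod's catalog for the exact values.

```python
# Sketch: launch a GPU pod for a streaming session, then terminate it so per-second billing stops.
# Image name and gpu_type_id are placeholders; real values come from RunPod's catalog.
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

pod = runpod.create_pod(
    name="metahuman-stream",
    image_name="ghcr.io/example/ue5-metahuman:latest",  # hypothetical packaged-build image
    gpu_type_id="NVIDIA GeForce RTX 4090",
    gpu_count=1,
    ports="80/http,8888/tcp",
)
print("pod id:", pod["id"])

# ... run the stream, then shut the pod down when done
runpod.terminate_pod(pod["id"])
```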
UE5 Pixel Streaming sends rendered frames via WebRTC to any browser. Capture in OBS as a Browser Source or Window Capture. Start OBS Virtual Camera to pipe the feed to Zoom/Discord/YouTube. (Built into UE5)
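A hedged sketch of launching the packaged build on the pod with Pixel Streaming pointed at the signalling server. The executable name is a placeholder and the exact flags vary by UE version (older builds use -PixelStreamingIP/-PixelStreamingPort instead of -PixelStreamingURL).

```python
# Sketch: start the packaged UE build headless with Pixel Streaming aimed at the signalling server.
# Executable name is a placeholder; flag names differ between UE versions.
import subprocess

subprocess.Popen([
    "./MetaHumanAvatar.sh",                    # packaged Linux build (placeholder name)
    "-PixelStreamingURL=ws://127.0.0.1:8888",  # signalling server endpoint
    "-RenderOffScreen",                        # no local window on the cloud pod
    "-AudioMixer",                             # keep audio so the voice reaches the stream
])
```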
| GPU | VRAM | $/Hour | Best For | Daily (8hr) |
|---|---|---|---|---|
| RTX 4090 | 24 GB | $0.34 | MetaHuman 1080p, single NPC | $2.72 |
| A100 80GB | 80 GB | $1.99 | 4K, multiple NPCs, Nanite+Lumen | $15.92 |
| H100 | 80 GB | $1.99 | Max performance, AI + render | $15.92 |
| RTX 3090 | 24 GB | $0.22 | Budget MetaHuman rendering | $1.76 |
| Local 5060 | 8 GB | FREE | Development, low-res testing | $0 |
Ryan uses RunPod API to spin up GPU pods, deploy MetaHuman builds, start Pixel Streaming, and tear down when done. Full automation from phone. Bolt can trigger cloud sessions via voice command on call.
Develop and test on local 5060 (free). When ready to go live at full quality, push the packaged build to RunPod and stream at 1080p/60fps with 24-80GB VRAM. Best of both worlds.
Pixel Stream from RunPod renders in browser. OBS captures browser as source. OBS Virtual Camera outputs to Zoom, Discord, Google Meet. Your AI MetaHuman joins meetings as a "webcam."
Our fully automated AI Zone YouTube episode pipeline — from idea to published video
Master script that orchestrates the entire pipeline. Contains all episode data: titles, 9-part scripts, slide prompts, outros, and subscribe text. Runs on the Hotbot.fun server at /home/forge/youtube/.
Generates 9 cinematic slide images per episode using the grok-2-image-1212 model. Each prompt creates photorealistic AI art at 1920x1080 resolution. Replaced older Pollinations approach for better quality.
Local text-to-speech engine with 10 voice actors: Fenrir, Bella, Michael, Nova, Adam, Sarah, Eric, Emma, Onyx, and Heart. Each episode section uses a different voice for variety. 24kHz PCM output. (Free / Local)
Audio concatenation with 0.3-second gaps between sections, and final video assembly. Combines slideshow images with the full audio track. H.264 encoding, AAC 128k audio, 1920x1080 at 30fps. (Free / Local)
Automated upload via our youtube_content.js service on port 8774. OAuth2 authentication, playlist management, thumbnail uploads, and SEO metadata generation.
GPT-4o-mini generates SEO-optimized titles, descriptions, and tags via the /generate endpoint. DALL-E 3 creates custom thumbnails via the /thumbnail endpoint. Integrated into youtube_content.js.
Each episode lives in the EPISODES dictionary inside build_all_episodes.py. Every episode has a title, 9 main content parts (each a self-contained paragraph), an intro using the iconic AI Zone opening, an outro, a subscribe call-to-action, and a tease for the next episode. The script is pure Python — no external content needed.
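The shape of that dictionary, sketched below; the field names are illustrative, only the structure described above (title, intro, 9 parts, slide prompts, outro, subscribe, tease) is taken from the pipeline itself.

```python
# Illustrative shape of the EPISODES dictionary in build_all_episodes.py.
# Field names are assumptions; only the structure described above is guaranteed.
EPISODES = {
    7: {
        "title": "Can AI Feel Emotions?",
        "intro": "Welcome to the AI Zone...",          # iconic opening
        "parts": [
            "Part 1: a self-contained paragraph...",
            # ... 9 main content parts in total
        ],
        "slide_prompts": [
            "Cinematic photorealistic shot of ...",    # 9 prompts, one per slide
        ],
        "outro": "That's all for today...",
        "subscribe": "Like and subscribe for more AI Zone.",
        "tease": "Next time: Will AI Take Your Job?",
    },
    # ... episodes 8-14 follow the same shape
}
```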
The grok_slides.py script sends each slide prompt to the xAI Grok Imagine API. The model grok-2-image-1212 generates photorealistic images which are downloaded, resized to 1920x1080 using PIL, and saved as slides/slide01.jpg through slides/slide09.jpg in each episode directory.
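A minimal sketch of that loop, assuming xAI's OpenAI-compatible /images/generations endpoint on api.x.ai; the request and response field names may differ from what grok_slides.py actually uses.

```python
# Sketch of the slide loop in grok_slides.py: prompt -> image URL -> 1920x1080 JPEG on disk.
# Assumes xAI's OpenAI-compatible images endpoint; exact fields are an assumption.
import os
from io import BytesIO

import requests
from PIL import Image

def make_slide(prompt: str, out_path: str) -> None:
    resp = requests.post(
        "https://api.x.ai/v1/images/generations",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        json={"model": "grok-2-image-1212", "prompt": prompt, "response_format": "url"},
        timeout=120,
    )
    resp.raise_for_status()
    image_url = resp.json()["data"][0]["url"]

    img = Image.open(BytesIO(requests.get(image_url, timeout=120).content))
    img.convert("RGB").resize((1920, 1080)).save(out_path, "JPEG", quality=90)

slide_prompts = ["Cinematic photorealistic robot hand reaching toward a human hand"]  # 9 per episode
for i, prompt in enumerate(slide_prompts, start=1):
    make_slide(prompt, f"slides/slide{i:02d}.jpg")
```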
Each of the 9 main sections, plus the intro, outro, and subscribe text, gets its own voice file using a different Kokoro voice. The script writes text to files (intro_text.txt, main_text.txt, etc.) then runs Kokoro TTS to generate WAV audio at 24kHz. 10 voices rotate: am_fenrir, af_bella, am_michael, af_nova, am_adam, af_sarah, am_eric, bf_emma, am_onyx, af_heart.
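A sketch of the per-section synthesis, assuming the kokoro Python package's KPipeline interface; the actual script may invoke Kokoro differently (e.g., via CLI), and the section file names are placeholders.

```python
# Sketch: one 24 kHz WAV per section, rotating through the 10 Kokoro voices.
# Assumes the `kokoro` package's KPipeline API; the real script may shell out instead.
import numpy as np
import soundfile as sf
from kokoro import KPipeline

VOICES = ["am_fenrir", "af_bella", "am_michael", "af_nova", "am_adam",
          "af_sarah", "am_eric", "bf_emma", "am_onyx", "af_heart"]

pipeline = KPipeline(lang_code="a")  # American English

def synth_section(text: str, index: int, out_path: str) -> None:
    voice = VOICES[index % len(VOICES)]                       # rotate voices per section
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice=voice)]
    sf.write(out_path, np.concatenate(chunks), 24000)         # 24 kHz PCM WAV

synth_section("Welcome back to the AI Zone.", 0, "intro.wav")
```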
FFmpeg concat demuxer joins all voice WAV files in order, inserting a 0.3-second silence (gap.wav) between each section. This creates the full episode audio track: kokoro_full.wav. A concat list file tells FFmpeg the exact order and timing.
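A sketch of that concat step with hypothetical section file names; gap.wav is generated once as 0.3 seconds of silence, and the list file follows FFmpeg's concat demuxer syntax.

```python
# Sketch: build the concat list (voice, gap, voice, ...) and join everything into kokoro_full.wav.
# Section file names are placeholders for the per-section WAVs produced above.
import subprocess

sections = ["intro.wav", "part01.wav", "part02.wav", "outro.wav", "subscribe.wav"]  # example order

# 0.3-second silence at 24 kHz mono, matching the Kokoro output format.
subprocess.run(["ffmpeg", "-y", "-f", "lavfi", "-i",
                "anullsrc=r=24000:cl=mono", "-t", "0.3", "gap.wav"], check=True)

with open("concat_list.txt", "w") as f:
    for i, wav in enumerate(sections):
        if i:
            f.write("file 'gap.wav'\n")   # gap between sections, not before the first
        f.write(f"file '{wav}'\n")

subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "concat_list.txt", "kokoro_full.wav"], check=True)
```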
FFmpeg creates the final MP4 by combining the slide images with the full audio track. Each slide displays for the duration of its corresponding audio section. Output: 1920x1080, H.264 video codec, AAC 128k audio, 30fps. Result is a polished episodeN_kokoro.mp4.
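A sketch of the assembly step, assuming per-slide durations have already been measured from the section WAVs; the concat demuxer's duration entries control how long each slide stays on screen, and the durations shown are placeholders.

```python
# Sketch: pair each slide with its section duration, then mux with the full audio track.
# Durations are placeholders; the real script derives them from the per-section WAV lengths.
import subprocess

slide_durations = [("slides/slide01.jpg", 42.0), ("slides/slide02.jpg", 38.5)]  # ... 9 slides

with open("slides.txt", "w") as f:
    for path, seconds in slide_durations:
        f.write(f"file '{path}'\nduration {seconds}\n")
    f.write(f"file '{slide_durations[-1][0]}'\n")  # concat demuxer quirk: repeat the last image

subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "slides.txt",
    "-i", "kokoro_full.wav",
    "-c:v", "libx264", "-r", "30", "-s", "1920x1080", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-b:a", "128k", "-shortest",
    "episode7_kokoro.mp4",
], check=True)
```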
Our youtube_content.js automation service (port 8774) handles the final mile. The /full-package endpoint uses GPT-4o-mini to generate an SEO-optimized title, description with keywords, and tags. DALL-E 3 generates a custom thumbnail. The video is uploaded via YouTube Data API v3 with OAuth2 authentication and added to the AI Zone playlist.
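Triggering that from the pipeline is a single HTTP call to the local service; the JSON field names below are assumptions about what /full-package expects, only the port and endpoint come from the description above.

```python
# Sketch: hand the finished MP4 to the youtube_content.js service on port 8774.
# Payload field names are assumptions; the episode path is illustrative.
import requests

resp = requests.post(
    "http://localhost:8774/full-package",
    json={
        "video_path": "/home/forge/youtube/episode7/episode7_kokoro.mp4",  # hypothetical path
        "topic": "Can AI Feel Emotions?",
        "playlist": "AI Zone",
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())  # title, description, tags, thumbnail, and YouTube video ID
```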
| Episode | Title | Slides | Voices | Status |
|---|---|---|---|---|
| 7 | Can AI Feel Emotions? | 9 | 10 actors | Built |
| 8 | Will AI Take Your Job? | 9 | 10 actors | Built |
| 9 | Should AI Have Rights? | 9 | 10 actors | Built |
| 10 | Can AI Be Your Best Friend? | 9 | 10 actors | Built |
| 11 | Is AI Watching You Right Now? | 9 | 10 actors | Built |
| 12 | Can AI Predict the Future? | 9 | 10 actors | Built |
| 13 | Will AI Become Conscious? | 9 | 10 actors | Built |
| 14 | Can AI Fall in Love? (Finale) | 9 | 10 actors | Built |
Learn from the best
YouTube tutorial by Georgy Dev - complete speech-to-speech MetaHuman walkthrough with Blueprints. (Free tutorial)
Unreal Engine Forums tutorial - set up local TTS with automatic lip synchronization. (Free tutorial)
Step-by-step integration guide for ultra-realistic AI voices with MetaHuman. (Free tutorial)
UE Marketplace plugin. One-click AI chatbot integration for MetaHumans. $20-50. (Optional, paid)
What we've already built at Hotbot.fun
Bolt: voice AI on Twilio. Piper TTS (lessac voice). Handles calls, responds with AI. Already running 24/7. (Running)
Ryan: Claude on Termux Android. Controls HeyBro for phone automation. Can send commands to any app. (Running)
HeyBro: Android accessibility AI agent. Taps, types, navigates. Controlled remotely via broadcast intents. (Running)
Piper TTS: multiple voice models ready. Amy and Lessac voices. Fast local TTS generation on the server. (Running)