Type a script. Pick a voice. Export a TikTok-ready vertical video with burned-in word-by-word captions that hit the beat. All on-device. No account. No watermark.
American and British English plus seven more languages. Curated, character-driven, ready to ship.
Whisper-base transcribes the generated audio back to find exact word timings, so captions highlight in perfect sync with the voice.
Six preset looks covering common creator aesthetics — bold pop, minimal underline, kinetic one-word, and more. Burned into the vertical export.
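For readers curious how word-by-word highlighting can follow the voice, here is a minimal sketch. It assumes the alignment step yields a list of words with start and end times in seconds. The WordTiming shape and the activeWordIndex helper are illustrative names, not PixVoice's actual code.

```typescript
// Hypothetical shape for one aligned word; times are in seconds.
interface WordTiming {
  text: string;
  start: number;
  end: number;
}

// Returns the index of the word being spoken at playback time t, or -1 when
// t falls before the first word, after the last, or in a gap between words.
// Assumes `words` is sorted by start time and the intervals do not overlap.
function activeWordIndex(words: WordTiming[], t: number): number {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = words[mid];
    if (t < w.start) {
      hi = mid - 1; // t is earlier than this word
    } else if (t >= w.end) {
      lo = mid + 1; // t is later than this word
    } else {
      return mid; // start <= t < end
    }
  }
  return -1;
}

// Made-up timings for illustration only.
const demoWords: WordTiming[] = [
  { text: "Type", start: 0.0, end: 0.3 },
  { text: "a", start: 0.3, end: 0.4 },
  { text: "script", start: 0.5, end: 0.9 },
];
```

A renderer could call activeWordIndex once per video frame with the current playback time and style the returned word as highlighted. A binary search keeps that lookup cheap even for long scripts.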
Step 1 — Type your script. Paste the copy for your TikTok, Reel, or Short. Insert pause markers like [pause:500] anywhere you want a breath.
Step 2 — Pick a voice. Audition any of the 28 curated Kokoro-82M voices. Each has its own personality — narrator, friend, reporter, warm, clinical.
Step 3 — Generate and align. Kokoro synthesises the speech locally. Whisper-base then listens to the result and extracts word-level timing, so the karaoke captions land precisely on each word.
Step 4 — Style and export. Choose one of six caption styles. Export as a 9:16 vertical MP4 with captions burned in. Upload straight to TikTok, Reels, Shorts.
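The pause markers from Step 1 could be handled roughly as follows. This is a sketch under assumptions: the number in [pause:500] is treated as milliseconds, and the Segment type and parseScript function are hypothetical names introduced for illustration rather than PixVoice's real parser.

```typescript
// A script becomes an ordered list of spoken text and silent pauses.
type Segment =
  | { kind: "text"; text: string }
  | { kind: "pause"; ms: number }; // assumption: the marker value is milliseconds

// Splits a script on [pause:N] markers, keeping text and pauses in order.
function parseScript(script: string): Segment[] {
  const segments: Segment[] = [];
  const marker = /\[pause:(\d+)\]/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(script)) !== null) {
    // Text between the previous marker (or the start) and this marker.
    const before = script.slice(last, m.index).trim();
    if (before) segments.push({ kind: "text", text: before });
    segments.push({ kind: "pause", ms: Number(m[1]) });
    last = m.index + m[0].length;
  }
  // Any text left after the final marker.
  const tail = script.slice(last).trim();
  if (tail) segments.push({ kind: "text", text: tail });
  return segments;
}

const parsed = parseScript("Hello there. [pause:500] Welcome back.");
```

Each text segment would then go to the speech model, with silence of the requested length placed between the resulting audio clips.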
Everything happens in your browser. The Kokoro and Whisper models download once (~700 MB combined) and are cached by the browser, so every later video is generated locally and privately with no further download.
CapCut and Canva have auto-captions, but they lock TTS behind subscriptions and route every export through their servers. Speechify and ElevenLabs offer premium voices but don't ship caption alignment or vertical video export. Free alternatives usually watermark the output or cap daily generations.
PixVoice sits where those lanes don't overlap: neural-quality TTS (Kokoro-82M), burned-in karaoke captions (Whisper word alignment), vertical export, and zero gatekeeping. The trade-off is a ~700 MB first-run model download. After that, every export is free and local forever.
Yes — fully free, forever. No account, no signup, no credit limits, no watermark. The models download once into your browser cache and run locally after that.
No. Kokoro-82M (TTS) and Whisper-base (caption alignment) run entirely in your browser via WebAssembly, with WebGPU used when available. There's no upload endpoint on our side, so we couldn't see your script even if we wanted to.
Vertical 9:16 MP4 — ready for TikTok, Instagram Reels, and YouTube Shorts. Six caption styling presets cover the most common creator looks.
No. WebAssembly is the default backend and runs everywhere Chrome, Edge, or Firefox runs. WebGPU, when available, makes generation 2-5x faster but isn't required.
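A backend check along these lines is one way to prefer WebGPU and fall back to WebAssembly. It relies on the standard navigator.gpu.requestAdapter() call, which resolves to null when no suitable adapter exists. The Backend type, the GpuCapable interface, and pickBackend are illustrative names, and PixVoice's own detection logic may differ.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural type for the part of `navigator` this sketch touches.
interface GpuCapable {
  gpu?: { requestAdapter(): Promise<unknown | null> };
}

// Chooses WebGPU only when the browser exposes it AND an adapter is
// actually available; every other case falls back to WebAssembly.
async function pickBackend(nav: GpuCapable | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure while probing the GPU as "not available".
    return "wasm";
  }
}
```

In a browser this would be called as pickBackend(navigator as GpuCapable). Passing the navigator object in as a parameter keeps the function easy to exercise with stand-in objects.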
28 curated Kokoro voices across American English, British English, and seven additional languages. Each voice is tagged by character: narrator, warm, reporter, clinical, and more.