PixVoice
Karaoke captions · word-by-word

Free karaoke caption generator in your browser.

Type a script. Pick a voice. Export a TikTok-ready vertical video with burned-in word-by-word captions that hit the beat. All on-device. No account. No watermark.

28 voices

American and British English plus seven more languages. Curated, character-driven, ready to ship.

Word-level alignment

Whisper-base transcribes the generated audio to recover word-level timings, so each caption highlights as its word is spoken.

Six caption styles

Six preset looks covering common creator aesthetics — bold pop, minimal underline, kinetic one-word, and more. Burned into the vertical export.
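As a rough illustration of how word-level timings can drive karaoke highlighting, here is a minimal sketch that looks up which word is active at a given playback time. The `WordTiming` shape and the `activeWordIndex` name are assumptions made for this example, not PixVoice's actual internals.

```typescript
// Hypothetical shape for one aligned word; the field names are assumptions.
interface WordTiming {
  word: string;
  start: number; // seconds from the start of the audio
  end: number;   // seconds from the start of the audio
}

// Return the index of the word being spoken at time t, or -1 during silence.
// Assumes timings are sorted by start time and do not overlap.
function activeWordIndex(timings: WordTiming[], t: number): number {
  let lo = 0;
  let hi = timings.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = timings[mid]!;
    if (t < w.start) {
      hi = mid - 1; // t is before this word, search the left half
    } else if (t >= w.end) {
      lo = mid + 1; // t is after this word, search the right half
    } else {
      return mid; // start <= t < end
    }
  }
  return -1;
}
```

A renderer could call this once per video frame with the current audio time and apply the highlight style to the returned index. Binary search keeps the lookup cheap even for long scripts.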

How it works

Text in, vertical video out.
Zero server.

Step 1 — Type your script. Paste the copy for your TikTok, Reel, or Short. Insert pause markers like [pause:500] anywhere you want a breath.

Step 2 — Pick a voice. Audition any of the 28 curated Kokoro-82M voices. Each has its own personality — narrator, friend, reporter, warm, clinical.

Step 3 — Generate and align. Kokoro synthesises the speech locally. Whisper-base then listens to the result and extracts word-level timing, so the karaoke captions land precisely on each word.

Step 4 — Style and export. Choose one of six caption styles. Export as a 9:16 vertical MP4 with captions burned in. Upload straight to TikTok, Reels, or Shorts.
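The pause markers from Step 1 could be handled along these lines. This is only a sketch: it assumes the number in `[pause:500]` is a duration in milliseconds, and the `Segment` type and `parseScript` function are hypothetical names for this example.

```typescript
// A script is split into spoken text and silent gaps.
type Segment =
  | { kind: "text"; text: string }
  | { kind: "pause"; ms: number }; // assumption: the marker value is milliseconds

// Split a script on [pause:N] markers, keeping text and pauses in order.
function parseScript(script: string): Segment[] {
  const segments: Segment[] = [];
  const marker = /\[pause:(\d+)\]/g;
  let cursor = 0;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(script)) !== null) {
    // Text between the previous marker (or the start) and this marker.
    const text = script.slice(cursor, m.index).trim();
    if (text) segments.push({ kind: "text", text });
    segments.push({ kind: "pause", ms: Number(m[1]) });
    cursor = m.index + m[0].length;
  }
  // Whatever follows the last marker.
  const tail = script.slice(cursor).trim();
  if (tail) segments.push({ kind: "text", text: tail });
  return segments;
}
```

Each text segment would then go to the speech model, with silence of the requested length inserted between the resulting audio clips.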

Everything happens in your browser. The Kokoro and Whisper models download once (~700 MB combined) and stay in your browser cache, so every later video is generated locally and privately with no further download.

Why PixVoice vs. the alternatives

The free lane nobody else occupies.

CapCut and Canva have auto-captions, but their TTS is locked behind subscriptions and every export routes through their servers. Speechify and ElevenLabs offer premium voices but don't ship caption alignment or vertical video export. Free alternatives usually watermark the output or cap daily generations.

PixVoice sits where those lanes don't overlap: neural-quality TTS (Kokoro-82M), burned-in karaoke captions (Whisper word alignment), vertical export, and zero gatekeeping. The trade-off is a ~700 MB first-run model download. After that, every export is free and local forever.

Frequently asked

Answers.

Is this really free?

Yes — fully free, forever. No account, no signup, no credit limits, no watermark. The models download once into your browser cache and run locally after that.

Does my text or audio ever leave my device?

No. Kokoro-82M (TTS) and Whisper-base (caption alignment) run entirely in your browser via WebGPU and WebAssembly. There's no upload endpoint on our side — technically, we couldn't see your script even if we wanted to.

What platforms are the exports for?

Vertical 9:16 MP4 — ready for TikTok, Instagram Reels, and YouTube Shorts. Six caption styling presets cover the most common creator looks.

Does it need WebGPU?

No. WebAssembly is the default backend and runs everywhere Chrome, Edge, or Firefox runs. WebGPU, when available, makes generation 2-5x faster but isn't required.
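The fallback described above can be sketched as a simple feature check. `navigator.gpu` is the standard WebGPU entry point. `Backend`, `NavigatorLike`, and `pickBackend` are illustrative names, and this is not necessarily how PixVoice selects its backend.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal view of the browser's navigator object.
// `gpu` is only present in browsers that expose WebGPU.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes it; otherwise fall back to WebAssembly.
function pickBackend(nav: NavigatorLike | undefined): Backend {
  return nav !== undefined && nav.gpu != null ? "webgpu" : "wasm";
}
```

In a browser you would pass the global `navigator` (cast as needed). A more thorough check would also await `navigator.gpu.requestAdapter()`, which can resolve to `null` on machines without a usable GPU, and fall back to WebAssembly in that case too.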

How many voices are there?

28 curated Kokoro voices across American English, British English, and seven additional languages. Each voice is tagged by character: narrator, warm, reporter, clinical, and more.

The studio awaits.

Start generating