Audio Generator Node

The Audio Generator node is a unified audio creation engine that supports three modes: voice (text-to-speech), music generation, and sound effects. It integrates directly into your Flow Studio workflows, enabling you to produce narration, soundtracks, and ambient audio alongside your visual content.

What It Does

This node generates audio content from text input. Depending on the mode, it can convert text to natural-sounding speech, compose original music tracks, or create sound effects from descriptions. Generated audio is saved to your asset library and can be connected to a Video Combiner Node for video-audio merging.

Modes

The Audio Generator operates in three distinct modes:

Mode	Description	Models
Voice	Text-to-speech with system or custom voices	MiniMax Speech-02-HD
Music	AI-composed music from genre tags or descriptions	ACE-Step (Standard), ElevenLabs (Premium)
SFX	Sound effects from text descriptions	Stable Audio

Voice Mode

Voice mode converts text into natural-sounding speech using system voices or your own cloned voice profiles.

System Voices

XainFlow includes 27 built-in voices across English (17) and Spanish (10), each with a unique style and personality. System voices are available immediately, with no setup required.

Custom Voice Profiles

You can clone your own voice from a 5–30 second audio sample of clear speech. Custom voices are scoped to your workspace and project. A single voice profile can serve both TTS and video lip-sync, provided it was cloned for both targets.
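
Before uploading a sample, it can help to confirm that it falls inside the 5–30 second window. The following is a minimal sketch using Python's standard `wave` module. It assumes the sample is an uncompressed WAV file (the formats XainFlow accepts are not stated here), and the function names are illustrative rather than part of any XainFlow API. It checks length only, not speech clarity.

```python
import wave

# Sample-length bounds taken from the text above (5-30 seconds).
MIN_SAMPLE_SECONDS = 5.0
MAX_SAMPLE_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_clone_sample(duration_seconds: float) -> bool:
    """Check only the length requirement; speech clarity is not measured."""
    return MIN_SAMPLE_SECONDS <= duration_seconds <= MAX_SAMPLE_SECONDS
```

A typical call would be `is_valid_clone_sample(wav_duration_seconds("my_voice.wav"))`, where the file name is a placeholder.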

Parameter	Description
Voice	Select a system voice or custom voice profile
Text	The text to convert to speech (up to 5,000 characters)
Model	`Speech-02-HD` (recommended)
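
Scripts longer than the 5,000-character limit have to be split across several node executions. One possible way to prepare the text is sketched below: it splits at sentence boundaries where it can and hard-cuts only when a single sentence exceeds the limit. The helper `split_text` is hypothetical and not something XainFlow provides, and whether the limit counts whitespace is an assumption.

```python
import re

MAX_CHARS = 5000  # per-execution text limit from the parameter table


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit; close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the Text field for a separate run.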

Voice Credits

Model	Cost
Speech-02-HD	10 credits per 100 characters
Voice Clone (TTS)	1,500 credits (one-time)
Voice Clone (Video)	7 credits (one-time)
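
As a rough illustration of the speech rate, the sketch below estimates the credits for a piece of text. It assumes billing is proportional to character count and rounded up to a whole credit; the actual rounding rule is not documented here, so treat the result as an estimate. The function name is illustrative.

```python
# Rates from the Voice Credits table.
CREDITS_PER_100_CHARS = 10
CLONE_TTS_CREDITS = 1500   # one-time
CLONE_VIDEO_CREDITS = 7    # one-time


def speech_credits(text: str) -> int:
    """Estimate Speech-02-HD credits, assuming proportional billing rounded up."""
    chars = len(text)
    # Integer ceiling division: ceil(chars * 10 / 100), without floats.
    return -(-chars * CREDITS_PER_100_CHARS // 100)
```

Under these assumptions, a full 5,000-character script costs 500 credits, and a first-time TTS clone adds the one-time 1,500 credits on top.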

Music Mode

Music mode generates original tracks from genre tags or natural language descriptions.

Parameter	Description	Range
Prompt	Genre tags (Standard) or description (Premium)	Free-form text
Duration	Length of the generated track	5–240s (Standard), 10–300s (Premium)
Tier	Generation quality tier	`Standard` or `Premium`
Lyrics	Song lyrics with section markers	Optional (Standard only)
Instrumental	Generate without vocals	On/Off (Premium only)
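
Because several options depend on the tier, it is easy to combine settings that do not belong together. The sketch below encodes the constraints from the table as a small check. `MusicRequest` and `validate` are hypothetical names used for illustration only; they do not correspond to a XainFlow type or API.

```python
from dataclasses import dataclass
from typing import Optional

# Duration limits per tier, in seconds, from the parameter table.
DURATION_LIMITS = {"Standard": (5, 240), "Premium": (10, 300)}


@dataclass
class MusicRequest:
    """Hypothetical container for Music mode settings."""
    prompt: str
    duration: float
    tier: str = "Standard"
    lyrics: Optional[str] = None
    instrumental: bool = False


def validate(request: MusicRequest) -> list[str]:
    """Return a list of problems; an empty list means the settings agree."""
    if request.tier not in DURATION_LIMITS:
        return [f"unknown tier: {request.tier!r}"]
    problems = []
    low, high = DURATION_LIMITS[request.tier]
    if not low <= request.duration <= high:
        problems.append(f"duration must be {low}-{high}s for {request.tier}")
    if request.lyrics and request.tier != "Standard":
        problems.append("lyrics are Standard only")
    if request.instrumental and request.tier != "Premium":
        problems.append("instrumental is Premium only")
    if not request.prompt.strip():
        problems.append("prompt is empty")
    return problems
```

For example, `validate(MusicRequest("lofi, chill", 30))` returns an empty list, while a 5-second Premium request is flagged for its duration.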

Music Credits

Tier	Model	Cost
Standard	ACE-Step	~0.2 credits/second (minimum 1)
Premium	ElevenLabs	800 credits/minute
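
To compare the two tiers for a given track length, the rates above can be turned into a quick estimate. This sketch treats the approximate Standard rate as exactly 1 credit per 5 seconds and assumes Premium is prorated per second and rounded up; both are assumptions, so the figures are estimates only. The function name is illustrative.

```python
def music_credits(duration_seconds: int, tier: str) -> int:
    """Estimate Music mode credits for a track of whole-second length."""
    if tier == "Standard":
        # ~0.2 credits/second is about 1 credit per 5 seconds; minimum of 1.
        return max(1, -(-duration_seconds // 5))
    if tier == "Premium":
        # 800 credits/minute, prorated and rounded up (assumption).
        return -(-duration_seconds * 800 // 60)
    raise ValueError(f"unknown tier: {tier!r}")
```

Under these assumptions, a 60-second track costs about 12 credits on Standard and 800 credits on Premium.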

Tip: Standard tier is ideal for background music and short loops. Use Premium for commercial-quality tracks where you need the highest audio fidelity.

SFX Mode

SFX mode generates sound effects from text descriptions, such as explosions, ambient rain, footsteps, and laser blasts.

Parameter	Description	Range
Description	What the sound effect should be	Free-form text
Duration	Length of the effect	0.5–22 seconds

SFX Credits

Model	Cost
Stable Audio	1 credit per second (minimum 1)
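
The SFX duration range and rate combine into a simple estimate. The sketch below assumes fractional seconds are rounded up to the next whole credit, which is not confirmed here; the function name is illustrative.

```python
import math

# Duration range from the SFX parameter table, in seconds.
MIN_SECONDS, MAX_SECONDS = 0.5, 22.0


def sfx_credits(duration_seconds: float) -> int:
    """Estimate SFX credits: 1 credit per second, minimum 1."""
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds"
        )
    # Fractional seconds are rounded up (assumption).
    return max(1, math.ceil(duration_seconds))
```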

Usage

1. Drag an Audio Generator node onto the canvas.
2. Select the Mode (Voice, Music, or SFX).
3. Connect a Prompt Generator node to the prompt input, or type your text directly into the node.
4. For Voice mode, select a system voice or custom voice profile.
5. Execute the node to generate your audio.

The node includes a built-in audio player with play/pause, seek, and duration display. You can navigate through your execution history to replay past generations.

Related Nodes

Prompt Generator Node -- supply text prompts for any audio mode.
Video Combiner Node -- merge generated audio into video clips.
Video Generator Node -- create videos that pair with your audio.
