Audio Generator Node
The Audio Generator node is a unified audio creation engine that supports three modes: voice (text-to-speech), music generation, and sound effects. It integrates directly into your Flow Studio workflows, enabling you to produce narration, soundtracks, and ambient audio alongside your visual content.
What It Does
This node generates audio content from text input. Depending on the mode, it can convert text to natural-sounding speech, compose original music tracks, or create sound effects from descriptions. Generated audio is saved to your asset library and can be connected to a Video Combiner Node for video-audio merging.
Modes
The Audio Generator operates in three distinct modes:
| Mode | Description | Models |
|---|---|---|
| Voice | Text-to-speech with system or custom voices | MiniMax Speech-02-HD |
| Music | AI-composed music from genre tags or descriptions | ACE-Step (Standard), ElevenLabs (Premium) |
| SFX | Sound effects from text descriptions | Stable Audio |
Voice Mode
Voice mode converts text into natural-sounding speech using system voices or your own cloned voice profiles.
System Voices
XainFlow includes 27 built-in voices, 17 in English and 10 in Spanish, each with a unique style and personality. System voices are available immediately, with no setup required.
Custom Voice Profiles
You can clone your own voice from an audio sample containing 5–30 seconds of clear speech. Custom voices are scoped to your workspace and project. A single voice profile can be used for both TTS and video lip-sync, provided it has been cloned for both targets.
| Parameter | Description |
|---|---|
| Voice | Select a system voice or custom voice profile |
| Text | The text to convert to speech (up to 5,000 characters) |
| Model | Speech-02-HD (recommended) |
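
Scripts longer than the 5,000-character limit need to be split across several Voice executions. The sketch below shows one way to do that in Python, breaking at sentence boundaries where possible. The function name `split_for_tts` is hypothetical and not part of XainFlow; the only value taken from this page is the character limit.

```python
import re

MAX_CHARS = 5000  # per-execution text limit from the parameter table


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the Text field (or supplied by a Prompt Generator) for a separate execution.
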
Voice Credits
| Model | Cost |
|---|---|
| Speech-02-HD | 10 credits per 100 characters |
| Voice Clone (TTS) | 1,500 credits (one-time) |
| Voice Clone (Video) | 7 credits (one-time) |
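
As a rough budgeting aid, the Speech-02-HD rate can be turned into a small estimator. This is a sketch under a stated assumption: it treats the cost as strictly proportional to character count, while the table does not say whether billing rounds up to whole 100-character blocks, so actual charges may differ slightly. The function name is hypothetical.

```python
CREDITS_PER_100_CHARS = 10  # Speech-02-HD rate from the table above
MAX_CHARS = 5000            # text limit for a single Voice execution


def estimate_voice_credits(text: str) -> float:
    """Estimate the credits for one Voice execution (assumes linear pricing)."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    return len(text) * CREDITS_PER_100_CHARS / 100


# A 1,200-character narration comes to about 120 credits under this assumption.
print(estimate_voice_credits("a" * 1200))
```

The one-time voice clone fees are separate from per-generation costs and are not included in this estimate.
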
Music Mode
Music mode generates original tracks from genre tags or natural language descriptions.
| Parameter | Description | Range |
|---|---|---|
| Prompt | Genre tags (Standard) or description (Premium) | Free-form text |
| Duration | Length of the generated track | 5–240s (Standard), 10–300s (Premium) |
| Tier | Generation quality tier | Standard or Premium |
| Lyrics | Song lyrics with section markers | Optional (Standard only) |
| Instrumental | Generate without vocals | On/Off (Premium only) |
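
Because several Music parameters depend on the tier, it can help to check a configuration before executing the node. The following Python sketch encodes the constraints from the table above. `MusicSettings` and `validate_music_settings` are hypothetical names used for illustration and do not correspond to a XainFlow API.

```python
from dataclasses import dataclass
from typing import Optional

# Duration limits in seconds per tier, taken from the parameter table.
DURATION_LIMITS = {"standard": (5, 240), "premium": (10, 300)}


@dataclass
class MusicSettings:  # hypothetical container, not a XainFlow type
    prompt: str
    duration_s: float
    tier: str = "standard"
    lyrics: Optional[str] = None
    instrumental: bool = False


def validate_music_settings(s: MusicSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    if s.tier not in DURATION_LIMITS:
        return [f"unknown tier: {s.tier!r}"]
    problems = []
    low, high = DURATION_LIMITS[s.tier]
    if not low <= s.duration_s <= high:
        problems.append(f"duration must be {low}-{high}s for the {s.tier} tier")
    if s.lyrics and s.tier != "standard":
        problems.append("lyrics are available on the Standard tier only")
    if s.instrumental and s.tier != "premium":
        problems.append("Instrumental is available on the Premium tier only")
    return problems
```

For example, a Premium request with lyrics and a 5-second duration would report two problems: the duration is below the Premium minimum, and lyrics are a Standard-only option.
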
Music Credits
| Tier | Model | Cost |
|---|---|---|
| Standard | ACE-Step | ~0.2 credits/second (minimum 1) |
| Premium | ElevenLabs | 800 credits/minute |
Standard tier is ideal for background music and short loops. Use Premium for commercial-quality tracks where you need the highest audio fidelity.
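
To make the price gap between tiers concrete, the rates above can be compared with a short calculation. This Python sketch assumes the Standard rate is exactly 0.2 credits per second (the table gives it as approximate) and that Premium is billed proportionally per second rather than per started minute; neither detail is confirmed here. The function name is hypothetical.

```python
STANDARD_CREDITS_PER_SECOND = 0.2  # approximate rate (ACE-Step)
STANDARD_MINIMUM_CREDITS = 1.0
PREMIUM_CREDITS_PER_MINUTE = 800   # ElevenLabs


def estimate_music_credits(duration_s: float, tier: str) -> float:
    """Estimate credits for one music generation (assumes proportional billing)."""
    if tier == "standard":
        return max(STANDARD_MINIMUM_CREDITS, STANDARD_CREDITS_PER_SECOND * duration_s)
    if tier == "premium":
        return PREMIUM_CREDITS_PER_MINUTE * duration_s / 60
    raise ValueError(f"unknown tier: {tier!r}")


# A 60-second track: roughly 12 credits on Standard versus 800 on Premium.
for tier in ("standard", "premium"):
    print(tier, estimate_music_credits(60, tier))
```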
SFX Mode
SFX mode generates sound effects from text descriptions, such as explosions, ambient rain, footsteps, and laser blasts.
| Parameter | Description | Range |
|---|---|---|
| Description | What the sound effect should be | Free-form text |
| Duration | Length of the effect | 0.5–22 seconds |
SFX Credits
| Model | Cost |
|---|---|
| Stable Audio | 1 credit per second (minimum 1) |
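
The SFX duration range and rate combine into a simple estimate. In this Python sketch, the function name is hypothetical and the cost is assumed to scale proportionally with duration; whether fractional seconds are rounded when billed is not stated on this page.

```python
SFX_MIN_SECONDS = 0.5
SFX_MAX_SECONDS = 22.0
SFX_CREDITS_PER_SECOND = 1.0  # Stable Audio rate from the table above
SFX_MINIMUM_CREDITS = 1.0


def estimate_sfx_credits(duration_s: float) -> float:
    """Estimate credits for one sound effect, enforcing the documented duration range."""
    if not SFX_MIN_SECONDS <= duration_s <= SFX_MAX_SECONDS:
        raise ValueError(
            f"duration must be between {SFX_MIN_SECONDS} and {SFX_MAX_SECONDS} seconds"
        )
    return max(SFX_MINIMUM_CREDITS, SFX_CREDITS_PER_SECOND * duration_s)
```

Under these assumptions, a half-second click still costs the 1-credit minimum, while a 10-second ambient loop costs about 10 credits.
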
Usage
- Drag an Audio Generator node onto the canvas.
- Select the Mode (Voice, Music, or SFX).
- Connect a Prompt Generator to the prompt input, or type your text directly into the node.
- For Voice mode, select a system voice or custom voice profile.
- Execute the node to generate your audio.
The node includes a built-in audio player with play/pause, seek, and duration display. You can navigate through your execution history to replay past generations.
Related Nodes
- Prompt Generator Node -- supply text prompts for any audio mode.
- Video Combiner Node -- merge generated audio into video clips.
- Video Generator Node -- create videos that pair with your audio.