Audio Generator Node

The Audio Generator node is a unified audio creation engine that supports three modes: voice (text-to-speech), music generation, and sound effects. It integrates directly into your Flow Studio workflows, enabling you to produce narration, soundtracks, and ambient audio alongside your visual content.

What It Does

This node generates audio content from text input. Depending on the mode, it converts text to natural-sounding speech, composes original music tracks, or creates sound effects from descriptions. Generated audio is saved to your asset library and can be connected to a Video Combiner node to merge it with video.

Modes

The Audio Generator operates in three distinct modes:

| Mode | Description | Models |
|---|---|---|
| Voice | Text-to-speech with system or custom voices | MiniMax Speech-02-HD |
| Music | AI-composed music from genre tags or descriptions | ACE-Step (Standard), ElevenLabs (Premium) |
| SFX | Sound effects from text descriptions | Stable Audio |

Voice Mode

Voice mode converts text into natural-sounding speech using system voices or your own cloned voice profiles.

System Voices

XainFlow includes 27 built-in voices across English (17) and Spanish (10), each with a unique style and personality. System voices are available immediately — no setup required.

Custom Voice Profiles

You can clone your own voice from an audio sample (5–30 seconds of clear speech). Custom voices are scoped to your workspace and project. A single voice profile can be used for both TTS and video lip-sync if cloned for both targets.

| Parameter | Description |
|---|---|
| Voice | Select a system voice or custom voice profile |
| Text | The text to convert to speech (up to 5,000 characters) |
| Model | Speech-02-HD (recommended) |
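
Scripts longer than the 5,000-character Text limit have to be split across several node executions. The sketch below shows one way to do that while preferring sentence boundaries. The names `split_for_tts` and `MAX_CHARS` are hypothetical helpers, not part of XainFlow, and the sketch assumes the limit counts characters the same way Python's `len()` does.

```python
import re

MAX_CHARS = 5000  # Text limit from the parameter table above


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible. A single sentence
    longer than `limit` is hard-split at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one chunk on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the Text field of its own Voice-mode execution.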

Voice Credits

| Model | Cost |
|---|---|
| Speech-02-HD | 10 credits per 100 characters |
| Voice Clone (TTS) | 1,500 credits (one-time) |
| Voice Clone (Video) | 7 credits (one-time) |
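
To budget a narration before running it, the Speech-02-HD rate can be turned into a small estimator. This is a rough sketch: the function name is hypothetical, and the page does not say whether partial 100-character blocks are rounded up or pro-rated, so both behaviours are offered and the default (round up) is an assumption.

```python
import math

CREDITS_PER_100_CHARS = 10  # Speech-02-HD rate from the table above
MAX_CHARS = 5000            # Text limit for Voice mode


def estimate_voice_credits(text: str, round_up_blocks: bool = True) -> float:
    """Estimate Speech-02-HD credits for one Voice-mode execution.

    round_up_blocks=True bills every started 100-character block in full
    (an assumption); False pro-rates the final partial block.
    """
    chars = len(text)
    if chars > MAX_CHARS:
        raise ValueError(f"text is {chars} characters; the limit is {MAX_CHARS}")
    blocks = chars / 100
    if round_up_blocks:
        blocks = math.ceil(blocks)
    return blocks * CREDITS_PER_100_CHARS
```

The one-time voice clone fees are separate from this per-character cost and are not included in the estimate.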

Music Mode

Music mode generates original tracks from genre tags or natural language descriptions.

| Parameter | Description | Range |
|---|---|---|
| Prompt | Genre tags (Standard) or description (Premium) | Free-form text |
| Duration | Length of the generated track | 5–240s (Standard), 10–300s (Premium) |
| Tier | Generation quality tier | Standard or Premium |
| Lyrics | Song lyrics with section markers | Optional (Standard only) |
| Instrumental | Generate without vocals | On/Off (Premium only) |
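
The tier-specific constraints in the table above can be checked before a node is executed. The following is a minimal sketch; `MusicRequest` and `validate_music_request` are hypothetical names used for illustration and do not correspond to any documented XainFlow API.

```python
from dataclasses import dataclass
from typing import Optional

# Duration ranges in seconds, taken from the Music parameter table.
DURATION_RANGE = {"standard": (5, 240), "premium": (10, 300)}


@dataclass
class MusicRequest:
    prompt: str
    duration_s: float
    tier: str = "standard"
    lyrics: Optional[str] = None
    instrumental: bool = False


def validate_music_request(req: MusicRequest) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    tier = req.tier.lower()
    if tier not in DURATION_RANGE:
        return [f"unknown tier: {req.tier!r}"]

    errors: list[str] = []
    lo, hi = DURATION_RANGE[tier]
    if not lo <= req.duration_s <= hi:
        errors.append(f"duration must be {lo}-{hi}s for the {tier} tier")
    if req.lyrics and tier != "standard":
        errors.append("lyrics are only supported on the Standard tier")
    if req.instrumental and tier != "premium":
        errors.append("the Instrumental toggle is only available on the Premium tier")
    return errors
```

For example, `validate_music_request(MusicRequest("lofi, chill", 30))` returns an empty list, while a 250-second Standard request is reported as out of range.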

Music Credits

| Tier | Model | Cost |
|---|---|---|
| Standard | ACE-Step | ~0.2 credits/second (minimum 1) |
| Premium | ElevenLabs | 800 credits/minute |
Tip: Standard tier is ideal for background music and short loops. Use Premium for commercial-quality tracks where you need the highest audio fidelity.
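
The gap between the two tiers is easier to see with numbers. The sketch below estimates music credits from the rates in the Music Credits table. It assumes Premium is pro-rated per second rather than billed per started minute, and it treats the approximate Standard rate as exact. Both are assumptions, and the function name is hypothetical.

```python
STANDARD_CREDITS_PER_SECOND = 0.2  # approximate rate ("~0.2") from the table
STANDARD_MINIMUM = 1.0
PREMIUM_CREDITS_PER_MINUTE = 800


def estimate_music_credits(duration_s: float, tier: str) -> float:
    """Estimate credits for one music generation of `duration_s` seconds."""
    tier = tier.lower()
    if tier == "standard":
        return max(STANDARD_MINIMUM, STANDARD_CREDITS_PER_SECOND * duration_s)
    if tier == "premium":
        # Assumption: pro-rated by the second, not rounded up to whole minutes.
        return PREMIUM_CREDITS_PER_MINUTE * duration_s / 60
    raise ValueError(f"unknown tier: {tier!r}")
```

Under these assumptions a 60-second track costs about 12 credits on Standard and 800 on Premium, which is why Standard suits drafts and background loops.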

SFX Mode

SFX mode generates sound effects from text descriptions — explosions, ambient rain, footsteps, laser blasts, and more.

| Parameter | Description | Range |
|---|---|---|
| Description | What the sound effect should be | Free-form text |
| Duration | Length of the effect | 0.5–22 seconds |

SFX Credits

| Model | Cost |
|---|---|
| Stable Audio | 1 credit per second (minimum 1) |
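
The SFX duration range and the one-credit minimum interact for very short effects: anything under one second still costs a full credit. A short sketch of that rule follows; the function name is hypothetical and fractional seconds are assumed to be billed without rounding.

```python
SFX_MIN_SECONDS = 0.5
SFX_MAX_SECONDS = 22
SFX_CREDITS_PER_SECOND = 1.0
SFX_MINIMUM_CREDITS = 1.0


def estimate_sfx_credits(duration_s: float) -> float:
    """Estimate Stable Audio credits for one sound effect."""
    if not SFX_MIN_SECONDS <= duration_s <= SFX_MAX_SECONDS:
        raise ValueError(
            f"duration must be between {SFX_MIN_SECONDS} and {SFX_MAX_SECONDS} seconds"
        )
    # Assumption: fractional seconds are charged as-is above the minimum.
    return max(SFX_MINIMUM_CREDITS, SFX_CREDITS_PER_SECOND * duration_s)
```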

Usage

  1. Drag an Audio Generator node onto the canvas.
  2. Select the Mode (Voice, Music, or SFX).
  3. Connect a Prompt Generator to the prompt input, or type directly.
  4. For Voice mode, select a system voice or custom voice profile.
  5. Execute the node to generate your audio.

The node includes a built-in audio player with play/pause, seek, and duration display. You can navigate through your execution history to replay past generations.