TTS Emotional Voice Tuning Guide (2025-2026)

Last updated: April 30, 2026 | Status: production-ready | Scope: edge-tts, MiniMax TTS, CosyVoice 2.5, Fish Speech v2, gentle girl voice tuning, multi-character pipelines

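1. edge-tts Parameter Tuning

edge-tts calls the online neural TTS service behind Microsoft Edge's Read Aloud feature, which exposes the same family of neural voices as Azure Speech. Rate, pitch, and volume are set through CLI flags. Custom SSML is an Azure Speech feature; recent edge-tts releases no longer accept it.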

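Core parameters

Parameter	Format	Recommended range	Purpose
`rate`	`+x%` / `-x%`	`-30%` to `+40%`	Pacing. `-10%` suits calm narration
`pitch`	`+xHz` / `-xHz` (edge-tts CLI); `+x%` is also valid in Azure SSML	`-100Hz` to `+100Hz`, or `-20%` to `+30%`	Shifts perceived gender and age. `+30Hz` brightens a female voice
`volume`	`+x%` / `-x%` (edge-tts CLI); `+xdB` / `-xdB` in Azure SSML	`-10dB` to `+10dB`	Leaves headroom before dynamic compression. `-3dB` helps prevent clipping

CLI example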

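# --write-media names the output (edge-tts emits MP3); --file would name an input text file.
# The CLI takes volume as a signed percentage, so the -2dB trim is written as roughly -20%.
edge-tts \
  --voice zh-CN-XiaoxiaoNeural \
  --rate="-5%" \
  --pitch="+40Hz" \
  --volume="-20%" \
  --write-media output.mp3 \
  --text "今天天气真不错，我们去散步吧。"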
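
The same call can be assembled from Python with the value formats checked first. This is a minimal sketch: `build_edge_tts_args` is a hypothetical helper, and the accepted formats (signed percent for rate and volume, signed Hz for pitch) are assumed from edge-tts's own argument validation.

```python
import re

# Assumed formats: signed percent for rate/volume, signed Hz for pitch.
_PERCENT = re.compile(r"^[+-]\d+%$")
_HERTZ = re.compile(r"^[+-]\d+Hz$")


def build_edge_tts_args(text, voice, rate="+0%", pitch="+0Hz",
                        volume="+0%", out_path="output.mp3"):
    """Return an argv list for the edge-tts CLI after checking value formats."""
    if not _PERCENT.match(rate):
        raise ValueError(f"rate must look like '+5%' or '-10%', got {rate!r}")
    if not _HERTZ.match(pitch):
        raise ValueError(f"pitch must look like '+40Hz', got {pitch!r}")
    if not _PERCENT.match(volume):
        raise ValueError(f"volume must look like '-10%', got {volume!r}")
    return [
        "edge-tts",
        "--voice", voice,
        # "--flag=value" keeps argparse from reading "-5%" as another option
        f"--rate={rate}",
        f"--pitch={pitch}",
        f"--volume={volume}",
        "--write-media", out_path,
        "--text", text,
    ]


args = build_edge_tts_args(
    "今天天气真不错，我们去散步吧。",
    "zh-CN-XiaoxiaoNeural",
    rate="-5%", pitch="+40Hz", volume="-20%",
)
```

The resulting list can be passed to `subprocess.run(args, check=True)`.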

2. MiniMax TTS: Cloning and Emotion Control

MiniMax's speech-02-turbo (2025+) and speech-02-hd models perform well at zero-shot cloning and fine-grained emotion steering.

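Voice cloning

Upload 5-30 s of reference audio via POST /v1/voice_clone
The call returns a voice_id. Pass it in TTS requests as voice_setting: {"voice_id": "..."}
Pro tip: pad the reference audio with 0.5 s of silence at the start and end, and set clean_audio: true to remove artifacts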
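
The padding tip can be applied with the standard library alone. This is a minimal sketch, assuming a 16-bit (or wider) PCM WAV, where zero bytes are silence; `pad_with_silence` is a hypothetical helper name.

```python
import wave


def pad_with_silence(src, dst, pad_s=0.5):
    """Copy src to dst with pad_s seconds of silence added at both ends.

    Assumes 16-bit or wider PCM, where all-zero bytes mean silence
    (8-bit WAV is unsigned and would need 0x80 instead).
    Returns the number of frames added on each side.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
    pad_frames = int(params.framerate * pad_s)
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # the frame count is corrected on close
        writer.writeframes(silence + frames + silence)
    return pad_frames
```

For example, `pad_with_silence("reference.wav", "reference_padded.wav")` writes the padded copy that is then uploaded.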

Emotion and style parameters

Parameter	Type	Description
`emotion`	`string`	Enum: `happy`, `sad`, `angry`, `fear`, `surprise`, `calm`, `affectionate`
`emotion_strength`	`float`	`0.0`-`1.0`. Around `0.6` tends to sound most natural
`speed`	`float`	`0.5`-`2.0`. `1.0` is the baseline
`vol`	`float`	`0.0`-`2.0`. Linear volume scaling
`pitch`	`int`	`-12` to `+12` semitones
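
Style can also be steered with a natural-language text_prompt inside voice_setting:

`"voice_setting": { "voice_id": "voice_zh_girl_01", "text_prompt": "用温柔、略带害羞的少女音色，语速轻柔，尾音微微上扬。" }`

(The prompt asks for a gentle, slightly shy girl's timbre, soft pacing, and slightly rising sentence endings.)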
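
Range checks catch out-of-bounds values before a request is sent. This is a minimal sketch that uses the field names and ranges from the parameter table in this section; `build_voice_setting` is a hypothetical helper, and nothing here has been checked against the live MiniMax API.

```python
EMOTIONS = {"happy", "sad", "angry", "fear", "surprise", "calm", "affectionate"}


def build_voice_setting(voice_id, emotion="calm", emotion_strength=0.6,
                        speed=1.0, vol=1.0, pitch=0):
    """Validate values against the documented ranges and return voice_setting."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    ranges = {
        "emotion_strength": (emotion_strength, 0.0, 1.0),
        "speed": (speed, 0.5, 2.0),
        "vol": (vol, 0.0, 2.0),
        "pitch": (pitch, -12, 12),  # semitones
    }
    for name, (value, low, high) in ranges.items():
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return {
        "voice_id": voice_id,
        "emotion": emotion,
        "emotion_strength": emotion_strength,
        "speed": speed,
        "vol": vol,
        "pitch": int(pitch),
    }


setting = build_voice_setting(
    "voice_zh_girl_01", emotion="affectionate", speed=0.95, pitch=2
)
```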

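3. CosyVoice 2 / 2.5 Techniques

CosyVoice 2+ pairs a large language model (LLM) front end with a flow-matching acoustic back end to deliver instruction-following TTS.

Architecture highlights

Zero-shot: a 3 s prompt clip is enough to clone timbre and prosody
Instruct mode: pass a natural-language instruct_text to control emotion, speed, or speaking style
Cross-lingual: native ZH/EN/JP/KR support with no re-tuning

Key inference parameters (Python SDK)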

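import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.common import set_all_random_seed
from cosyvoice.utils.file_utils import load_wav

model = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
reference_audio = load_wav('reference.wav', 16000)  # 3-10 s prompt, loaded at 16 kHz
set_all_random_seed(42)  # fix the seed so sampling is repeatable

# inference_instruct2 is a generator that yields one dict per synthesized chunk
for i, out in enumerate(model.inference_instruct2(
    "今晚的月色真美。",
    "请用温柔少女音，带着怀念的语气，语速偏慢地朗读",
    reference_audio,
    stream=False,
    speed=0.9,
)):
    torchaudio.save(f'output_{i}.wav', out['tts_speech'], model.sample_rate)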

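4. Fish Speech v2

Fish Speech (v1.5 → v2.0) uses a VQ-GAN audio tokenizer and an autoregressive LLM. It is optimized for low-VRAM deployment and faithful prosody reproduction.

Parameter configuration

Parameter	Range	Effect
`reference_text`	`string`	Ground-truth transcript of the prompt audio. Must match the spoken words exactly
`max_new_tokens`	`100`-`300`	Caps the audio length. `~250` ≈ 10 s
`top_p`	`0.7`-`0.95`	Higher = more expressive and varied; lower = more stable
`temperature`	`0.6`-`0.9`	`0.75` is a good starting point for naturalness
`speed`	`0.5`-`2.0`	Time-stretch applied after generation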
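
The `~250` ≈ 10 s rule of thumb in the table works out to about 25 tokens per second, which gives a quick way to size `max_new_tokens` for a target duration. This is a minimal sketch: `estimate_max_new_tokens` is a hypothetical helper, and the rate is only as accurate as that rule of thumb.

```python
# From the table's "~250 tokens is about 10 s" rule of thumb.
TOKENS_PER_SECOND = 250 / 10


def estimate_max_new_tokens(seconds, low=100, high=300):
    """Estimate max_new_tokens for a target duration, clamped to the table's range."""
    return max(low, min(high, round(seconds * TOKENS_PER_SECOND)))


budget = estimate_max_new_tokens(6)  # 6 s * 25 tokens/s = 150
```

Targets longer than about 12 s hit the upper bound of 300, so longer scripts are better split into sentences first.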

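v2 pro tips

Fish Speech v2 introduces prompt_mode="exact" and prompt_mode="semantic". Use "exact" to reproduce the reference rhythm precisely; use "semantic" to let the model adapt the reference timbre to the natural rhythm of the target script.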

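5. Tuning the "Gentle Girl" Voice

A consistent "gentle girl" timbre comes from balancing acoustic parameters, prompt engineering, and SSML.

Acoustic baseline

Parameter	Target	Rationale
Base pitch	`+20Hz` to `+40Hz`	Mimics the higher resonance of a smaller vocal tract
Speaking rate	`-5%` to `-10%`	A calm pace suggests care and gentleness
Volume	`-3dB` to `-5dB`	Softer delivery reduces perceived aggression
Breathiness	High	Set through the prompt: `"气声较多，发音轻柔"` ("breathy, soft articulation")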
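
Prompt template (kept in Chinese because it is passed to the model verbatim):

`音色：18-22岁少女，声线清澈偏细` (timbre: a girl aged 18-22, clear and slightly thin voice)
`语气：温柔、关怀、略带气声` (tone: gentle, caring, slightly breathy)
`节奏：语速中等偏慢，句尾自然拖长，无突兀重音` (rhythm: medium-slow pace, sentence endings drawn out naturally, no abrupt stress)
`情感：温暖、亲切、自然流露` (emotion: warm, friendly, unforced)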

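SSML example (Azure Speech; recent edge-tts releases do not accept custom SSML)

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
  <voice name="zh-CN-XiaoxiaoNeural">
    <mstts:express-as style="affectionate" styledegree="1.2">
      <prosody rate="-8%" pitch="+10%" volume="-3dB">
        你最近过得好吗？<break time="400ms"/>要注意休息哦。
      </prosody>
    </mstts:express-as>
  </voice>
</speak>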
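
Generating the SSML from code avoids hand-editing mistakes such as an undeclared `mstts` prefix. This is a minimal sketch using the standard library's `xml.etree.ElementTree`; `build_ssml` is a hypothetical helper, not part of any SDK, and it handles plain text only (no inline `<break>` elements).

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements without a prefix and Microsoft extensions as mstts:
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_ssml(text, voice, style, styledegree=1.0,
               rate="+0%", pitch="+0%", volume="+0dB", lang="zh-CN"):
    """Return an SSML string: speak > voice > express-as > prosody > text."""
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    express = ET.SubElement(voice_el, f"{{{MSTTS_NS}}}express-as",
                            {"style": style, "styledegree": str(styledegree)})
    prosody = ET.SubElement(express, f"{{{SSML_NS}}}prosody",
                            {"rate": rate, "pitch": pitch, "volume": volume})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("你最近过得好吗？要注意休息哦。", "zh-CN-XiaoxiaoNeural",
                  "affectionate", styledegree=1.2,
                  rate="-8%", pitch="+10%", volume="-3dB")
```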

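6. Multi-Character Voice Pipeline

Pipeline architecture

Script parsing: a regex extracts lines of the form [character]: line
Speaker mapping: a JSON config maps each alias → voice_id / prompt_audio
Chunking: split by sentence, with a 200 ms lookahead for crossfades
Async TTS router: concurrency capped at 4-8 to stay under rate limits
Stitching and post-processing:
  Apply a 15 ms constant-power crossfade at every boundary
  Finish with loudnorm to EBU R128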
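
The stitching step can be illustrated in pure Python. This is a minimal sketch of a constant-power crossfade over two mono sample lists, assuming float samples and a 24 kHz rate; `equal_power_crossfade` is a hypothetical helper, and a production pipeline would more likely work on arrays.

```python
import math


def equal_power_crossfade(a, b, sample_rate=24000, fade_ms=15):
    """Join two mono sample sequences with a constant-power crossfade."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    start = len(a) - n
    mixed = []
    for i in range(n):
        theta = (i + 0.5) / n * math.pi / 2
        # cos^2 + sin^2 = 1, so the summed power stays constant across the fade
        mixed.append(a[start + i] * math.cos(theta) + b[i] * math.sin(theta))
    return list(a[:start]) + mixed + list(b[n:])


first = [0.5] * 1000
second = [0.25] * 1000
joined = equal_power_crossfade(first, second)
```

At 24 kHz a 15 ms fade spans 360 samples, so the joined clip is 360 samples shorter than the two inputs laid end to end.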

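7. Voice Quality and Post-Processing

Stage	Technique	Tool / command
Upsampling	16/22 kHz → 24 kHz	`torchaudio.transforms.Resample`, `soxr`
Denoising	Noise reduction (afftdn is an FFT denoiser; it does not remove reverb)	`ffmpeg -af "afftdn=nf=-25"`
Loudness	EBU R128 loudness normalization	`ffmpeg -af loudnorm=I=-16:TP=-1.5:LRA=11`
Dynamic range	Soft compression	`ffmpeg -af "acompressor=threshold=-20dB:ratio=4"`
Encoding	Archive vs. distribution	FLAC (lossless), Opus `192kbps` (web)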
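
The three ffmpeg filter stages in the table can be chained in a single pass. This is a minimal sketch that only builds the command; `build_filter_chain` and `build_ffmpeg_cmd` are hypothetical helpers, and the defaults simply mirror the table values.

```python
def build_filter_chain(denoise_nf=-25, lufs=-16, true_peak=-1.5, lra=11,
                       comp_threshold_db=-20, comp_ratio=4):
    """Return an ffmpeg -af string: denoise, then loudness, then compression."""
    filters = [
        f"afftdn=nf={denoise_nf}",
        f"loudnorm=I={lufs}:TP={true_peak}:LRA={lra}",
        f"acompressor=threshold={comp_threshold_db}dB:ratio={comp_ratio}",
    ]
    return ",".join(filters)


def build_ffmpeg_cmd(src, dst, sample_rate=24000):
    """Return an argv list that applies the chain and writes dst at sample_rate."""
    return ["ffmpeg", "-y", "-i", src,
            "-af", build_filter_chain(),
            "-ar", str(sample_rate), dst]


cmd = build_ffmpeg_cmd("input.wav", "output.wav")
```

Run it with `subprocess.run(cmd, check=True)` once ffmpeg is on the PATH.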

8. [New] Advanced Techniques for 2026

8.1 edge-tts Voice List and Selection

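Recommended Chinese voices (availability differs between Azure Speech and edge-tts, so confirm with edge-tts --list-voices):

Voice	Gender	Style	Best for
zh-CN-XiaoxiaoNeural	Female	Gentle / warm / lively	General use; the strongest candidate for the gentle girl voice
zh-CN-XiaoyiNeural	Female	Cute / sweet / youthful	Loli voices, anime characters
zh-CN-YunxiNeural	Male	Young / sunny	Teenage boy characters
zh-CN-YunjianNeural	Male	Mature / steady	Narrators, voice-over
zh-CN-XiaochenNeural	Female	Child	Children's content
zh-CN-XiaohanNeural	Female	Steady / intellectual	News reading, teaching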

Multilingual voices:

# List all available voices
edge-tts --list-voices

# Japanese female voice
edge-tts --voice ja-JP-NanamiNeural --text "こんにちは" --write-media jp.mp3

# English female voice
edge-tts --voice en-US-AriaNeural --text "Hello there" --write-media en.mp3
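
Voice selection can also be scripted. This is a minimal sketch that filters a list of voice records by locale and gender. The keys `ShortName`, `Locale`, and `Gender` are assumed to match what `edge_tts.list_voices()` returns, and the sample data is hard-coded so the example runs offline.

```python
def pick_voices(voices, locale, gender=None):
    """Return the sorted short names of voices matching locale and gender."""
    return sorted(
        v["ShortName"]
        for v in voices
        if v["Locale"] == locale and (gender is None or v["Gender"] == gender)
    )


# Hard-coded sample; in practice the list would come from edge_tts.list_voices()
sample_voices = [
    {"ShortName": "zh-CN-XiaoxiaoNeural", "Locale": "zh-CN", "Gender": "Female"},
    {"ShortName": "zh-CN-XiaoyiNeural", "Locale": "zh-CN", "Gender": "Female"},
    {"ShortName": "zh-CN-YunxiNeural", "Locale": "zh-CN", "Gender": "Male"},
    {"ShortName": "ja-JP-NanamiNeural", "Locale": "ja-JP", "Gender": "Female"},
]

female_zh = pick_voices(sample_voices, "zh-CN", "Female")
```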

8.2 SSML Advanced Markup Reference

```xml
<!-- These fragments belong inside <speak><voice>...</voice></speak>, as in the section 5 example -->

<!-- Emotional expression -->
<mstts:express-as style="affectionate" styledegree="1.2">
  亲爱的...
</mstts:express-as>

<!-- Style values (support varies by voice) -->
<!-- advertisement_upbeat, affectionate, angry, assistant, calm, 
     chat, cheerful, customerservice, depressed, disgruntled, 
     embarrassed, envious, empathetic, fearful, gentle, 
     lyrical, narration-professional, narration-relaxed, 
     newscast, newscast-casual, sad, serious, shouting, 
     sports_commentary, sports_commentary_excited, 
     terrified, unfriendly, whispering -->

<!-- Precise pauses -->
今天<break time="500ms"/>我们去公园<break time="300ms"/>散步吧。

<!-- Speaking rate -->
<prosody rate="-10%">慢慢说</prosody>
<prosody rate="+20%">快速说</prosody>

<!-- Pitch -->
<prosody pitch="+20%">高音</prosody>
<prosody pitch="-10%">低音</prosody>

<!-- Volume -->
<prosody volume="-6dB">轻声</prosody>
<prosody volume="+3dB">大声</prosody>

<!-- Emphasis -->
<emphasis level="strong">非常重要!</emphasis>
<emphasis level="moderate">比较重要</emphasis>
<emphasis level="reduced">不太重要</emphasis>

<!-- Pronunciation control -->
<say-as interpret-as="date">2026-04-30</say-as>
<say-as interpret-as="time">14:30</say-as>
<say-as interpret-as="characters">ABC</say-as>

<!-- Combined usage: express-as wraps prosody -->
<mstts:express-as style="gentle" styledegree="1.5">
  <prosody rate="-5%" pitch="+15%">
    你今天<break time="200ms"/>看起来心情不错呢。
  </prosody>
</mstts:express-as>
```

8.3 CosyVoice 2.5 Instruction-Following Guide

Instruction templates:

```python
# Gentle girl voice
instruct_text = "请用温柔少女的声音，带着关怀的语气，语速偏慢，尾音微微上扬，像在跟好朋友聊天"

# Mature, intellectual female voice
instruct_text = "请用成熟知性女性的声音，语气平稳自信，语速中等，像在讲述一个故事"

# Lively, cute loli voice
instruct_text = "请用可爱萝莉的声音，语气活泼欢快，语速稍快，带着笑意，像小女孩在撒娇"

# Steady male narrator
instruct_text = "请用沉稳男性声音朗读，语气庄重，语速偏慢，像在朗读纪录片旁白"

# Surprised / excited
instruct_text = "请用惊讶的语气说，语速加快，音高提升，带着不可思议的感觉"
```

New tag syntax in CosyVoice 2.5:

# Use control tags inside instruct_text
instruct = (
    "用温柔少女音朗读，"
    "语气轻柔[停顿0.5秒]"
    "然后带着笑意说："
    "[语速-10%]你最近过得好吗？"
    "[停顿0.3秒]"
    "[语气+关怀]要注意休息哦。"
)

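8.4 Fish Speech v2 Zero-Shot Cloning Best Practices

Reference audio requirements:
- Duration: 3-30 s (5-15 s works best)
- Sample rate: 22050 Hz or 24000 Hz
- Format: WAV (uncompressed)
- Content: natural speech; avoid singing, read-aloud delivery, and theatrical voices
- Quality: no background noise, no reverb, clear articulation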
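
The measurable requirements above (duration and sample rate) can be checked before a clip is used as a prompt. This is a minimal sketch using the standard `wave` module; `check_reference_wav` is a hypothetical helper and cannot judge content or noise quality.

```python
import wave

ALLOWED_RATES = {22050, 24000}


def check_reference_wav(path, min_s=3.0, max_s=30.0):
    """Return a list of problems; an empty list means the clip passes."""
    problems = []
    with wave.open(path, "rb") as reader:
        rate = reader.getframerate()
        duration = reader.getnframes() / rate
    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate {rate} Hz is not 22050 or 24000")
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.1f} s is outside {min_s}-{max_s} s")
    return problems
```

`wave.open` raises `wave.Error` for compressed or non-WAV input, which covers the format requirement as well.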

Cloning workflow:

from fish_speech.inference import TTSInference

model = TTSInference(
    config_path="fish_speech/configs/speech.json",
    checkpoint_path="checkpoints/fish-speech-v2.0.pth"
)

# Zero-shot inference
audio = model.inference(
    text="你好，我是你的语音助手",
    prompt_text="你好呀，今天天气真不错",  # transcript of the reference audio
    prompt_audio="reference.wav",
    max_new_tokens=250,
    top_p=0.8,
    temperature=0.75,
    repetition_penalty=1.2,
    speed=1.0
)

8.5 Complete Multi-Character Audio Production Pipeline

#!/usr/bin/env python3
"""
Multi-character TTS production pipeline.
Routes each line to edge-tts or MiniMax; a CosyVoice branch is not implemented here.
"""
import asyncio
import json
import os
import re
from dataclasses import dataclass
from typing import List

import httpx


@dataclass
class DialogueLine:
    character: str
    text: str
    emotion: str = "neutral"


@dataclass
class CharacterConfig:
    name: str
    tts_engine: str  # "edge" or "minimax"
    voice_id: str
    rate: str = "+0%"
    pitch: str = "+0Hz"
    emotion: str = "calm"
    emotion_strength: float = 0.6


def _signed(value: str) -> str:
    """edge-tts expects an explicit sign, e.g. "+0%" rather than "0%"."""
    return value if value[:1] in ("+", "-") else f"+{value}"


class MultiRoleTTSPipeline:
    def __init__(self, config_path: str, api_key: str = ""):
        # The MiniMax key comes from MINIMAX_API_KEY unless passed explicitly
        self.api_key = api_key or os.environ.get("MINIMAX_API_KEY", "")
        with open(config_path, encoding="utf-8") as f:
            self.characters = {
                c["name"]: CharacterConfig(**c)
                for c in json.load(f)
            }

    def parse_script(self, script_text: str) -> List[DialogueLine]:
        """Parse script lines of the form "[character]: line"."""
        pattern = r'\[([^\]]+)\]:\s*(.+)'
        lines = []
        for match in re.finditer(pattern, script_text):
            lines.append(DialogueLine(
                character=match.group(1),
                text=match.group(2).strip()
            ))
        return lines

    async def generate_segment(self, line: DialogueLine, index: int) -> str:
        """Synthesize one line with the engine configured for its character."""
        char = self.characters[line.character]
        # Index-based names avoid collisions and keep segments in script order
        raw_file = f"/tmp/tts_{index:04d}.mp3"
        wav_file = f"/tmp/tts_{index:04d}.wav"

        if char.tts_engine == "edge":
            cmd = [
                "edge-tts",
                "--voice", char.voice_id,
                f"--rate={_signed(char.rate)}",
                f"--pitch={_signed(char.pitch)}",
                "--write-media", raw_file,  # edge-tts emits MP3
                "--text", line.text
            ]
            proc = await asyncio.create_subprocess_exec(*cmd)
            if await proc.wait() != 0:
                raise RuntimeError(f"edge-tts failed for line {index}")

        elif char.tts_engine == "minimax":
            # Endpoint and response shape are as given in this guide;
            # check them against the current MiniMax API documentation.
            async with httpx.AsyncClient(timeout=60) as client:
                resp = await client.post(
                    "https://api.minimax.chat/v1/text_to_speech",
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    json={
                        "model": "speech-02-turbo",
                        "text": line.text,
                        "voice_setting": {
                            "voice_id": char.voice_id,
                            "speed": float(char.rate.replace("%", "")) / 100 + 1.0,
                            # MiniMax pitch is in semitones, so clamp the config value to [-12, 12]
                            "pitch": max(-12, min(12, int(char.pitch.replace("Hz", "")))),
                            "vol": 1.0,
                            "emotion": char.emotion,
                            "emotion_strength": char.emotion_strength,
                        }
                    }
                )
                resp.raise_for_status()
                # Download the audio
                audio_url = resp.json()["data"]["audio"]
                audio_data = await client.get(audio_url)
                audio_data.raise_for_status()
            with open(raw_file, "wb") as f:
                f.write(audio_data.content)

        else:
            raise ValueError(f"Unsupported TTS engine: {char.tts_engine}")

        # Convert every segment to the same WAV format so they can be concatenated
        proc = await asyncio.create_subprocess_exec(
            "ffmpeg", "-y", "-i", raw_file, "-ar", "24000", "-ac", "1", wav_file
        )
        if await proc.wait() != 0:
            raise RuntimeError(f"ffmpeg conversion failed for line {index}")
        return wav_file

    async def generate_full(self, script: str) -> str:
        """Generate the complete multi-character audio file."""
        lines = self.parse_script(script)

        # Generate sequentially so segments stay in script order
        segments = []
        for index, line in enumerate(lines):
            seg_path = await self.generate_segment(line, index)
            segments.append(seg_path)

        # Assemble the final audio
        final_path = "/tmp/final_multirole.wav"
        file_list = "/tmp/filelist.txt"

        with open(file_list, "w") as f:
            for seg in segments:
                f.write(f"file '{seg}'\n")

        # No "-c copy" here: loudnorm is a filter, so the audio must be re-encoded
        proc = await asyncio.create_subprocess_exec(
            "ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", file_list,
            "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
            final_path
        )
        await proc.wait()
        return final_path


# Example character config (characters.json)
"""
[
  {
    "name": "小晴",
    "tts_engine": "edge",
    "voice_id": "zh-CN-XiaoxiaoNeural",
    "rate": "-8%",
    "pitch": "+35Hz",
    "emotion": "affectionate",
    "emotion_strength": 0.7
  },
  {
    "name": "旁白",
    "tts_engine": "edge",
    "voice_id": "zh-CN-YunjianNeural",
    "rate": "-5%",
    "pitch": "0Hz",
    "emotion": "calm",
    "emotion_strength": 0.5
  },
  {
    "name": "小月",
    "tts_engine": "minimax",
    "voice_id": "voice_cloned_001",
    "rate": "-10%",
    "pitch": "+25Hz",
    "emotion": "happy",
    "emotion_strength": 0.8
  }
]
"""

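The architecture in section 6 caps concurrency at 4-8, while the script above synthesizes lines one at a time. This is a minimal sketch of bounded concurrency that still returns segments in script order. `synthesize_all` and `fake_synthesize` are hypothetical names; the latter stands in for any coroutine that takes a line and its index and returns a file path.

```python
import asyncio


async def synthesize_all(lines, synthesize, max_concurrency=4):
    """Run synthesize(line, index) for every line, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(index, line):
        async with semaphore:
            return await synthesize(line, index)

    # gather returns results in argument order, not completion order
    return await asyncio.gather(
        *(worker(i, line) for i, line in enumerate(lines))
    )


async def fake_synthesize(line, index):
    """Stand-in for a real TTS call; later lines finish sooner on purpose."""
    await asyncio.sleep(0.01 * (3 - index % 3))
    return f"seg_{index:04d}.wav"


paths = asyncio.run(synthesize_all(["a", "b", "c", "d", "e"], fake_synthesize))
```

Because `asyncio.gather` preserves argument order, the returned paths can be written straight into the concat list even though segments finish out of order.
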
8.6 Emotional Voice Fine-Tuning Parameters

Emotion	rate	pitch	volume	style (SSML express-as)	emotion (MiniMax)
Gentle / caring	`-8%`	`+25Hz`	`-3dB`	`affectionate`	`affectionate`
Happy / lively	`+5%`	`+30Hz`	`0dB`	`cheerful`	`happy`
Sad / low	`-15%`	`-10Hz`	`-6dB`	`sad`	`sad`
Angry	`+10%`	`+15Hz`	`+3dB`	`angry`	`angry`
Fearful	`+5%`	`+20Hz`	`-3dB`	`fearful`	`fear`
Surprised / shocked	`+15%`	`+40Hz`	`+2dB`	`surprised`	`surprise`
Whisper	`-20%`	`-5Hz`	`-8dB`	`whispering`	`calm`
Calm / narration	`-5%`	`0Hz`	`0dB`	`calm`	`calm`
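
The table maps naturally onto a lookup structure. This is a minimal sketch that copies the values from the table; the English keys and the names `Preset`, `PRESETS`, and `preset_for` are hypothetical labels chosen for this example.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    rate: str
    pitch: str
    volume: str
    style: str            # SSML express-as style
    minimax_emotion: str  # MiniMax emotion enum value


# Values copied from the table; the keys are illustrative labels.
PRESETS = {
    "gentle": Preset("-8%", "+25Hz", "-3dB", "affectionate", "affectionate"),
    "happy": Preset("+5%", "+30Hz", "0dB", "cheerful", "happy"),
    "sad": Preset("-15%", "-10Hz", "-6dB", "sad", "sad"),
    "angry": Preset("+10%", "+15Hz", "+3dB", "angry", "angry"),
    "fearful": Preset("+5%", "+20Hz", "-3dB", "fearful", "fear"),
    "surprised": Preset("+15%", "+40Hz", "+2dB", "surprised", "surprise"),
    "whisper": Preset("-20%", "-5Hz", "-8dB", "whispering", "calm"),
    "calm": Preset("-5%", "0Hz", "0dB", "calm", "calm"),
}


def preset_for(emotion, default="calm"):
    """Return the preset for an emotion label, falling back to the default."""
    return PRESETS.get(emotion, PRESETS[default])


sad = preset_for("sad")
```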

8.7 Audio Post-Processing Script

#!/bin/bash
# TTS audio post-processing pipeline

INPUT="input.wav"
OUTPUT="output_final.wav"

# 1. Denoise
ffmpeg -i "$INPUT" -af "afftdn=nf=-25" temp_denoised.wav

# 2. Normalize to -16 LUFS (EBU R128)
ffmpeg -i temp_denoised.wav -af "loudnorm=I=-16:TP=-1.5:LRA=11" temp_loudnorm.wav

# 3. Soft compression (for a warmer sound)
ffmpeg -i temp_loudnorm.wav \
  -af "acompressor=threshold=-24dB:ratio=3:attack=5:release=50:makeup=2dB" \
  temp_compressed.wav

# 4. Light EQ (warms up the voice)
ffmpeg -i temp_compressed.wav \
  -af "equalizer=f=200:t=q:w=2:g=2,equalizer=f=3000:t=q:w=1:g=1,highpass=f=80" \
  "$OUTPUT"

# Remove temporary files
rm -f temp_denoised.wav temp_loudnorm.wav temp_compressed.wav

echo "Output: $OUTPUT"
ffprobe -v quiet -show_entries format=duration,size -of default=noprint_wrappers=1 "$OUTPUT"

Document updated: April 30, 2026 | Sources: Azure Speech documentation, MiniMax API documentation, CosyVoice GitHub, Fish Speech paper