Compare commits: dev/lyuxia ... main (46 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 55e3e370a0 | |
| | 46788c7379 | |
| | 881177287c | |
| | f88a14e41d | |
| | dac6566fc3 | |
| | cc91e40db8 | |
| | ab7f1f4a86 | |
| | e15222b17c | |
| | cfa1c115b2 | |
| | dd5cdb6ebf | |
| | 2d7ef0b719 | |
| | ba5db602a9 | |
| | 5b94675f62 | |
| | 4c19646b9a | |
| | 63a06227d1 | |
| | 3b44913782 | |
| | 055f64d002 | |
| | 4d7295a9a7 | |
| | 8524c81acd | |
| | a14e063ead | |
| | 2db78e7058 | |
| | 7538c6a73d | |
| | 823ae2c60d | |
| | 59cb2bf16c | |
| | 80bebb1978 | |
| | bc34459bb8 | |
| | 9f27b42cd9 | |
| | a7d6e2251a | |
| | 7baefaf0f2 | |
| | ff0d05c380 | |
| | f5816b4e51 | |
| | 8b54619760 | |
| | 2abd42220e | |
| | 2d6bb9bd80 | |
| | 0b80c0746a | |
| | e98b828f33 | |
| | 4d4c787be0 | |
| | 781a49acb4 | |
| | 9476a063b3 | |
| | 3426ceb70f | |
| | 089343ab0a | |
| | 0c50894d49 | |
| | 6816fc6a6f | |
| | e8bf717333 | |
| | fa2781405f | |
| | cd26dd1932 | |
README.md (69 changed lines)
@@ -1,12 +1,12 @@
[](https://github.com/Akshay090/svg-banners)

## 👉🏻 CosyVoice 👈🏻
**Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
**Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/pdf/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/pdf/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
## Highlight🔥
@@ -60,23 +60,25 @@
- [x] Fastapi server and client
## Evaluation
| Model | Model Size | CER (%) ↓ (test-zh) | WER (%) ↓ (test-en) | CER (%) ↓ (test-hard) |
|-------|------------|---------------------|---------------------|-----------------------|
| Human | - | 1.26 | 2.14 | - |
| Seed-TTS | - | 1.12 | 2.25 | 7.59 |
| MiniMax-Speech | - | 0.83 | 1.65 | - |
| F5-TTS | 0.3B | 1.52 | 2.00 | 8.67 |
| SparkTTS | 0.5B | 1.20 | 1.98 | - |
| CosyVoice2 | 0.5B | 1.45 | 2.57 | 6.83 |
| FireRedTTS-2 | 1.5B | 1.14 | 1.95 | - |
| IndexTTS2 | 1.5B | 1.01 | 1.52 | 7.12 |
| VibeVoice | 1.5B | 1.16 | 3.04 | - |
| HiggsAudio-v2 | 3B | 1.50 | 2.44 | - |
| VoxPCM | 0.5B | 0.93 | 1.85 | 8.87 |
| GLM-TTS | 1.5B | 1.03 | - | - |
| GLM-TTS_RL | 1.5B | 0.89 | - | - |
| Fun-CosyVoice3-0.5B-2512 | 0.5B | 1.21 | 2.24 | 6.71 |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | 0.81 | 1.68 | 5.44 |

| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>SS (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>SS (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>SS (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |

## Install
@@ -111,7 +113,7 @@
We strongly recommend that you download our pretrained `Fun-CosyVoice3-0.5B` `CosyVoice2-0.5B` `CosyVoice-300M` `CosyVoice-300M-SFT` `CosyVoice-300M-Instruct` model and `CosyVoice-ttsfrd` resource.

``` python
# SDK模型下载
# modelscope SDK model download
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
@@ -119,6 +121,15 @@ snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-3
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')

# for oversea users, huggingface SDK model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

Optionally, you can unzip `ttsfrd` resource and install `ttsfrd` package for better text normalization performance.
@@ -140,15 +151,19 @@ Follow the code in `example.py` for detailed usage of each model.
python example.py
```

#### CosyVoice2 vllm Usage
If you want to use vllm for inference, please install `vllm==v0.9.0`. Older vllm version do not support CosyVoice2 inference.
#### vLLM Usage
CosyVoice2/3 now supports **vLLM 0.11.x+ (V1 engine)** and **vLLM 0.9.0 (legacy)**.
Older vllm version(<0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are not tested.

Notice that `vllm==v0.9.0` has a lot of specific requirements, for example `torch==2.7.0`. You can create a new env to in case your hardward do not support vllm and old env is corrupted.
Notice that `vllm` has a lot of specific requirements. You can create a new env to in case your hardward do not support vllm and old env is corrupted.

``` sh
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
pip install vllm==v0.9.0 transformers==4.51.3 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm==0.9.0
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
```
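For reference (not part of this diff), the vLLM path is normally switched on when the model wrapper is constructed. The `load_vllm` flag below follows the existing CosyVoice2 API and may differ for the new `AutoModel` entry point, so treat this as a sketch and defer to `vllm_example.py`.

``` python
# a minimal sketch, assuming the CosyVoice2 constructor still exposes load_vllm;
# vllm_example.py in the repo is the authoritative reference
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, load_vllm=True, fp16=False)
```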
@@ -165,7 +180,7 @@ python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
#### Advanced Usage

For advanced users, we have provided training and inference scripts in `examples/libritts/cosyvoice/run.sh`.
For advanced users, we have provided training and inference scripts in `examples/libritts`.

#### Build for deployment
@@ -24,9 +24,7 @@ ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../..'.format(ROOT_DIR))
sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.cli.model import CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.class_utils import get_model_type

def get_args():
@@ -61,15 +59,7 @@ def main():
model = AutoModel(model_dir=args.model_dir)

if get_model_type(model.model) == CosyVoiceModel:
# 1. export flow encoder
flow_encoder = model.model.flow.encoder
script = get_optimized_script(flow_encoder)
script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export flow_encoder')
elif get_model_type(model.model) == CosyVoice2Model:
if model.__class__.__name__ == 'CosyVoice':
# 1. export llm text_encoder
llm_text_encoder = model.model.llm.text_encoder
script = get_optimized_script(llm_text_encoder)
@@ -93,6 +83,14 @@ def main():
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export flow_encoder')
elif model.__class__.__name__ == 'CosyVoice2':
# 1. export flow encoder
flow_encoder = model.model.flow.encoder
script = get_optimized_script(flow_encoder)
script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export flow_encoder')
else:
raise ValueError('unsupported model type')
@@ -89,6 +89,8 @@ class CosyVoice:
start_time = time.time()

def inference_zero_shot(self, tts_text, prompt_text, prompt_wav, zero_shot_spk_id='', stream=False, speed=1.0, text_frontend=True):
if self.__class__.__name__ == 'CosyVoice3' and '<|endofprompt|>' not in prompt_text + tts_text:
logging.warning('<|endofprompt|> not found in CosyVoice3 inference, check your input text')
prompt_text = self.frontend.text_normalize(prompt_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
if (not isinstance(i, Generator)) and len(i) < 0.5 * len(prompt_text):
@@ -114,7 +116,7 @@ class CosyVoice:
start_time = time.time()

def inference_instruct(self, tts_text, spk_id, instruct_text, stream=False, speed=1.0, text_frontend=True):
assert isinstance(self.model, CosyVoiceModel), 'inference_instruct is only implemented for CosyVoice!'
assert self.__class__.__name__ == 'CosyVoice', 'inference_instruct is only implemented for CosyVoice!'
instruct_text = self.frontend.text_normalize(instruct_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_instruct(i, spk_id, instruct_text)
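A note on the assert rewrite above: `isinstance(self.model, CosyVoiceModel)` also passes when `self.model` is a `CosyVoice2Model`/`CosyVoice3Model` instance, because those classes subclass `CosyVoiceModel`, so the old guard did not actually restrict `inference_instruct` to CosyVoice 1.0. Comparing the wrapper's `__class__.__name__` makes the check exact. A toy illustration:

``` python
class CosyVoiceModel: ...
class CosyVoice2Model(CosyVoiceModel): ...

m = CosyVoice2Model()
print(isinstance(m, CosyVoiceModel))             # True: subclasses pass an isinstance check
print(m.__class__.__name__ == 'CosyVoiceModel')  # False: the name comparison is exact
```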
@@ -20,18 +20,9 @@ import numpy as np
import whisper
from typing import Callable
import torchaudio.compliance.kaldi as kaldi
import torchaudio
import os
import re
import inflect
try:
import ttsfrd
use_ttsfrd = True
except ImportError:
print("failed to import ttsfrd, use wetext instead")
from wetext import Normalizer as ZhNormalizer
from wetext import Normalizer as EnNormalizer
use_ttsfrd = False
from cosyvoice.utils.file_utils import logging, load_wav
from cosyvoice.utils.frontend_utils import contains_chinese, replace_blank, replace_corner_mark, remove_bracket, spell_out_number, split_paragraph, is_only_punctuation
@@ -56,21 +47,33 @@ class CosyVoiceFrontEnd:
providers=["CUDAExecutionProvider" if torch.cuda.is_available() else
"CPUExecutionProvider"])
if os.path.exists(spk2info):
self.spk2info = torch.load(spk2info, map_location=self.device)
self.spk2info = torch.load(spk2info, map_location=self.device, weights_only=True)
else:
self.spk2info = {}
self.allowed_special = allowed_special
self.use_ttsfrd = use_ttsfrd
if self.use_ttsfrd:
self.inflect_parser = inflect.engine()
# NOTE compatible when no text frontend tool is avaliable
try:
import ttsfrd
self.frd = ttsfrd.TtsFrontendEngine()
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
assert self.frd.initialize('{}/../../pretrained_models/CosyVoice-ttsfrd/resource'.format(ROOT_DIR)) is True, \
'failed to initialize ttsfrd resource'
self.frd.set_lang_type('pinyinvg')
else:
self.zh_tn_model = ZhNormalizer(remove_erhua=False)
self.en_tn_model = EnNormalizer()
self.inflect_parser = inflect.engine()
self.text_frontend = 'ttsfrd'
logging.info('use ttsfrd frontend')
except:
try:
from wetext import Normalizer as ZhNormalizer
from wetext import Normalizer as EnNormalizer
self.zh_tn_model = ZhNormalizer(remove_erhua=False)
self.en_tn_model = EnNormalizer()
self.text_frontend = 'wetext'
logging.info('use wetext frontend')
except:
self.text_frontend = ''
logging.info('no frontend is avaliable')

def _extract_text_token(self, text):
if isinstance(text, Generator):
@@ -131,12 +134,13 @@ class CosyVoiceFrontEnd:
if text_frontend is False or text == '':
return [text] if split is True else text
text = text.strip()
if self.use_ttsfrd:
if self.text_frontend == 'ttsfrd':
texts = [i["text"] for i in json.loads(self.frd.do_voicegen_frd(text))["sentences"]]
text = ''.join(texts)
else:
if contains_chinese(text):
text = self.zh_tn_model.normalize(text)
if self.text_frontend == 'wetext':
text = self.zh_tn_model.normalize(text)
text = text.replace("\n", "")
text = replace_blank(text)
text = replace_corner_mark(text)
@@ -147,7 +151,8 @@ class CosyVoiceFrontEnd:
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "zh", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
else:
text = self.en_tn_model.normalize(text)
if self.text_frontend == 'wetext':
text = self.en_tn_model.normalize(text)
text = spell_out_number(text, self.inflect_parser)
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "en", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
@@ -178,7 +183,7 @@ class CosyVoiceFrontEnd:
'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
'llm_embedding': embedding, 'flow_embedding': embedding}
else:
model_input = self.spk2info[zero_shot_spk_id]
model_input = {**self.spk2info[zero_shot_spk_id]}
model_input['text'] = tts_text_token
model_input['text_len'] = tts_text_token_len
return model_input
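The switch to `{**self.spk2info[zero_shot_spk_id]}` above avoids aliasing: the subsequent `model_input['text'] = ...` assignments would otherwise be written into the cached speaker entry itself. A minimal illustration of the difference:

``` python
spk2info = {'spk_a': {'llm_embedding': 'emb'}}

# aliasing: mutating model_input also mutates the cached entry
model_input = spk2info['spk_a']
model_input['text'] = 'hello'
print('text' in spk2info['spk_a'])   # True - the cache is polluted

# shallow copy: the cached entry stays untouched
spk2info = {'spk_a': {'llm_embedding': 'emb'}}
model_input = {**spk2info['spk_a']}
model_input['text'] = 'hello'
print('text' in spk2info['spk_a'])   # False
```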
@@ -60,14 +60,15 @@ class CosyVoiceModel:
self.mel_overlap_dict = {}
self.flow_cache_dict = {}
self.hift_cache_dict = {}
self.silent_tokens = []

def load(self, llm_model, flow_model, hift_model):
self.llm.load_state_dict(torch.load(llm_model, map_location=self.device), strict=True)
self.llm.load_state_dict(torch.load(llm_model, map_location=self.device, weights_only=True), strict=True)
self.llm.to(self.device).eval()
self.flow.load_state_dict(torch.load(flow_model, map_location=self.device), strict=True)
self.flow.load_state_dict(torch.load(flow_model, map_location=self.device, weights_only=True), strict=True)
self.flow.to(self.device).eval()
# in case hift_model is a hifigan model
hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device).items()}
hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device, weights_only=True).items()}
self.hift.load_state_dict(hift_state_dict, strict=True)
self.hift.to(self.device).eval()
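The `weights_only=True` additions above opt into PyTorch's restricted unpickler, so checkpoints that contain anything beyond tensors and plain containers fail loudly instead of executing arbitrary pickled code. Roughly, assuming a torch version that supports the flag:

``` python
import torch

model = torch.nn.Linear(4, 4)                # stand-in module for this sketch
torch.save(model.state_dict(), 'demo.pt')    # hypothetical checkpoint path

# restricted unpickling: tensors and plain containers load, arbitrary objects raise
state_dict = torch.load('demo.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state_dict, strict=True)
```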
@@ -98,26 +99,33 @@ class CosyVoiceModel:
return {'min_shape': min_shape, 'opt_shape': opt_shape, 'max_shape': max_shape, 'input_names': input_names}

def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
cur_silent_token_num, max_silent_token_num = 0, 5
with self.llm_context, torch.cuda.amp.autocast(self.fp16 is True and hasattr(self.llm, 'vllm') is False):
if isinstance(text, Generator):
assert isinstance(self, CosyVoice2Model) and not hasattr(self.llm, 'vllm'), 'streaming input text is only implemented for CosyVoice2 and do not support vllm!'
for i in self.llm.inference_bistream(text=text,
assert (self.__class__.__name__ != 'CosyVoiceModel') and not hasattr(self.llm, 'vllm'), 'streaming input text is only implemented for CosyVoice2/3 and do not support vllm!'
token_generator = self.llm.inference_bistream(text=text,
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device))
else:
token_generator = self.llm.inference(text=text.to(self.device),
text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device)):
self.tts_speech_token_dict[uuid].append(i)
else:
for i in self.llm.inference(text=text.to(self.device),
text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device),
uuid=uuid):
self.tts_speech_token_dict[uuid].append(i)
embedding=llm_embedding.to(self.device),
uuid=uuid)
for i in token_generator:
if i in self.silent_tokens:
cur_silent_token_num += 1
if cur_silent_token_num > max_silent_token_num:
continue
else:
cur_silent_token_num = 0
self.tts_speech_token_dict[uuid].append(i)
self.llm_end_dict[uuid] = True

def vc_job(self, source_speech_token, uuid):
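Pulled out of the interleaved hunk above, the new decode loop threads every generated token through one small filter: consecutive IDs from `self.silent_tokens` are passed through until more than `max_silent_token_num` of them have accumulated, after which further silent tokens are dropped until a non-silent token resets the counter. A standalone sketch with made-up token IDs:

``` python
def suppress_silence(token_iter, silent_tokens, max_silent=5):
    """Yield tokens, dropping runs of silent/breath tokens longer than max_silent."""
    run = 0
    for tok in token_iter:
        if tok in silent_tokens:
            run += 1
            if run > max_silent:
                continue          # drop the excess silent token
        else:
            run = 0
        yield tok

# toy usage: 0 stands for a speech token, 1 for a silence token
stream = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
print(list(suppress_silence(stream, silent_tokens={1}, max_silent=5)))
# -> [0, 1, 1, 1, 1, 1, 0, 1]
```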
@@ -260,6 +268,7 @@ class CosyVoice2Model(CosyVoiceModel):
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.hift_cache_dict = {}
self.silent_tokens = []

def load_jit(self, flow_encoder_model):
flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
@@ -401,6 +410,8 @@ class CosyVoice3Model(CosyVoice2Model):
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.hift_cache_dict = {}
# FSQ silent and breath token
self.silent_tokens = [1, 2, 28, 29, 55, 248, 494, 2241, 2242, 2322, 2323]

def token2wav(self, token, prompt_token, prompt_feat, embedding, token_offset, uuid, stream=False, finalize=False, speed=1.0):
with torch.cuda.amp.autocast(self.fp16):
@@ -145,7 +145,11 @@ def Dataset(data_list_file,
shuffle=shuffle,
partition=partition)
# map partial arg to padding func
data_pipeline[-1] = partial(data_pipeline[-1], gan=gan, dpo=dpo)
for i in range(1, len(data_pipeline)):
if data_pipeline[i].func.__name__ == 'compute_fbank':
data_pipeline[i] = partial(data_pipeline[i], token_mel_ratio=0)
if data_pipeline[i].func.__name__ == 'padding':
data_pipeline[i] = partial(data_pipeline[i], gan=gan, dpo=dpo)
for func in data_pipeline:
dataset = Processor(dataset, func, mode=mode)
return dataset
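The loop above replaces the old "patch only the last stage" logic: it walks the whole pipeline and re-curries stages by the name of the wrapped function, so `compute_fbank` and `padding` each receive their extra arguments regardless of position. The same pattern in isolation (stage bodies are placeholders):

``` python
from functools import partial

def compute_fbank(sample, token_mel_ratio=1):
    return f'fbank(ratio={token_mel_ratio})'

def padding(sample, gan=False, dpo=False):
    return f'padding(gan={gan}, dpo={dpo})'

data_pipeline = [partial(compute_fbank), partial(padding)]

# re-curry stages by the name of the wrapped function, as the diff does
for i, stage in enumerate(data_pipeline):
    if stage.func.__name__ == 'compute_fbank':
        data_pipeline[i] = partial(stage, token_mel_ratio=0)
    if stage.func.__name__ == 'padding':
        data_pipeline[i] = partial(stage, gan=True, dpo=False)

print([fn(None) for fn in data_pipeline])
# ['fbank(ratio=0)', 'padding(gan=True, dpo=False)']
```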
@@ -26,7 +26,7 @@ import pyworld as pw
AUDIO_FORMAT_SETS = {'flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma'}

def parquet_opener(data, mode='train', tts_data={}):
def parquet_opener(data, mode='train'):
""" Give url or local file, return file descriptor
Inplace operation.
@@ -44,12 +44,8 @@ def parquet_opener(data, mode='train', tts_data={}):
df = df.to_pandas()
for i in range(len(df)):
sample.update(dict(df.loc[i]))
if mode == 'train':
# NOTE do not return sample directly, must initialize a new dict
yield {**sample}
else:
for index, text in enumerate(tts_data[df.loc[i, 'utt']]):
yield {**sample, 'tts_index': index, 'tts_text': text}
# NOTE do not return sample directly, must initialize a new dict
yield {**sample}
except Exception as ex:
logging.warning('Failed to open {}, ex info {}'.format(url, ex))
@@ -332,8 +332,9 @@ class CausalMaskedDiffWithDiT(torch.nn.Module):
token = self.input_embedding(torch.clamp(token, min=0)) * mask

# text encode
h, h_lengths = self.encoder(token, token_len, streaming=streaming)
h = self.encoder_proj(h)
h = self.pre_lookahead_layer(token)
h = h.repeat_interleave(self.token_mel_ratio, dim=1)
mask = mask.repeat_interleave(self.token_mel_ratio, dim=1).squeeze(dim=-1)

# get conditions
conds = torch.zeros(feat.shape, device=token.device)
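The `repeat_interleave(self.token_mel_ratio, dim=1)` call above is the token-to-mel upsampling step: each speech-token feature is duplicated along the time axis so the sequence length matches the mel frame rate before conditioning the decoder. For example (the ratio value here is illustrative):

``` python
import torch

token_mel_ratio = 2                      # illustrative: 2 mel frames per speech token
h = torch.randn(1, 5, 80)                # (batch, num_tokens, feat_dim)
h_up = h.repeat_interleave(token_mel_ratio, dim=1)
print(h_up.shape)                        # torch.Size([1, 10, 80])
```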
@@ -344,7 +345,6 @@ class CausalMaskedDiffWithDiT(torch.nn.Module):
conds[i, :index] = feat[i, :index]
conds = conds.transpose(1, 2)

mask = (~make_pad_mask(h_lengths.sum(dim=-1).squeeze(dim=1))).to(h)
loss, _ = self.decoder.compute_loss(
feat.transpose(1, 2).contiguous(),
mask.unsqueeze(1),
@@ -301,18 +301,23 @@ class Qwen2LM(TransformerLM):
self.stop_token_ids = [speech_token_size + i for i in range(3)]
self.vllm_output_queue = {}

def prepare_lm_input_target(self, sos_emb, text_token, text_token_emb, text_token_len, task_id_emb, speech_token, speech_token_emb, speech_token_len):
def prepare_lm_input_target(self, sos_emb, text_token, text_token_emb, text_token_len, task_id_emb, speech_token, speech_token_emb, speech_token_len, instruct_token=None, instruct_token_emb=None, instruct_token_len=None):
lm_target, lm_input = [], []
text_token = unpad_sequence(text_token, text_token_len.cpu(), batch_first=True)
speech_token = unpad_sequence(speech_token, speech_token_len.cpu(), batch_first=True)
text_token_emb = unpad_sequence(text_token_emb, text_token_len.cpu(), batch_first=True)
speech_token_emb = unpad_sequence(speech_token_emb, speech_token_len.cpu(), batch_first=True)
# NOTE add instruct_token in CosyVoice3
if instruct_token is not None and instruct_token_emb is not None and instruct_token_len is not None:
instruct_token = unpad_sequence(instruct_token, instruct_token_len.cpu(), batch_first=True)
instruct_token_emb = unpad_sequence(instruct_token_emb, instruct_token_len.cpu(), batch_first=True)
for i in range(len(text_token)):
# bistream sequence
if random.random() < 0.5 and speech_token_len[i] / text_token_len[i] > self.mix_ratio[1] / self.mix_ratio[0]:
this_lm_target, this_lm_input = [], []
this_lm_target.append(IGNORE_ID)
this_lm_input.append(sos_emb.squeeze(dim=0))
this_lm_target, this_lm_input = [IGNORE_ID], [sos_emb.squeeze(dim=0)]
if instruct_token is not None and instruct_token_emb is not None and instruct_token_len is not None:
this_lm_target += [IGNORE_ID] * instruct_token_len[i]
this_lm_input.append(instruct_token_emb[i])
for j in range(((text_token_len[i] + 1) / self.mix_ratio[0]).ceil().int().item()):
this_text_token = text_token[i][j * self.mix_ratio[0]: (j + 1) * self.mix_ratio[0]].tolist()
this_speech_token = speech_token[i][j * self.mix_ratio[1]: (j + 1) * self.mix_ratio[1]].tolist()
@@ -333,8 +338,8 @@ class Qwen2LM(TransformerLM):
this_lm_target, this_lm_input = torch.tensor(this_lm_target), torch.concat(this_lm_input, dim=0)
# unistream sequence
else:
this_lm_target = torch.tensor([IGNORE_ID] * (1 + text_token_len[i]) + speech_token[i].tolist() + [self.eos_token])
this_lm_input = torch.concat([sos_emb.squeeze(dim=0), text_token_emb[i], task_id_emb.squeeze(dim=0), speech_token_emb[i]], dim=0)
this_lm_target = torch.tensor([IGNORE_ID] * (1 + instruct_token_len[i] + text_token_len[i]) + speech_token[i].tolist() + [self.eos_token])
this_lm_input = torch.concat([sos_emb.squeeze(dim=0), instruct_token_emb[i], text_token_emb[i], task_id_emb.squeeze(dim=0), speech_token_emb[i]], dim=0)
lm_target.append(this_lm_target)
lm_input.append(this_lm_input)
lm_input_len = torch.tensor([i.size(0) for i in lm_input], dtype=torch.int32)
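In both branches above, the instruct tokens are prepended to the LM input but masked out of the target with `IGNORE_ID`, so the cross-entropy loss is only computed on the speech tokens and the EOS. A shape-level sketch (the ignore value and lengths are illustrative):

``` python
import torch

IGNORE_ID = -1                                   # illustrative ignore label
instruct_len, text_len = 3, 4
speech = torch.tensor([11, 12, 13])              # made-up speech token ids
eos = 99

# target: nothing is learned for <sos> + instruct + text, only speech + eos
lm_target = torch.tensor([IGNORE_ID] * (1 + instruct_len + text_len) + speech.tolist() + [eos])
print(lm_target)
# tensor([-1, -1, -1, -1, -1, -1, -1, -1, 11, 12, 13, 99])
```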
@@ -681,6 +686,7 @@ class CosyVoice3LM(Qwen2LM):
# 1. encode text_token
text_token_emb = self.llm.model.model.embed_tokens(text_token)
instruct_token_emb = self.llm.model.model.embed_tokens(instruct_token)

# 3. sos and task_id
sos_emb = self.speech_embedding.weight[self.sos].reshape(1, 1, -1)
@@ -691,14 +697,14 @@
# 3. prepare llm_input/target
lm_target, lm_input, lm_input_len = self.prepare_lm_input_target(sos_emb, text_token, text_token_emb, text_token_len, task_id_emb,
speech_token, speech_token_emb, speech_token_len)
speech_token, speech_token_emb, speech_token_len, instruct_token, instruct_token_emb, instruct_token_len)
lm_target = lm_target.to(device)

# 4. run lm forward
lm_output, lm_output_mask = self.llm(lm_input, lm_input_len.to(device))
logits = self.llm_decoder(lm_output)
loss = self.criterion_ce(logits, lm_target.to(device))
acc = th_accuracy(logits.view(-1, self.speech_token_size + 3), lm_target, ignore_label=IGNORE_ID)
acc = th_accuracy(logits.view(-1, self.speech_token_size + 200), lm_target, ignore_label=IGNORE_ID)
return {'loss': loss, 'acc': acc}

@torch.inference_mode()
@@ -53,7 +53,7 @@ def init_distributed(args):
def init_dataset_and_dataloader(args, configs, gan, dpo):
data_pipeline = configs['data_pipeline_gan'] if gan is True else configs['data_pipeline']
train_dataset = Dataset(args.train_data, data_pipeline=data_pipeline, mode='train', gan=gan, dpo=dpo, shuffle=True, partition=True)
cv_dataset = Dataset(args.cv_data, data_pipeline=data_pipeline, mode='train', gan=gan, dpo=dpo, shuffle=False, partition=False)
cv_dataset = Dataset(args.cv_data, data_pipeline=data_pipeline, mode='dev', gan=gan, dpo=dpo, shuffle=False, partition=False)

# do not use persistent_workers=True, as whisper tokenizer opens tiktoken file each time when the for loop starts
train_data_loader = DataLoader(train_dataset,
@@ -164,18 +164,18 @@ def init_optimizer_and_scheduler(args, configs, model, gan):
raise ValueError("unknown scheduler: " + configs['train_conf'])

if configs['train_conf']['optim_d'] == 'adam':
optimizer_d = optim.Adam(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf'])
optimizer_d = optim.Adam(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf_d'])
elif configs['train_conf']['optim_d'] == 'adamw':
optimizer_d = optim.AdamW(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf'])
optimizer_d = optim.AdamW(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf_d'])
else:
raise ValueError("unknown optimizer: " + configs['train_conf'])

if configs['train_conf']['scheduler_d'] == 'warmuplr':
scheduler_type = WarmupLR
scheduler_d = WarmupLR(optimizer_d, **configs['train_conf']['scheduler_conf'])
scheduler_d = WarmupLR(optimizer_d, **configs['train_conf']['scheduler_d'])
elif configs['train_conf']['scheduler_d'] == 'NoamHoldAnnealing':
scheduler_type = NoamHoldAnnealing
scheduler_d = NoamHoldAnnealing(optimizer_d, **configs['train_conf']['scheduler_conf'])
scheduler_d = NoamHoldAnnealing(optimizer_d, **configs['train_conf']['scheduler_d'])
elif configs['train_conf']['scheduler'] == 'constantlr':
scheduler_type = ConstantLR
scheduler_d = ConstantLR(optimizer_d)
@@ -23,6 +23,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Inference-only Qwen2 model compatible with HuggingFace weights."""
from typing import Optional
from packaging.version import parse as vparse
import vllm

# vLLM-0.11.0+ only support V1 engine
VLLM_V1_ENGINE_ONLY: bool = vparse(vllm.__version__) >= vparse("0.11.0")
if VLLM_V1_ENGINE_ONLY:
from vllm.v1.sample.metadata import SamplingMetadata

from vllm.model_executor.models.qwen2 import *
@@ -87,10 +96,14 @@ class CosyVoice2ForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
def compute_logits(
self,
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata,
sampling_metadata: Optional[SamplingMetadata] = None,
) -> Optional[torch.Tensor]:
logits = self.logits_processor(self.lm_head, hidden_states,
sampling_metadata, self.lm_head.bias)
if VLLM_V1_ENGINE_ONLY:
logits = self.logits_processor(self.lm_head, hidden_states,
self.lm_head.bias)
else:
logits = self.logits_processor(self.lm_head, hidden_states,
sampling_metadata, self.lm_head.bias)
return logits

def load_weights(self, weights: Iterable[tuple[str,
@@ -4,7 +4,7 @@ ARG VENV_NAME="cosyvoice"
ENV VENV=$VENV_NAME
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

ENV DEBIAN_FRONTEN=noninteractive
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
SHELL ["/bin/bash", "--login", "-c"]
@@ -18,7 +18,7 @@ def cosyvoice_example():
# zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav')):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# cross_lingual usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
# cross_lingual usage, <|zh|><|en|><|ja|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.',
'./asset/cross_lingual_prompt.wav')):
torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
@@ -40,11 +40,10 @@ def main():
with open('{}/spk2utt'.format(args.des_dir), 'w') as f:
for k, v in spk2utt.items():
f.write('{} {}\n'.format(k, ' '.join(v)))
if args.instruct is True:
if args.instruct != '':
with open('{}/instruct'.format(args.des_dir), 'w') as f:
for k, v in utt2text.items():
# NOTE in CosyVoice3, we add instruct in sequence
f.write('{} You are a helpful assistant.<|endofprompt|>\n'.format(k, v))
f.write('{} {}\n'.format(k, args.instruct))
return
@@ -55,8 +54,6 @@ if __name__ == "__main__":
parser.add_argument('--des_dir',
type=str)
parser.add_argument('--instruct',
action='store_true',
default=False,
help='create instruct file or not')
type=str)
args = parser.parse_args()
main()
@@ -66,7 +66,6 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
fi
cat data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > data/train.data.list
cat data/{dev-clean,dev-other}/parquet/data.list > data/dev.data.list
# NOTE will update llm/hift training later
for model in llm flow hifigan; do
torchrun --nnodes=1 --nproc_per_node=$num_gpus \
--rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
@@ -20,7 +20,7 @@ num_decoding_left_chunks: -1 # streaming inference flow decoder left chunk size,
# model params
# for all class/function included in this repo, we use !<name> or !<new> for intialization, so that user may find all corresponding class/function according to one single yaml.
# for system/third_party class/function, we do not require this.
llm: !new:cosyvoice.llm.llm.Qwen2LM
llm: !new:cosyvoice.llm.llm.CosyVoice3LM
    llm_input_size: !ref <llm_input_size>
    llm_output_size: !ref <llm_output_size>
    speech_token_size: 6561
@@ -35,8 +35,8 @@ llm: !new:cosyvoice.llm.llm.Qwen2LM
        win_size: 10
        tau_r: 0.1

flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
    input_size: 512
flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithDiT
    input_size: 80
    output_size: 80
    spk_embed_dim: !ref <spk_embed_dim>
    output_type: 'mel'
@@ -45,22 +45,10 @@ flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
    only_mask_loss: True
    token_mel_ratio: !ref <token_mel_ratio>
    pre_lookahead_len: 3
    encoder: !new:cosyvoice.transformer.upsample_encoder.UpsampleConformerEncoder
        output_size: 512
        attention_heads: 8
        linear_units: 2048
        num_blocks: 6
        dropout_rate: 0.1
        positional_dropout_rate: 0.1
        attention_dropout_rate: 0.1
        normalize_before: True
        input_layer: 'linear'
        pos_enc_layer_type: 'rel_pos_espnet'
        selfattention_layer_type: 'rel_selfattn'
        input_size: 512
        use_cnn_module: False
        macaron_style: False
        static_chunk_size: !ref <chunk_size>
    pre_lookahead_layer: !new:cosyvoice.transformer.upsample_encoder.PreLookaheadLayer
        in_channels: 80
        channels: 1024
        pre_lookahead_len: 3
    decoder: !new:cosyvoice.flow.flow_matching.CausalConditionalCFM
        in_channels: 240
        n_spks: 1
@@ -73,20 +61,20 @@ flow: !new:cosyvoice.flow.flow.CausalMaskedDiffWithXvec
            training_cfg_rate: 0.2
            inference_cfg_rate: 0.7
            reg_loss_type: 'l1'
        estimator: !new:cosyvoice.flow.decoder.CausalConditionalDecoder
            in_channels: 320
        estimator: !new:cosyvoice.flow.DiT.dit.DiT
            dim: 1024
            depth: 22
            heads: 16
            dim_head: 64
            ff_mult: 2
            mel_dim: 80
            mu_dim: 80
            spk_dim: 80
            out_channels: 80
            channels: [256]
            dropout: 0.0
            attention_head_dim: 64
            n_blocks: 4
            num_mid_blocks: 12
            num_heads: 8
            act_fn: 'gelu'
            static_chunk_size: !ref <chunk_size> * <token_mel_ratio>
            num_decoding_left_chunks: !ref <num_decoding_left_chunks>

hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
hift: !new:cosyvoice.hifigan.generator.CausalHiFTGenerator
    in_channels: 80
    base_channels: 512
    nb_harmonics: 8
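For readers less familiar with this config syntax: `!new:`, `!name:` and `!ref` are HyperPyYAML tags, so each entry above is instantiated (or referenced) as the named Python class when the YAML is loaded. A sketch of how such a config is typically consumed; the file path and the override key/value below are illustrative, not taken from this diff:

``` python
# assumes hyperpyyaml is installed; the real training/inference entry points
# pass their own overrides (e.g. the Qwen tokenizer path)
from hyperpyyaml import load_hyperpyyaml

with open('conf/cosyvoice3.yaml') as f:  # illustrative path
    configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': 'pretrained_models/Fun-CosyVoice3-0.5B'})

flow = configs['flow']   # an instantiated CausalMaskedDiffWithDiT module
hift = configs['hift']   # an instantiated CausalHiFTGenerator module
```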
@@ -105,7 +93,8 @@ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
    source_resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    lrelu_slope: 0.1
    audio_limit: 0.99
    f0_predictor: !new:cosyvoice.hifigan.f0_predictor.ConvRNNF0Predictor
    conv_pre_look_right: 4
    f0_predictor: !new:cosyvoice.hifigan.f0_predictor.CausalConvRNNF0Predictor
        num_class: 1
        in_channels: 80
        cond_channels: 512
@@ -134,6 +123,7 @@ parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:cosyvoice.tokenizer.tokenizer.get_qwen_tokenizer
    token_path: !ref <qwen_pretrain_path>
    skip_special_tokens: True
    version: cosyvoice3
allowed_special: 'all'
tokenize: !name:cosyvoice.dataset.processor.tokenize
    get_tokenizer: !ref <get_tokenizer>
@@ -146,7 +136,7 @@ filter: !name:cosyvoice.dataset.processor.filter
resample: !name:cosyvoice.dataset.processor.resample
    resample_rate: !ref <sample_rate>
truncate: !name:cosyvoice.dataset.processor.truncate
    truncate_length: 24480 # must be a multiplier of hop_size
    truncate_length: 24960 # must be a multiplier of hop_size and token_mel_ratio
feat_extractor: !name:matcha.utils.audio.mel_spectrogram
    n_fft: 1920
    num_mels: 80
@@ -154,7 +144,7 @@ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
    hop_size: 480
    win_size: 1920
    fmin: 0
    fmax: 8000
    fmax: null
    center: False
compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
    feat_extractor: !ref <feat_extractor>
@@ -231,4 +221,4 @@ train_conf_gan:
    grad_clip: 5
    accum_grad: 1 # in gan training, accum_grad must be 1
    log_interval: 100
    save_per_step: -1
    save_per_step: -1
@@ -7,7 +7,7 @@ stop_stage=3
data_url=www.openslr.org/resources/60
data_dir=/mnt/lyuxiang.lx/data/tts/openslr/libritts
pretrained_model_dir=../../../pretrained_models/CosyVoice3-0.5B
pretrained_model_dir=../../../pretrained_models/Fun-CosyVoice3-0.5B

if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "Data Download"
@@ -20,7 +20,8 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "Data preparation, prepare wav.scp/text/utt2spk/spk2utt"
for x in train-clean-100 train-clean-360 train-other-500 dev-clean dev-other test-clean test-other; do
mkdir -p data/$x
python local/prepare_data.py --src_dir $data_dir/LibriTTS/$x --des_dir data/$x --instruct
# NOTE in CosyVoice3, we add instruct in sequence
python local/prepare_data.py --src_dir $data_dir/LibriTTS/$x --des_dir data/$x --instruct "You are a helpful assistant.<|endofprompt|>"
done
fi
@@ -67,7 +68,6 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
fi
cat data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > data/train.data.list
cat data/{dev-clean,dev-other}/parquet/data.list > data/dev.data.list
# NOTE will update llm/hift training later
for model in llm flow hifigan; do
torchrun --nnodes=1 --nproc_per_node=$num_gpus \
--rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
@@ -17,6 +17,7 @@ lightning==2.2.4
matplotlib==3.7.5
modelscope==1.20.0
networkx==3.1
numpy==1.26.4
omegaconf==2.3.0
onnx==1.16.0
onnxruntime-gpu==1.18.0; sys_platform == 'linux'
@@ -35,6 +36,7 @@ tensorrt-cu12-libs==10.13.3.9; sys_platform == 'linux'
torch==2.3.1
torchaudio==2.3.1
transformers==4.51.3
x-transformers==2.11.24
uvicorn==0.30.0
wetext==0.0.4
wget==3.2
@@ -24,7 +24,7 @@ import numpy as np
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../../..'.format(ROOT_DIR))
sys.path.append('{}/../../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.utils.file_utils import load_wav

app = FastAPI()
@@ -88,14 +88,8 @@ if __name__ == '__main__':
default=50000)
parser.add_argument('--model_dir',
type=str,
default='iic/CosyVoice-300M',
default='iic/CosyVoice2-0.5B',
help='local path or modelscope repo id')
args = parser.parse_args()
try:
cosyvoice = CosyVoice(args.model_dir)
except Exception:
try:
cosyvoice = CosyVoice2(args.model_dir)
except Exception:
raise TypeError('no valid model_type!')
cosyvoice = AutoModel(model_dir=args.model_dir)
uvicorn.run(app, host="0.0.0.0", port=args.port)
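The server change above swaps the cascade of per-version constructors for the single `AutoModel` entry point added on this branch, which presumably inspects the checkpoint directory and returns the matching CosyVoice/CosyVoice2/CosyVoice3 wrapper. Typical client-side usage would then look roughly like this (paths mirror the repo's examples):

``` python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel

# one entry point instead of try/except over CosyVoice and CosyVoice2
cosyvoice = AutoModel(model_dir='pretrained_models/CosyVoice2-0.5B')
for i, out in enumerate(cosyvoice.inference_zero_shot('Text to synthesize.',
                                                      'Transcript of the prompt audio.',
                                                      './asset/zero_shot_prompt.wav')):
    print(i, out['tts_speech'].shape)
```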
@@ -25,7 +25,7 @@ import numpy as np
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../../..'.format(ROOT_DIR))
sys.path.append('{}/../../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.cli.cosyvoice import AutoModel

logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
@@ -33,13 +33,7 @@ logging.basicConfig(level=logging.DEBUG,
class CosyVoiceServiceImpl(cosyvoice_pb2_grpc.CosyVoiceServicer):
def __init__(self, args):
try:
self.cosyvoice = CosyVoice(args.model_dir, trt_concurrent=args.max_conc)
except Exception:
try:
self.cosyvoice = CosyVoice2(args.model_dir, trt_concurrent=args.max_conc)
except Exception:
raise TypeError('no valid model_type!')
self.cosyvoice = AutoModel(model_dir=args.model_dir)
logging.info('grpc service initialized')

def Inference(self, request, context):
@@ -90,7 +84,7 @@ if __name__ == '__main__':
default=4)
parser.add_argument('--model_dir',
type=str,
default='iic/CosyVoice-300M',
default='iic/CosyVoice2-0.5B',
help='local path or modelscope repo id')
args = parser.parse_args()
main()
@@ -28,6 +28,7 @@ import json
import os
import threading
import time
from uuid import uuid4

import numpy as np
import torch
@@ -364,6 +365,7 @@ class TritonPythonModel:
# Generate semantic tokens with LLM
generated_ids_iter = self.forward_llm(input_ids)

token2wav_request_id = request_id or str(uuid4())
if self.decoupled:
response_sender = request.get_response_sender()
@@ -392,7 +394,7 @@ class TritonPythonModel:
this_tts_speech_token = torch.tensor(this_tts_speech_token).unsqueeze(dim=0).to(torch.int32).to(self.device)

sub_tts_speech = self.forward_token2wav(
this_tts_speech_token, request_id, prompt_speech_tokens,
this_tts_speech_token, token2wav_request_id, prompt_speech_tokens,
prompt_speech_feat, prompt_spk_embedding, token_offset, False
)
@@ -427,7 +429,7 @@ class TritonPythonModel:
time.sleep(0.02)

this_tts_speech_token = torch.tensor(semantic_token_ids_arr).unsqueeze(dim=0).to(torch.int32).to(self.device)
sub_tts_speech = self.forward_token2wav(this_tts_speech_token, request_id, prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding, token_offset, True)
sub_tts_speech = self.forward_token2wav(this_tts_speech_token, token2wav_request_id, prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding, token_offset, True)
audio_tensor = pb_utils.Tensor.from_dlpack("waveform", to_dlpack(sub_tts_speech))
inference_response = pb_utils.InferenceResponse(output_tensors=[audio_tensor])
response_sender.send(inference_response)
@@ -441,7 +443,7 @@ class TritonPythonModel:
if generated_ids is None or len(generated_ids) == 0:
raise pb_utils.TritonModelException("Generated IDs is None or empty")

audio = self.forward_token2wav(generated_ids, request_id, prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding)
audio = self.forward_token2wav(generated_ids, token2wav_request_id, prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding)

# Prepare response
audio_tensor = pb_utils.Tensor.from_dlpack("waveform", to_dlpack(audio))
@@ -86,7 +86,7 @@ if __name__ == "__main__":
help='Use Direct Preference Optimization')
args = parser.parse_args()

utt2wav, utt2text, utt2spk = {}, {}, {}
utt2wav, utt2text, utt2spk, utt2instruct = {}, {}, {}, {}
with open('{}/wav.scp'.format(args.src_dir)) as f:
for l in f:
l = l.replace('\n', '').split()
webui.py (8 changed lines)
@@ -57,9 +57,6 @@ def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, pro
prompt_wav = None
# if instruct mode, please make sure that model is iic/CosyVoice-300M-Instruct and not cross_lingual mode
if mode_checkbox_group in ['自然语言控制']:
if cosyvoice.instruct is False:
gr.Warning('您正在使用自然语言控制模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M-Instruct模型'.format(args.model_dir))
yield (cosyvoice.sample_rate, default_data)
if instruct_text == '':
gr.Warning('您正在使用自然语言控制模式, 请输入instruct文本')
yield (cosyvoice.sample_rate, default_data)
@@ -67,9 +64,6 @@ def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, pro
gr.Info('您正在使用自然语言控制模式, prompt音频/prompt文本会被忽略')
# if cross_lingual mode, please make sure that model is iic/CosyVoice-300M and tts_text prompt_text are different language
if mode_checkbox_group in ['跨语种复刻']:
if cosyvoice.instruct is True:
gr.Warning('您正在使用跨语种复刻模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M模型'.format(args.model_dir))
yield (cosyvoice.sample_rate, default_data)
if instruct_text != '':
gr.Info('您正在使用跨语种复刻模式, instruct文本会被忽略')
if prompt_wav is None:
@@ -167,7 +161,7 @@ if __name__ == '__main__':
default=8000)
parser.add_argument('--model_dir',
type=str,
default='pretrained_models/CosyVoice3-0.5B',
default='pretrained_models/CosyVoice2-0.5B',
help='local path or modelscope repo id')
args = parser.parse_args()
cosyvoice = AutoModel(model_dir=args.model_dir)