
SenseVoice is an audio foundation model with audio understanding capabilities, covering automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and acoustic event classification (AEC) / acoustic event detection (AED).

SenseVoice

Core Features 🎯

SenseVoice focuses on high-accuracy multilingual speech recognition, emotion recognition, and audio event detection:

  • Multilingual recognition: trained on more than 400,000 hours of data, supports over 50 languages, and outperforms Whisper in recognition accuracy.
  • Rich transcription:
    • Strong emotion recognition, matching or exceeding the best current emotion recognition models on test data.
    • Sound event detection, covering common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing.
  • Efficient inference: the SenseVoice-Small model uses a non-autoregressive end-to-end architecture with very low latency; 10 s of audio takes only 70 ms to process, 15× faster than Whisper-Large.
  • Convenient fine-tuning: ready-made fine-tuning scripts and strategies make it easy to fix long-tail sample issues for specific business scenarios.
  • Service deployment: a complete service deployment pipeline supporting concurrent requests, with client-side code for Python, C++, HTML, Java, C#, and more.

Architecture

  • Speech recognition (ASR)
  • Language identification (LID)
  • Speech emotion recognition (SER)
  • Audio event detection (AED, e.g. laughter, applause, background music, coughing)
  • Inverse text normalization (ITN)
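
These capabilities show up as special tokens prefixed to the raw transcription (language, emotion, event, and ITN tags, as seen in the WebUI log further below). A minimal sketch of turning such raw output into plain text with the funasr post-processor used throughout this post:

from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Raw SenseVoice output: tag tokens precede the recognized text
# (this string is taken from the WebUI log below).
raw = "<|zh|><|NEUTRAL|><|Speech|><|withitn|>嗨你好,今天天气如何?"

# Prints the plain text, with tags removed or rendered as emoji.
print(rich_transcription_postprocess(raw))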

Installation

Clone the repository

git clone https://github.com/FunAudioLLM/SenseVoice
cd SenseVoice

Create and activate a Conda environment

# ❌ conda create -n funasr python=3.13.0 -y
conda create -n funasr python=3.10.9 -y
conda activate funasr
| Python version | PyTorch support         |
|----------------|-------------------------|
| 3.10           | Supported               |
| 3.11           | Supported               |
| 3.12           | Partial (early support) |
| 3.13           | ❌ Not yet supported    |

Install dependencies

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install gradio "fastapi[standard]" funasr_onnx \
    -i https://pypi.tuna.tsinghua.edu.cn/simple
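
A quick sanity check that the environment matches the support table above (this assumes funasr exposes __version__, which the version line in the WebUI log below suggests):

python -c "import sys, torch, funasr; print(sys.version, torch.__version__, funasr.__version__)"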

Run

Run the API service

# macOS Silicon
export SENSEVOICE_DEVICE=mps
fastapi run --port 50000
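
Once the service is up, it can be exercised from another shell. The route and form fields below are assumptions based on the repo's api.py; verify them against the source before relying on this:

# Hypothetical request; check api.py for the actual endpoint definition.
curl -X POST "http://127.0.0.1:50000/api/v1/asr" \
    -F "files=@wav/all2.wav" \
    -F "keys=all2" \
    -F "lang=auto"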

Run the WebUI

python webui.py
You are using the latest version of funasr-1.2.7
Downloading Model from https://www.modelscope.cn to directory: /Users/junjian/.cache/modelscope/hub/models/iic/SenseVoiceSmall
Loading remote code successfully: model
Downloading Model from https://www.modelscope.cn to directory: /Users/junjian/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch
* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
audio_fs: 44100
language: zh, merge_vad: True
rtf_avg: 0.009: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.79it/s]
rtf_avg: 0.125: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.45it/s]
rtf_avg: 0.125, time_speech:  5.520, time_escape: 0.692: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.44it/s]
[{'key': 'rand_key_W4Acq9GFz6Y1t', 'text': '<|zh|><|NEUTRAL|><|Speech|><|withitn|>嗨你好,今天天气如何?'}]
嗨你好,今天天气如何?

Inference

| Engine           | Device | Time (s) | Audio duration (s) |
|------------------|--------|----------|--------------------|
| funasr 🚀        | mps    | 0.47     | 72                 |
| funasr           | cpu    | 1.7      | 72                 |
| direct           | mps    | 0.58     | 72                 |
| direct           | cpu    | 1.82     | 72                 |
| onnx             | cpu    | 4        | 72                 |
| onnx (quantized) | cpu    | 2.5      | 72                 |

Installing onnxruntime-silicon did not yield a working setup, so mps was not benchmarked for ONNX.
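
For reference, the real-time factor (RTF) computed in the scripts below is end-to-end inference time divided by audio duration; for funasr on mps that is 0.47 / 72 ≈ 0.0065, i.e. roughly 150× faster than real time.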

funasr inference

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    device="mps",
)

# ----------- Measure E2E inference time -----------

import time
import soundfile as sf

wav = "wav/all2.wav"

# Compute the audio duration
audio, sr = sf.read(wav)
audio_dur = len(audio) / sr

t0 = time.perf_counter()

res = model.generate(
    input=wav,
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
)

t1 = time.perf_counter()
e2e_time = t1 - t0
rtf = e2e_time / audio_dur

print("----- ASR Benchmark -----")
print(f"Audio duration: {audio_dur:.3f} s")
print(f"🚀🚀🚀 E2E inference time: {e2e_time:.3f} s")
print(f"RTF: {rtf:.4f}")

text = rich_transcription_postprocess(res[0]["text"])
print(text)

Direct inference

from model import SenseVoiceSmall
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="mps")
m.eval()

# ----------- Measure E2E inference time -----------

import time
import soundfile as sf

wav = "wav/all2.wav"

# Compute the audio duration
audio, sr = sf.read(wav)
audio_dur = len(audio) / sr

t0 = time.perf_counter()

res = m.inference(
    data_in=wav,
    language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=False,
    ban_emo_unk=False,
    **kwargs,
)

t1 = time.perf_counter()
e2e_time = t1 - t0
rtf = e2e_time / audio_dur

print("----- ASR Benchmark -----")
print(f"Audio duration: {audio_dur:.3f} s")
print(f"🚀🚀🚀 E2E inference time: {e2e_time:.3f} s")
print(f"RTF: {rtf:.4f}")

text = rich_transcription_postprocess(res[0][0]["text"])
print(text)
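
Note that SenseVoiceSmall.inference returns a nested list, hence the res[0][0] indexing here, whereas AutoModel.generate above returns a flat list of result dicts.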

ONNX inference

from pathlib import Path
from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True, device="cpu")

# ----------- Measure E2E inference time -----------

import time
import soundfile as sf

wav = "wav/all2.wav"

# Compute the audio duration
audio, sr = sf.read(wav)
audio_dur = len(audio) / sr

t0 = time.perf_counter()

res = model(wav, language="auto", use_itn=True)

t1 = time.perf_counter()
e2e_time = t1 - t0
rtf = e2e_time / audio_dur

print("----- ASR Benchmark -----")
print(f"Audio duration: {audio_dur:.3f} s")
print(f"🚀🚀🚀 E2E inference time: {e2e_time:.3f} s")
print(f"RTF: {rtf:.4f}")

print([rich_transcription_postprocess(i) for i in res])
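
With quantize=True the quantized ONNX weights are loaded, which accounts for the drop from 4 s to 2.5 s in the benchmark table above.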

Libtorch inference

❌ Installing with pip install funasr_torch did not produce a working package; installing from source is worth trying instead.
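
One possible source install, assuming the package lives under runtime/python/libtorch in the FunASR repository (verify the path before use):

git clone https://github.com/modelscope/FunASR.git
cd FunASR/runtime/python/libtorch
pip install -e .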

from pathlib import Path
from funasr_torch import SenseVoiceSmall
from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, device="mps")

# ----------- Measure E2E inference time -----------

import time
import soundfile as sf

wav = "wav/all2.wav"

# Compute the audio duration
audio, sr = sf.read(wav)
audio_dur = len(audio) / sr

t0 = time.perf_counter()

res = model(wav, language="auto", use_itn=True)

t1 = time.perf_counter()
e2e_time = t1 - t0
rtf = e2e_time / audio_dur

print("----- ASR Benchmark -----")
print(f"Audio duration: {audio_dur:.3f} s")
print(f"🚀🚀🚀 E2E inference time: {e2e_time:.3f} s")
print(f"RTF: {rtf:.4f}")

print([rich_transcription_postprocess(i) for i in res])

Inference with VAD (offline)
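
For long audio, SenseVoice can be paired with the FSMN VAD model to segment speech first. In the snippet below, max_single_segment_time caps each VAD segment (in ms), and merge_vad / merge_length_s merge short segments back together (in s) before recognition.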

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "/Users/junjian/.cache/modelscope/hub/models/iic/SenseVoiceSmall"
vad_model_dir = "/Users/junjian/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch"

model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model=vad_model_dir,
    vad_kwargs={"max_single_segment_time": 30000},
    device="mps",
    disable_update=True
)

res = model.generate(
    input="wav/all2.wav",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)

text = rich_transcription_postprocess(res[0]["text"])
print(text)
