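[Faster-Whisper] Offline speech recognition for local videos and subtitle generation
- 1 Preface
- 2 Tools
- 2.1 ffmpeg media converter
- 2.1.1 Background
- Introduction
- Documentation
- 2.1.2 Installation
- Windows installation
- Python installation
- 2.1.3 Inspecting files
- Viewing the format and codecs of audio/video files
- 2.1.4 Video processing
- Converting video formats
- Setting the video bitrate
- Cutting a video
- 2.1.5 Audio processing
- Extracting audio from video
- Converting audio formats
- GPU acceleration
- 2.2 faster-whisper speech recognition model
- 2.2.1 Background
- Reference documents
- Model comparison
- 2.2.2 Installation
- Windows installation
- Python installation
- Model download
- 2.2.3 whisper-faster parameters
- All parameters
- All parameters (translated)
- Performance tuning
- Subtitle length
- 2.2.4 Examples
- small
- large-v2
- large-v2 on a whole folder
- Watching progress
- 2.2.5 Errors
- CUDA Out of Memory
- 3 Hands-on demo
- 3.1 Windows-only demo
- Full workflow
- Test machine environment
- Extracting audio from video with ffmpeg
- Generating subtitles with faster-whisper
- 3.2 Supplement
- Batch-extracting audio with Python
- 4 Summary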
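1 Preface
The videos I download for offline study usually come without subtitles. I happened to notice PotPlayer's "generate subtitles from audio" feature, which uses the faster-whisper model, so I decided to use that model on its own.
Faster-Whisper is a speech recognition model that converts audio to text. I also need a tool that extracts the audio track from a video: ffmpeg.
So the plan is to extract the audio from the video with ffmpeg first, then hand the audio to Faster-Whisper to turn it into subtitles.
For tool installation, see 2 Tools (I work on Windows, so the Windows builds are enough; if you want to drive the tools from Python, install the Python packages as well).
For a demo of generating subtitles for a video, see 3 Hands-on demo.
Full workflow:
- Install ffmpeg
- Download faster-whisper
- Download a faster-whisper model
- Extract the audio from the video with ffmpeg
- Run faster-whisper with the chosen model to recognize the speech and generate subtitles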
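The last two steps of the workflow can be sketched as two command lines built in order. This is only an illustration under a few assumptions: the file names and the helper names (`build_extract_cmd`, `build_transcribe_cmd`, `plan`, `run_all`) are made up for this example, and `zh` is just a sample language code. The ffmpeg options and the whisper-faster.exe flags are the ones used or listed later in this article.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(video: Path, audio: Path) -> list[str]:
    """ffmpeg command that drops the video stream (-vn) and writes MP3 audio."""
    return ["ffmpeg", "-i", str(video), "-vn", "-acodec", "mp3", str(audio)]


def build_transcribe_cmd(exe: Path, audio: Path, model: str, language: str) -> list[str]:
    """whisper-faster.exe command; each flag used here appears in its -h output."""
    return [
        str(exe),
        "--model", model,
        "--language", language,
        "--output_format", "srt",
        "--output_dir", "source",  # 'source' = write next to the media file
        str(audio),
    ]


def plan(video: Path, exe: Path, model: str = "small", language: str = "zh") -> list[list[str]]:
    """Return the two commands in the order they should run."""
    audio = video.with_suffix(".mp3")  # hypothetical naming: same stem as the video
    return [
        build_extract_cmd(video, audio),
        build_transcribe_cmd(exe, audio, model, language),
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command and stop at the first failure (needs both tools installed)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

To actually run it, call something like `run_all(plan(Path("lecture.ts"), Path("whisper-faster.exe")))`; building the commands separately from running them makes it easy to print and check them first.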
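2 Tools
2.1 ffmpeg media converter
2.1.1 Background
Introduction
ffmpeg is a universal media converter.
It can read a wide variety of inputs, including live capture/recording devices, filter them, and transcode them into many output formats.
FFmpeg is a cross-platform, open-source multimedia framework for recording, converting, and streaming audio and video. It supports almost all mainstream audio and video formats (both codecs and container formats) and provides a rich set of filters, effects, and processing functions. It is widely used for video editing, streaming services, format conversion, and audio/video analysis.
Documentation
Official
ffmpeg documentation
Third-party
完整的 FFmpeg 命令使用教程_ffmpeg使用教程-CSDN博客 (a complete tutorial on FFmpeg commands)
FFmpeg教程(超级详细版) - 个人文章 - SegmentFault 思否 (a very detailed FFmpeg tutorial)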
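2.1.2 Installation
Windows installation
- Download from the official site: Download FFmpeg
- Direct link to the builds page: Builds - CODEX FFMPEG @ gyan.dev
- Unzip the archive after downloading
- Add the bin directory to the PATH environment variable: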
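- GUI: add the entry by hand in the environment variables dialog
- Command line:
- CMD (run as administrator):
setx PATH "%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin" /M
setx: command-line tool for setting environment variables.
PATH: name of the environment variable to set.
"%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin": the new value. %PATH% expands to the current PATH, and D:\ffmpeg-2025-05-12-git-full_build\bin is the directory being appended.
/M: modify the system environment variable (applies to all users). Without it, only the current user's variable is changed.
- Caution: setx truncates values longer than 1024 characters, and in CMD %PATH% expands to the merged system and user PATH, so record the current value before running the command.
- PowerShell: the equivalent command is more involved, so I did not note it down; CMD is simpler. Do not run the CMD line above in PowerShell: PowerShell does not expand %PATH%, so the existing PATH gets overwritten!
- CMD is CMD and PowerShell is PowerShell; the syntax differs, so do not mix them up.
# Session log
PS C:\Users\h1369> echo %Path%
PS C:\Users\h1369> setx PATH "%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin" /M
SUCCESS: Specified value was saved.
PS C:\Users\h1369>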
Open a new command prompt and verify:
ffmpeg -version
# Session log
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS C:\Users\h1369> ffmpeg -version
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-lcms2 --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-libdvdnav --enable-libdvdread --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libopenjpeg --enable-libquirc --enable-libuavs3d --enable-libxevd --enable-libzvbi --enable-libqrencode --enable-librav1e --enable-libsvtav1 --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxeve --enable-libxvid --enable-libaom --enable-libjxl --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-liblc3 --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
libavutil 60. 2.100 / 60. 2.100
libavcodec 62. 3.101 / 62. 3.101
libavformat 62. 0.102 / 62. 0.102
libavdevice 62. 0.100 / 62. 0.100
libavfilter 11. 0.100 / 11. 0.100
libswscale 9. 0.100 / 9. 0.100
libswresample 6. 0.100 / 6. 0.100
Exiting with exit code 0
PS C:\Users\h1369>
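The same check can be done from a script before any processing starts. A minimal sketch, assuming only the Python standard library; the helper names `find_tool` and `missing_tools` are invented for this example.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


def missing_tools(names: list[str]) -> list[str]:
    """List the tools from `names` that cannot be found on PATH."""
    return [name for name in names if find_tool(name) is None]
```

For example, `missing_tools(["ffmpeg", "ffprobe"])` returns an empty list once the bin directory has been added to PATH and a new terminal has been opened.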
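Python installation
pip install ffmpeg-python
Note: ffmpeg-python is only a wrapper that calls the ffmpeg executable, so ffmpeg itself must still be installed and on PATH.
Python example
import ffmpeg

# Paths of the source video and the audio file to write
video_file = r"D:\BaiduNetdiskDownload\数据迁移原理.ts"
audio_file = r"D:\BaiduNetdiskDownload\数据迁移原理.wav"

# Step 1: convert the .ts video to an audio file with ffmpeg-python
# Create the ffmpeg input stream
input_stream = ffmpeg.input(video_file)
# Output options: 16-bit PCM, 44.1 kHz sample rate, 2 channels
output_stream = input_stream.output(audio_file, acodec="pcm_s16le", ar=44100, ac=2)
# Run the conversion
output_stream.run()
print("Converted the .ts video to an audio file")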
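2.1.3 Inspecting files
Viewing the format and codecs of audio/video files
Both methods show:
- Video codec, color space, resolution, frame rate
- Audio codec, sample rate, channels, audio bitrate
Method 1: ffprobe (detailed)
ffprobe -i filename
Method 2: ffmpeg (concise)
ffmpeg -i filename
With no output file, ffmpeg prints the stream summary and then ends with "At least one output file must be specified"; that message can be ignored here.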
# Annotated output
PS C:\Users\h1369> ffprobe -i "D:\BaiduNetdiskDownload\HyperCDP技术.ts"
# FFmpeg version and build information
ffprobe version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2007-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads ...
libavutil 60. 2.100 / 60. 2.100
libavcodec 62. 3.101 / 62. 3.101
libavformat 62. 0.102 / 62. 0.102
libavdevice 62. 0.100 / 62. 0.100
libavfilter 11. 0.100 / 11. 0.100
libswscale 9. 0.100 / 9. 0.100
libswresample 6. 0.100 / 6. 0.100
# Basic information about the media file
# mpegts: the container format (MPEG-2 transport stream, common in live streaming and broadcast TV)
Input #0, mpegts, from 'D:\BaiduNetdiskDownload\HyperCDP技术.ts':
# Duration: total length; start: start time; bitrate: overall bitrate
Duration: 03:01:03.65, start: 1.513556, bitrate: 558 kb/s
Program 1
Metadata:
service_name : Service01
service_provider: FFmpeg
# Video stream
# Video: codec; yuv420p: color space; 1728x1080: resolution; 25 fps: frame rate; 90k tbn: time base
Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/bt709/iec61966-2-1, progressive), 1728x1080 [SAR 1:1 DAR 8:5], 25 fps, 25 tbr, 90k tbn, Start 1.560000
# Audio stream
# Audio: codec; 44100 Hz: sample rate; stereo: channels; 153 kb/s: audio bitrate
Stream #0:1[0x101](und): Audio: aac (LC) ([15][0][0][0] / 0x000F), 44100 Hz, stereo, fltp, 153 kb/s, Start 1.513556
PS C:\Users\h1369>
PS C:\Users\h1369> ffmpeg -i "D:\BaiduNetdiskDownload\HyperCDP技术.ts"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect ...
libavutil 60. 2.100 / 60. 2.100
libavcodec 62. 3.101 / 62. 3.101
libavformat 62. 0.102 / 62. 0.102
libavdevice 62. 0.100 / 62. 0.100
libavfilter 11. 0.100 / 11. 0.100
libswscale 9. 0.100 / 9. 0.100
libswresample 6. 0.100 / 6. 0.100
Input #0, mpegts, from 'D:\BaiduNetdiskDownload\HyperCDP技术.ts':
Duration: 03:01:03.65, start: 1.513556, bitrate: 558 kb/s
Program 1
Metadata:
service_name : Service01
service_provider: FFmpeg
# Video stream
Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/bt709/iec61966-2-1, progressive), 1728x1080 [SAR 1:1 DAR 8:5], 25 fps, 25 tbr, 90k tbn, Start 1.560000
# Audio stream
Stream #0:1[0x101](und): Audio: aac (LC) ([15][0][0][0] / 0x000F), 44100 Hz, stereo, fltp, 153 kb/s, Start 1.513556
At least one output file must be specified
PS C:\Users\h1369>
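When the same information is needed inside a script, ffprobe can emit JSON instead of the human-readable text above. The sketch below is one possible way to do it, not part of the original workflow: `probe_json` and `summarize` are names invented here, and the summary format is arbitrary. The ffprobe options (`-v error`, `-show_format`, `-show_streams`, `-of json`) are standard ffprobe options.

```python
import json
import subprocess


def probe_json(path: str) -> str:
    """Run ffprobe on `path` and return its JSON report (requires ffprobe on PATH)."""
    cmd = ["ffprobe", "-v", "error", "-show_format", "-show_streams", "-of", "json", path]
    result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", check=True)
    return result.stdout


def summarize(report: str) -> list[str]:
    """One line per stream: type, codec, and the fields that matter for that type."""
    lines = []
    for stream in json.loads(report).get("streams", []):
        kind = stream.get("codec_type")
        codec = stream.get("codec_name")
        if kind == "video":
            lines.append(f"video {codec} {stream.get('width')}x{stream.get('height')} {stream.get('pix_fmt')}")
        elif kind == "audio":
            lines.append(f"audio {codec} {stream.get('sample_rate')} Hz {stream.get('channels')} ch")
    return lines
```

Typical use would be `summarize(probe_json(r"D:\BaiduNetdiskDownload\HyperCDP技术.ts"))`, which returns one short line for the video stream and one for the audio stream.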
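2.1.4 Video processing
Converting video formats
Example: .ts to MP4
Method 1: let ffmpeg choose its default encoders for the output container
ffmpeg -i input.ts output.mp4
ffmpeg -i "HyperCDP技术.ts" "HyperCDP技术.mp4"
Method 2: compatible conversion (explicitly re-encode to standard MP4 codecs)
ffmpeg -i input.ts -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4
# Parameters
-c:v libx264 # re-encode the video as H.264 (the most widely compatible video codec for MP4)
-crf 23 # quality control (default 23; lower values give better quality and larger files)
-c:a aac # re-encode the audio as AAC (the standard MP4 audio codec)
-b:a 128k # set the audio bitrate to 128 kbps (a balance between quality and size)
# Optional extras
-progress pipe:1 # write machine-readable progress to stdout during the conversion
-bsf:v h264_mp4toannexb # bitstream filter that converts H.264 from the MP4 layout to Annex B; it is needed for MP4 to .ts, not for .ts to MP4
-copyts # keep the original input timestamps instead of resetting them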
Setting the video bitrate
# Set the video bitrate of the output file to 64 kbit/s:
ffmpeg -i input.avi -b:v 64k -bufsize 64k output.mp4
Cutting a video
FFmpeg can also cut videos.
For example, to extract the segment between 00:00:30 and 00:00:50:
ffmpeg -ss 00:00:30 -to 00:00:50 -i input.mp4 output.mp4
ffmpeg -ss 00:00:30 -to 00:00:50 -i "[新闻30分]国内简讯-1.mp4" "[新闻30分]国内简讯-裁剪后.mp4"
# Parameters
-ss 00:00:30 # start cutting at 00:00:30
-to 00:00:50 # stop cutting at 00:00:50
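When cutting many clips from a script, it helps to validate the time range before calling ffmpeg. A small sketch under the assumption that timestamps are written as HH:MM:SS (optionally with a fractional part); `to_seconds` and `build_cut_cmd` are hypothetical helper names, and the resulting command is the same `-ss ... -to ... -i` form shown above.

```python
def to_seconds(timestamp: str) -> float:
    """Convert 'HH:MM:SS' or 'HH:MM:SS.mmm' to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def build_cut_cmd(src: str, dst: str, start: str, end: str) -> list[str]:
    """Build the ffmpeg cut command; reject empty or reversed ranges early."""
    if to_seconds(end) <= to_seconds(start):
        raise ValueError(f"end ({end}) must be later than start ({start})")
    return ["ffmpeg", "-ss", start, "-to", end, "-i", src, dst]
```

Passing the command as a list to `subprocess.run` also sidesteps shell quoting problems with file names that contain brackets or spaces.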
2.1.5 Audio processing
Extracting audio from video
ffmpeg -i input_video -vn -acodec mp3 output_audio
ffmpeg -i "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
-vn: disable the video stream and keep only the audio.
-acodec mp3: encode the audio as MP3.
PS D:\Users\Desktop\新建文件夹> ffmpeg -i "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl ...
libavutil 60. 2.100 / 60. 2.100
libavcodec 62. 3.101 / 62. 3.101
libavformat 62. 0.102 / 62. 0.102
libavdevice 62. 0.100 / 62. 0.100
libavfilter 11. 0.100 / 11. 0.100
libswscale 9. 0.100 / 9. 0.100
libswresample 6. 0.100 / 6. 0.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4':
Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: mp42isom
Duration: 00:01:42.17, start: 0.000000, bitrate: 2680 kb/s
Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
Metadata:
vendor_id : [0][0][0][0]
Stream #0:1[0x2](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1916x1076 [SAR 1:1 DAR 479:269], 2485 kb/s, 29.93 fps, 30 tbr, 100k tbn (default)
Metadata:
vendor_id : [0][0][0][0]
encoder : JVT/AVC Coding
Stream mapping:
Stream #0:0 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3':
Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: mp42isom
TSSE : Lavf62.0.102
Stream #0:0(und): Audio: mp3, 48000 Hz, stereo, fltp (default)
Metadata:
encoder : Lavc62.3.101 libmp3lame
vendor_id : [0][0][0][0]
[out#0/mp3 @ 00000269a61c48c0] video:0KiB audio:1598KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.020112%
size= 1598KiB time=00:01:42.17 bitrate= 128.1kbits/s speed=81.4x elapsed=0:00:01.25
PS D:\Users\Desktop\新建文件夹>
PS D:\Users\Desktop\新建文件夹> ls
Directory: D:\Users\Desktop\新建文件夹
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/6/18 10:54 1636169 [新闻30分]国内简讯-1.mp3
-a---- 2025/6/18 10:18 34232134 [新闻30分]国内简讯-1.mp4
-a---- 2025/6/18 10:22 35120886 [新闻30分]国内简讯-2.mp4
PS D:\Users\Desktop\新建文件夹>
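The final status line of the log ties three numbers together: size, duration, and bitrate. As a rough check, size is approximately bitrate times duration divided by 8. The sketch below redoes that arithmetic with the values from the log; `estimate_size_kib` is a name made up for this example, and it ignores container overhead (reported above as about 0.02%).

```python
def estimate_size_kib(bitrate_kbit_s: float, duration_s: float) -> float:
    """Estimated stream size in KiB; ffmpeg's kbits/s means 1000 bits per second."""
    total_bits = bitrate_kbit_s * 1000 * duration_s
    return total_bits / 8 / 1024


# Values from the log: 128.1 kbits/s over 00:01:42.17 (102.17 s)
print(round(estimate_size_kib(128.1, 102.17)))  # → 1598
```

The result matches the `size= 1598KiB` reported by ffmpeg, and the same formula can be used in reverse to pick a bitrate for a target file size.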
Converting audio formats
This works the same way as video format conversion:
ffmpeg -i input.ts output.mp3
GPU acceleration
Reference: ffmpeg 硬件加速视频转码指南 - afnewiung - 博客园 (a guide to hardware-accelerated video transcoding with ffmpeg)
ffmpeg -hwaccels lists the acceleration methods compiled into this build; whether a method actually works depends on the GPU and driver of the machine.
C:\Users\HN>ffmpeg -hwaccels
ffmpeg version 2025-05-15-git-12b853530a-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev4, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect ...
libswresample 6. 0.100 / 6. 0.100
Hardware acceleration methods:
cuda
vaapi
dxva2
qsv
d3d11va
opencl
vulkan
d3d12va
amf
C:\Users\HN>
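A script can read that list and decide whether to request GPU encoding. The sketch below is an assumption-laden illustration: it presumes an NVIDIA GPU when `cuda` is listed and then uses the `h264_nvenc` encoder, falling back to software `libx264` otherwise. `parse_hwaccels` and `build_transcode_cmd` are names invented for this example.

```python
def parse_hwaccels(output: str) -> list[str]:
    """Extract the method names listed after the 'Hardware acceleration methods:' header."""
    lines = output.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "Hardware acceleration methods:":
            return [item.strip() for item in lines[index + 1:] if item.strip()]
    return []


def build_transcode_cmd(src: str, dst: str, methods: list[str]) -> list[str]:
    """Use CUDA decoding plus NVENC when 'cuda' is listed, otherwise software x264."""
    if "cuda" in methods:
        # Assumption: an NVIDIA GPU with a working driver is present.
        return ["ffmpeg", "-hwaccel", "cuda", "-i", src, "-c:v", "h264_nvenc", "-c:a", "aac", dst]
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-c:a", "aac", dst]
```

Feed `parse_hwaccels` the text captured from `ffmpeg -hide_banner -hwaccels`; if the GPU path fails at run time, rerunning with an empty method list gives the software command.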
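2.2 faster-whisper speech recognition model
2.2.1 Background
Reference documents
基于OpenAI的Whisper构建的高效语音识别模型:faster-whisper-CSDN博客 (an efficient speech recognition model built on OpenAI's Whisper: faster-whisper)
Model comparison
(AI-generated; treat the accuracy and speed columns as rough qualitative guidance)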
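Model | Accuracy | Speed | Parameters | Languages | Typical use |
---|---|---|---|---|---|
tiny | Lower | Very fast (faster than base) | 39M | Multilingual | Real-time translation, low-resource devices, quick transcription |
tiny.en | Lower (English) | Very fast (faster than base.en) | 39M | English | English-only, when speed matters most |
base | Medium | Fast (faster than small) | 74M | Multilingual | Everyday use, balance of accuracy and speed |
base.en | Medium (English) | Fast (faster than small.en) | 74M | English | English-only, better accuracy |
small | Fairly high | Medium (faster than medium) | 244M | Multilingual | Subtitle generation, meeting notes, standard needs |
small.en | Fairly high (English) | Medium (faster than medium.en) | 244M | English | English-only, higher accuracy |
medium | High | Slower (faster than large-v2) | 769M | Multilingual | Academic work, professional audio, demanding cases |
medium.en | High (English) | Slower (faster than large-v2) | 769M | English | English-only, very high accuracy |
large-v1 | Highest | Slow (baseline) | 1550M | Multilingual | High-quality transcription, long recordings |
large-v2 | Highest | Slow (baseline) | 1550M | Multilingual | Improved large model, generally better than v1 |
large-v3 | Highest | Slow (baseline) | 1550M | Multilingual | Newest large model in this list |
large | Highest | Slow (baseline) | 1550M | Multilingual | Same as large-v3 |
distil-large-v2 | Close to large-v2 | Faster (faster than large-v2) | 756M | English | Distilled model, more resource-efficient |
distil-medium.en | Close to medium.en | Medium-fast (faster than medium.en) | 394M | English | Distilled English model, balance of efficiency and accuracy |
distil-small.en | Close to small.en | Fast (faster than small.en) | 166M | English | Small distilled English model, lightweight |
distil-large-v3 | Close to large-v3 | Faster (faster than large-v3) | 756M | English | Newest distilled large model in this list |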
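2.2.2 Installation
After getting faster-whisper you still need to download a model.
Windows installation
Windows build of faster-whisper:
Releases · Purfview/whisper-standalone-win
Python installation
Install faster-whisper from PyPI (faster-whisper · PyPI); it requires Python 3.9 or later.
pip install faster-whisper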
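With the Python package installed, transcription and subtitle writing can be done without the standalone executable. The following is a minimal sketch rather than the method used in the rest of this article: `srt_timestamp`, `to_srt`, and `transcribe_to_srt` are helper names invented here, while `WhisperModel(...)` and `model.transcribe(...)` are the faster-whisper calls documented in its README. The import sits inside the function so the formatting helpers can be used without the package.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Turn objects with .start, .end and .text into SRT text, numbered from 1."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, srt_path: str, model_size: str = "small") -> None:
    """Transcribe one audio file and write the result as .srt (downloads the model on first use)."""
    from faster_whisper import WhisperModel  # needs: pip install faster-whisper

    model = WhisperModel(model_size, device="auto", compute_type="default")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    with open(srt_path, "w", encoding="utf-8") as handle:
        handle.write(to_srt(segments))
```

A call such as `transcribe_to_srt("lecture.mp3", "lecture.srt")` (hypothetical file names) produces a subtitle file that most players can load next to the video.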
Model download
Hugging Face profile: guillaumekln (Guillaume Klein)
large-v3 model: https://huggingface.co/Systran/faster-whisper-large-v3/tree/main
large-v2 model: https://huggingface.co/guillaumekln/faster-whisper-large-v2/tree/main
large-v1 model: https://huggingface.co/guillaumekln/faster-whisper-large-v1/tree/main
medium model: https://huggingface.co/guillaumekln/faster-whisper-medium/tree/main
small model: https://huggingface.co/guillaumekln/faster-whisper-small/tree/main
base model: https://huggingface.co/guillaumekln/faster-whisper-base/tree/main
tiny model: https://huggingface.co/guillaumekln/faster-whisper-tiny/tree/main
Mirror for users in mainland China:
https://aifasthub.com/models/guillaumekln
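Downloading can also be scripted. The sketch below maps the model names to the repositories listed above and fetches one with `snapshot_download` from the `huggingface_hub` package. Two things are assumptions of this example and not stated in the article: that `huggingface_hub` is installed, and that a folder named `faster-whisper-<model>` inside the models directory is the layout you want. `MODEL_REPOS`, `repo_url`, and `download_model_files` are names made up here.

```python
# Repositories exactly as listed in this section
MODEL_REPOS = {
    "large-v3": "Systran/faster-whisper-large-v3",
    "large-v2": "guillaumekln/faster-whisper-large-v2",
    "large-v1": "guillaumekln/faster-whisper-large-v1",
    "medium": "guillaumekln/faster-whisper-medium",
    "small": "guillaumekln/faster-whisper-small",
    "base": "guillaumekln/faster-whisper-base",
    "tiny": "guillaumekln/faster-whisper-tiny",
}


def repo_url(model: str) -> str:
    """Browser URL of the file listing for a model."""
    return f"https://huggingface.co/{MODEL_REPOS[model]}/tree/main"


def download_model_files(model: str, models_dir: str) -> str:
    """Download all files of one model; returns the local folder path."""
    from huggingface_hub import snapshot_download  # needs: pip install huggingface_hub

    # Assumed folder layout: <models_dir>/faster-whisper-<model>
    target = f"{models_dir}/faster-whisper-{model}"
    return snapshot_download(repo_id=MODEL_REPOS[model], local_dir=target)
```

The returned folder can then be passed to the tool that loads the model, for example through the `--model_dir` option of whisper-faster.exe.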
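2.2.3 whisper-faster parameters
All parameters
Values accepted by --model:
['tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3', 'large', 'distil-large-v2', 'distil-medium.en', 'distil-small.en', 'distil-large-v3']
# Usage:
whisper-faster.exe [options] audio_file
The audio argument may be a file wildcard, a file list (txt, m3u, m3u8, lst), or a directory for batch processing. Note: non-media files in a list or directory are filtered out by extension.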
PS D:\Users\Desktop\字幕\Whisper-Faster> .\whisper-faster.exe -h
usage: whisper-faster.exe [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR][--output_format {lrc,txt,text,vtt,srt,tsv,json,all}] [--verbose VERBOSE] [--task {transcribe,translate}][--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}][--language_detection_threshold LANGUAGE_DETECTION_THRESHOLD] [--language_detection_segments LANGUAGE_DETECTION_SEGMENTS][--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patience PATIENCE] [--length_penalty LENGTH_PENALTY][--repetition_penalty REPETITION_PENALTY] [--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE] [--suppress_blank SUPPRESS_BLANK][--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT] [--prefix PREFIX][--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] 
[--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE][--without_timestamps WITHOUT_TIMESTAMPS] [--max_initial_timestamp MAX_INITIAL_TIMESTAMP][--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK][--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD] [--logprob_threshold LOGPROB_THRESHOLD][--no_speech_threshold NO_SPEECH_THRESHOLD] [--v3_offsets_off][--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD] [--hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}][--clip_timestamps CLIP_TIMESTAMPS] [--no_speech_strict_lvl {0,1,2}] [--word_timestamps WORD_TIMESTAMPS][--highlight_words HIGHLIGHT_WORDS] [--prepend_punctuations PREPEND_PUNCTUATIONS] [--append_punctuations APPEND_PUNCTUATIONS][--threads THREADS] [--version] [--vad_filter VAD_FILTER] [--vad_threshold VAD_THRESHOLD][--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS] [--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S][--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS] [--vad_speech_pad_ms VAD_SPEECH_PAD_MS][--vad_window_size_samples VAD_WINDOW_SIZE_SAMPLES] [--vad_dump] [--max_new_tokens MAX_NEW_TOKENS] [--chunk_length CHUNK_LENGTH][--compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}] [--batch_recursive][--beep_off] [--skip] [--checkcuda] [--print_progress] [--postfix] [--check_files] [--PR163_off] [--hallucinations_list_off][--one_word {0,1,2}] [--sentence] [--standard] [--standard_asia] [--max_comma MAX_COMMA] [--max_comma_cent {50,60,70,80,90,100}][--max_gap MAX_GAP] [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT][--min_dist_to_end {0,4,5,6,7,8,9,10,11,12}] [--prompt_max {16,32,64,128,223}] [--reprompt {0,1,2}][--prompt_reset_on_no_end {0,1,2}] [--ff_dump] [--ff_track {1,2,3,4,5,6}] [--ff_fc] [--ff_mp3] [--ff_sync] [--ff_rnndn_sh][--ff_rnndn_xiph] [--ff_fftdn [0 - 97]] [--ff_tempo [0.5 - 2.0]] [--ff_gate] [--ff_speechnorm] [--ff_loudnorm][--ff_silence_suppress noise duration] 
[--ff_lowhighpass]audio [audio ...]positional arguments:audio audio file(s). You can enter a file wildcard, filelist (txt. m3u, m3u8, lst) or directory to do batch processing. Note: non-mediafiles in list or directory are filtered out by extension.optional arguments:-h, --help show this help message and exit--model MODEL, -m MODELname of the Whisper model to use (default: medium)--model_dir MODEL_DIRthe path to save model files; uses D:\Users\Desktop\字幕\Whisper-Faster\_models by default (default: None)--device DEVICE, -d DEVICEDevice to use. Default is 'cuda' if CUDA device is detected, else is 'cpu'. If CUDA GPU is a second device then set 'cuda:1'.(default: cuda)--output_dir OUTPUT_DIR, -o OUTPUT_DIRdirectory to save the outputs. By default the same folder where the executable file is or where media file is if--batch_recursive=True. '.'- sets to the current folder. 'source' - sets to where media file is. (default: default)--output_format {lrc,txt,text,vtt,srt,tsv,json,all}, -f {lrc,txt,text,vtt,srt,tsv,json,all}format of the output file; if not specified srt will be produced (default: srt)--verbose VERBOSE, -v VERBOSEwhether to print out debug messages (default: False)--task {transcribe,translate}whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate') (default: transcribe)--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian 
Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}, -l {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}language spoken in the audio, specify None to perform language detection (default: None)--language_detection_threshold 
LANGUAGE_DETECTION_THRESHOLDIf the maximum probability of the language tokens is higher than this value, the language is detected. (default: None)--language_detection_segments LANGUAGE_DETECTION_SEGMENTSNumber of segments/chunks to consider for the language detection. (default: 1)--temperature TEMPERATUREtemperature to use for sampling (default: 0)--best_of BEST_OF, -bo BEST_OFnumber of candidates when sampling with non-zero temperature (default: 5)--beam_size BEAM_SIZE, -bs BEAM_SIZEnumber of beams in beam search, only applicable when temperature is zero (default: 5)--patience PATIENCE, -p PATIENCEoptional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent toconventional beam search (default: 2.0)--length_penalty LENGTH_PENALTYoptional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization bydefault (default: 1.0)--repetition_penalty REPETITION_PENALTYPenalty applied to the score of previously generated tokens (set > 1.0 to penalize). (default: 1.0)--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZEPrevent repetitions of ngrams with this size (set 0 to disable). (default: 0)--suppress_blank SUPPRESS_BLANKSuppress blank outputs at the beginning of the sampling. (default: True)--suppress_tokens SUPPRESS_TOKENScomma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except commonpunctuations (default: -1)--initial_prompt INITIAL_PROMPT, -prompt INITIAL_PROMPToptional text to provide context as a prompt for the first window. Use 'None' to disable it. Note: 'auto' and 'default' areexperimental ~universal prompt presets, they work if --language is set. 
(default: auto)--prefix PREFIX Optional text to provide as a prefix for the first window (default: None)--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT, -condition CONDITION_ON_PREVIOUS_TEXTif True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent acrosswindows, but the model becomes less prone to getting stuck in a failure loop. If disabled then you may want to disable --reprompttoo. (default: True)--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATUREResets prompt if temperature is above this value. Arg has effect only if condition_on_previous_text is True. (default: 0.5)--without_timestamps WITHOUT_TIMESTAMPSOnly sample text tokens. (default: False)--max_initial_timestamp MAX_INITIAL_TIMESTAMPThe initial timestamp cannot be later than this. (default: 1.0)--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK, -fallback TEMPERATURE_INCREMENT_ON_FALLBACKtemperature to increase when falling back when the decoding fails to meet either of the thresholds below. To disable fallback setit to 'None'. (default: 0.2)--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLDif the gzip compression ratio is higher than this value, treat the decoding as failed (default: 2.4)--logprob_threshold LOGPROB_THRESHOLDif the average log probability is lower than this value, treat the decoding as failed (default: -1.0)--no_speech_threshold NO_SPEECH_THRESHOLDif the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to 'logprob_threshold',consider the segment as silence (default: 0.6)--v3_offsets_off Disables custom offsets to the defaults of pseudo-vad thresholds when 'large-v3' models are in use. Note: Offsets made to combat'large-v3' hallucinations. 
(default: False)--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD, -hst HALLUCINATION_SILENCE_THRESHOLD(Experimental) When word_timestamps is True, skip silent periods longer than this threshold (in seconds) when a possiblehallucination is detected. Optimal value is somewhere between 2 - 8 seconds. Inactive if None. (default: None)--hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}, -hst_temp {0.0,0.2,0.5,0.8,1.0}(Experimental) Additional heuristic for '--hallucination_silence_threshold'. If temperature is higher that this threshold thenconsider segment as possible hallucination ignoring the hst score. Inactive if 1.0. (default: 1.0)--clip_timestamps CLIP_TIMESTAMPSComma-separated list start,end,start,end,... timestamps (in seconds) of clips to process. The last end timestamp defaults to theend of the file. VAD is auto-disabled. (default: 0)--no_speech_strict_lvl {0,1,2}(experimental) Level of stricter actions when no_speech_prob > 0.93. Use beam_size=5 if this is enabled. Options: 0 - Disabled (donothing), 1 - Reset propmt (see condition_on_previous_text), 2 - Invalidate the cached encoder output (if no_speech_threshold isnot None). 
Arg meant to combat cases where the model is getting stuck in a failure loop or outputs nonsense (default: 0)--word_timestamps WORD_TIMESTAMPS, -wt WORD_TIMESTAMPSExtract word-level timestamps and refine the results based on them (default: True)--highlight_words HIGHLIGHT_WORDS, -hw HIGHLIGHT_WORDSunderline each word as it is spoken AKA karaoke in srt and vtt output formats (default: False)--prepend_punctuations PREPEND_PUNCTUATIONSif word_timestamps is True, merge these punctuation symbols with the next word (default: "'“¿([{-)--append_punctuations APPEND_PUNCTUATIONSif word_timestamps is True, merge these punctuation symbols with the previous word (default: "'.。,,!!??::”)]}、)--threads THREADS number of threads used for CPU inference; By default number of the real cores but no more that 4 (default: 0)--version Show Faster-Whisper's version number--vad_filter VAD_FILTER, -vad VAD_FILTEREnable the voice activity detection (VAD) to filter out parts of the audio without speech. (default: True)--vad_threshold VAD_THRESHOLDProbabilities above this value are considered as speech. (default: 0.45)--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MSFinal speech chunks shorter min_speech_duration_ms are thrown out. (default: 350)--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_SMaximum duration of speech chunks in seconds. Longer will be split at the timestamp of the last silence. (default: None)--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MSIn the end of each speech chunk time to wait before separating it. (default: 3000)--vad_speech_pad_ms VAD_SPEECH_PAD_MSFinal speech chunks are padded by speech_pad_ms each side. (default: 900)--vad_window_size_samples VAD_WINDOW_SIZE_SAMPLESSize of audio chunks fed to the silero VAD model. Values other than 512, 1024, 1536 may affect model perfomance!!! (default: 1536)--vad_dump Dumps VAD timings to a subtitle file for inspection. 
(default: False)--max_new_tokens MAX_NEW_TOKENSMaximum number of new tokens to generate per-chunk. (default: None)--chunk_length CHUNK_LENGTHThe length of audio segments. If it is not None, it will overwrite the default chunk_length of the FeatureExtractor. (default:None)--compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}, -ct {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}Type of quantization to use (see https://opennmt.net/CTranslate2/quantization.html). (default: auto)--batch_recursive, -brEnables recursive batch processing. Note: If set then it changes defaults of --output_dir. (default: False)--beep_off Disables the beep sound when operation is finished. (default: False)--skip Skips media file if subtitle exists. Works if input is wildcard or directory. (default: False)--checkcuda, -cc Returns CUDA device count. (for Subtitle Edit's internal use)--print_progress, -ppPrints progress bar instead of transcription. (default: False)--postfix Adds language as a postfix to subtitle's filename. (default: False)--check_files Checks input files for errors before passing all them for transcription. Works if input is wildcard or directory. (default: False)--PR163_off (For dev experiments) Disables PR163. . (default: False)--hallucinations_list_off(For dev experiments) Disables hallucinations_list, allows hallucinations added to prompt. (default: False)--one_word {0,1,2} 0) Disabled. 1) Outputs srt and vtt subtitles with one word per line. 2) As '1', plus removes whitespace and ensures >= 50ms forsub lines. Note: VAD may slightly reduce the accuracy of timestamps on some lines. (default: 0)--sentence Enables splitting lines to sentences for srt and vtt subs. Every sentence starts in the new segment. By default meant to outputwhole sentence per line for better translations, but not limited to, read about '--max_...' parameters. Note: has no effect on'highlight_words'. 
(default: False)--standard Quick hardcoded preset to split lines in standard way. 42 chars per 2 lines with max_comma_cent=70 and --sentence are activatedautomatically. (default: False)--standard_asia Quick hardcoded preset to split lines in standard way for some Asian languages. 16 chars per 2 lines with max_comma_cent=80 and--sentence are activated automatically. (default: False)--max_comma MAX_COMMA(requires --sentence) After this line length a comma is treated as the end of sentence. Note: disabled if it‘s over or equal to--max_line_width. (default: 250)--max_comma_cent {50,60,70,80,90,100}(requires --sentence) Percentage of --max_line_width when it starts breaking the line after comma. Note: 100 = disabled. (default:100)--max_gap MAX_GAP (requires --sentence) Threshold for a gap length in seconds, longer gaps are treated as dots. (default: 3.0)--max_line_width MAX_LINE_WIDTHThe maximum number of characters in a line before breaking the line. (default: 1000)--max_line_count MAX_LINE_COUNTThe maximum number of lines in one sub segment. (default: 1)--min_dist_to_end {0,4,5,6,7,8,9,10,11,12}(requires --sentence) If from words like 'the', 'Mr.' and ect. to the end of line distance is less than set then it starts in anew line. Note: 0 = disabled. (default: 0)--prompt_max {16,32,64,128,223}(experimental) The maximum size of prompt. (default: 223)--reprompt {0,1,2} (experimental) 0) Disabled. 1) Inserts initial_prompt after the prompt resets. 2) Ensures that initial_prompt is present in promptfor all windows/chunks. Note: auto-disabled if initial_prompt=None. It's similar to 'hotwords' feature. (default: 2)--prompt_reset_on_no_end {0,1,2}(experimental) Resets prompt if there is no end of sentence in window/chunk. 0 - disabled, 1 - looks for period, 2 - looks forperiod or comma. Note: it's auto-disabled if reprompt=0. (default: 2)--ff_dump Dumps pre-processed audio by the filters to the 16000Hz file and prevents deletion of some intermediate audio files. 
(default:False)--ff_track {1,2,3,4,5,6}Audio track selector. 1 - selects the first audio track. (default: 1)--ff_fc Selects only front-center channel (FC) to process. (default: False)--ff_mp3 Audio filter: Conversion to MP3 and back. (default: False)--ff_sync Audio filter: Stretch/squeeze samples to the given timestamps, with a maximum of 3600 samples per second compensation. Input filemust be container that support storing PTS like mp4, mkv... (default: False)--ff_rnndn_sh Audio filter: Suppress non-speech with GregorR‘s SH model using Recurrent Neural Networks. Notes: It’s more aggressive than Xiph,discards singing. (default: False)--ff_rnndn_xiph Audio filter: Suppress non-speech with Xiph’s original model using Recurrent Neural Networks. (default: False)--ff_fftdn [0 - 97] Audio filter: General denoise with Fast Fourier Transform. Notes: 12 - normal strength, 0 - disabled. (default: 0)--ff_tempo [0.5 - 2.0]Audio filter: Adjust audio tempo. Values below 1.0 slows down audio, above - speeds up. 1.0 = disabled. (default: 1.0)--ff_gate Audio filter: Reduce lower parts of a signal. (default: False)--ff_speechnorm Audio filter: Extreme and fast speech amplification. (default: False)--ff_loudnorm Audio filter: EBU R128 loudness normalization. (default: False)--ff_silence_suppress noise durationAudio filter: Suppress quiet parts of audio. Takes two values. First value - noise tolerance in decibels [-70 - 0] (0=disabled),second value - minimum silence duration in seconds [0.1 - 10]. (default: [0, 3.0])--ff_lowhighpass Audio filter: Pass 50Hz - 7800 band. sinc + afir. (default: False)
PS D:\Users\Desktop\字幕\Whisper-Faster>
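All parameters (translated)
-h, --help Show this help message and exit.
--model MODEL, -m MODEL Name of the Whisper model to use. (default: medium)
--model_dir MODEL_DIR Path where model files are saved; by default D:\Users\Desktop\字幕\Whisper-Faster\_models is used. (default: None)
--device DEVICE, -d DEVICE Device to use. Defaults to 'cuda' if a CUDA device is detected, otherwise 'cpu'. Set 'cuda:1' if the CUDA GPU is the second device. (default: cuda)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR Directory where the output is saved. By default this is the folder of the executable, or the location of the media file if --batch_recursive=True. '.' means the current folder. 'source' means the location of the media file. (default: default)
--output_format {lrc,txt,text,vtt,srt,tsv,json,all}, -f {lrc,txt,text,vtt,srt,tsv,json,all} Format of the output file; srt is produced if not specified. (default: srt)
--verbose VERBOSE, -v VERBOSE Whether to print debug messages. (default: False)
--task {transcribe,translate} Whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate'). (default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}, -l {same choices as --language} Language spoken in the audio; specify None to perform language detection. (default: None)
--language_detection_threshold LANGUAGE_DETECTION_THRESHOLD If the maximum probability of the language tokens is higher than this value, the language is detected. (default: None)
--language_detection_segments LANGUAGE_DETECTION_SEGMENTS Number of segments/chunks used for language detection. (default: 1)
--temperature TEMPERATURE Temperature to use for sampling. (default: 0)
--best_of BEST_OF, -bo BEST_OF Number of candidates when sampling with non-zero temperature. (default: 5)
--beam_size BEAM_SIZE, -bs BEAM_SIZE Number of beams in beam search, only applicable when temperature is zero. (default: 5)
--patience PATIENCE, -p PATIENCE Optional patience value to use in beam decoding, as described in https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search. (default: 2.0)
--length_penalty LENGTH_PENALTY Optional token length penalty coefficient (alpha), as described in https://arxiv.org/abs/1609.08144; simple length normalization is used by default. (default: 1.0)
--repetition_penalty REPETITION_PENALTY Penalty applied to the scores of previously generated tokens (set > 1.0 to penalize). (default: 1.0)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE Prevents repetitions of ngrams of this size (set 0 to disable). (default: 0)
--suppress_blank SUPPRESS_BLANK Suppress blank outputs at the beginning of sampling. (default: True)
--suppress_tokens SUPPRESS_TOKENS Comma-separated list of token IDs to suppress during sampling; '-1' suppresses most special characters except common punctuation. (default: -1)
--initial_prompt INITIAL_PROMPT, -prompt INITIAL_PROMPT Optional text to provide as a prompt for the first window. Use 'None' to disable it. Note: 'auto' and 'default' are experimental ~universal prompt presets; they work if --language is set. (default: auto)
--prefix PREFIX Optional text to provide as a prefix for the first window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT, -condition CONDITION_ON_PREVIOUS_TEXT If True, the model's previous output is given as a prompt for the next window; disabling it may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop. If disabled, you may want to disable --reprompt as well. (default: True)
--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE Resets the prompt if the temperature is above this value. Only has an effect when condition_on_previous_text is True. (default: 0.5)
--without_timestamps WITHOUT_TIMESTAMPS Only sample text tokens. (default: False)
--max_initial_timestamp MAX_INITIAL_TIMESTAMP The initial timestamp cannot be later than this value. (default: 1.0)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK, -fallback TEMPERATURE_INCREMENT_ON_FALLBACK Temperature to add when falling back because decoding failed to meet either of the thresholds below. To disable fallback, set it to 'None'. (default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD If the gzip compression ratio is higher than this value, treat the decoding as failed. (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD If the average log probability is lower than this value, treat the decoding as failed. (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD If the probability of the <|nospeech|> token is higher than this value and the decoding has failed due to 'logprob_threshold', treat the segment as silence. (default: 0.6)
--v3_offsets_off Disables the custom offsets applied to the pseudo-VAD threshold defaults when the 'large-v3' model is used. Note: the offsets are there to suppress 'large-v3' hallucinations. (default: False)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD, -hst HALLUCINATION_SILENCE_THRESHOLD (experimental) When word_timestamps is True, skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected. The best value is somewhere between 2 and 8 seconds. Inactive if None. (default: None)
--hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}, -hst_temp {0.0,0.2,0.5,0.8,1.0} (experimental) Additional heuristic for '--hallucination_silence_threshold'. If the temperature is higher than this threshold, the segment is treated as a possible hallucination, ignoring the hst score. Inactive if 1.0. (default: 1.0)
--clip_timestamps CLIP_TIMESTAMPS Comma-separated list start,end,start,end,... of timestamps (in seconds) of the clips to process. The last end timestamp defaults to the end of the file. VAD is disabled automatically. (default: 0)
--no_speech_strict_lvl {0,1,2} (experimental) Level of strict action taken when no_speech_prob > 0.93. Uses beam_size=5 if enabled. Options: 0 - disabled (does nothing), 1 - reset the prompt (see condition_on_previous_text), 2 - invalidate the cached encoder output (if no_speech_threshold is not None). Meant for cases where the model gets stuck in a failure loop or outputs nonsense. (default: 0)
--word_timestamps WORD_TIMESTAMPS, -wt WORD_TIMESTAMPS Extract word-level timestamps and refine the results based on them. (default: True)
--highlight_words HIGHLIGHT_WORDS, -hw HIGHLIGHT_WORDS Underline each word as it is spoken in srt and vtt output (i.e. karaoke effect). (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS If word_timestamps is True, merge these punctuation symbols with the next word. (default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS If word_timestamps is True, merge these punctuation symbols with the previous word. (default: "'.。,,!!??::”)]}、)
--threads THREADS Number of threads used for CPU inference; by default the number of real cores, but no more than 4. (default: 0)
--version Show Faster-Whisper's version number.
--vad_filter VAD_FILTER, -vad VAD_FILTER Enable voice activity detection (VAD) to filter out the parts of the audio without speech. (default: True)
--vad_threshold VAD_THRESHOLD Probabilities above this value are considered speech. (default: 0.45)
--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS Final speech chunks shorter than min_speech_duration_ms are discarded. (default: 350)
--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S Maximum duration of a speech chunk in seconds. Longer chunks are split at the timestamp of the last silence. (default: None)
--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS Time to wait at the end of each speech chunk before separating it. (default: 3000)
--vad_speech_pad_ms VAD_SPEECH_PAD_MS Final speech chunks are padded by speech_pad_ms on each side. (default: 900)
--vad_window_size_samples VAD_WINDOW_SIZE_SAMPLES Size of the audio chunks fed to the silero VAD model. Values other than 512, 1024, 1536 may affect model performance!!! (default: 1536)
--vad_dump Dumps VAD timestamps into a subtitle file for inspection. (default: False)
--max_new_tokens MAX_NEW_TOKENS Maximum number of new tokens generated per chunk. (default: None)
--chunk_length CHUNK_LENGTH Length of the audio segments. If not None, it overrides the default chunk_length of the FeatureExtractor. (default: None)
--compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}, -ct {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16} Quantization type to use (see https://opennmt.net/CTranslate2/quantization.html). (default: auto)
--batch_recursive, -br Enables recursive batch processing. Note: if set, it changes the default of --output_dir. (default: False)
--beep_off Disables the beep when the operation finishes. (default: False)
--skip Skips media files whose subtitles already exist. Works if the input is a wildcard or a directory. (default: False)
--checkcuda, -cc Returns the number of CUDA devices. (for internal use by Subtitle Edit)
--print_progress, -pp Prints a progress bar instead of the transcription. (default: False)
--postfix Adds the language as a postfix to the subtitle file name. (default: False)
--check_files Checks the input files for errors before passing them all to transcription. Works if the input is a wildcard or a directory. (default: False)
--PR163_off (for dev experiments) Disables PR163. (default: False)
--hallucinations_list_off (for dev experiments) Disables hallucinations_list, allowing hallucinations to be added to the prompt. (default: False)
--one_word {0,1,2} 0) Disabled. 1) Outputs srt and vtt subtitles with one word per line. 2) As '1', and additionally removes whitespace and ensures subtitle lines are >= 50ms. Note: VAD may slightly reduce the accuracy of the timestamps on some lines. (default: 0)
--sentence Enables splitting lines into sentences for srt and vtt subtitles. Every sentence starts in a new segment. By default it aims to output a whole sentence per line for better translation, but it is not limited to that, see the '--max_...' parameters. Note: has no effect on 'highlight_words'. (default: False)
--standard Quick hardcoded preset to split lines in the standard way. 42 characters per line, 2 lines, max_comma_cent=70 and --sentence are activated automatically. (default: False)
--standard_asia Quick hardcoded preset to split lines in the standard way for some Asian languages. 16 characters per line, 2 lines, max_comma_cent=80 and --sentence are activated automatically. (default: False)
--max_comma MAX_COMMA (requires --sentence) Beyond this line length a comma is treated as the end of a sentence. Note: disabled if the value is greater than or equal to --max_line_width. (default: 250)
--max_comma_cent {50,60,70,80,90,100} (requires --sentence) Percentage of --max_line_width at which lines start being broken after a comma. Note: 100 = disabled. (default: 100)
--max_gap MAX_GAP (requires --sentence) Threshold for gap length in seconds; longer gaps are treated as dots. (default: 3.0)
--max_line_width MAX_LINE_WIDTH Maximum number of characters in a line before it is broken. (default: 1000)
--max_line_count MAX_LINE_COUNT Maximum number of lines in one subtitle segment. (default: 1)
--min_dist_to_end {0,4,5,6,7,8,9,10,11,12} (requires --sentence) If the distance from words like 'the', 'Mr.' etc. to the end of the line is less than this value, a new line is started. Note: 0 = disabled. (default: 0)
--prompt_max {16,32,64,128,223} (experimental) Maximum size of the prompt. (default: 223)
--reprompt {0,1,2} (experimental) 0) Disabled. 1) Inserts initial_prompt after the prompt resets. 2) Ensures that initial_prompt is present in the prompt for all windows/chunks. Note: disabled automatically if initial_prompt=None. Similar to a 'hotwords' feature. (default: 2)
--prompt_reset_on_no_end {0,1,2} (experimental) Resets the prompt if there is no end of sentence in the window/chunk. 0 - disabled, 1 - looks for a period, 2 - looks for a period or comma. Note: disabled automatically if reprompt=0. (default: 2)
--ff_dump Dumps the audio pre-processed by the filters to a 16000Hz file and prevents deletion of some intermediate audio files. (default: False)
--ff_track {1,2,3,4,5,6} Audio track selector. 1 - selects the first audio track. (default: 1)
--ff_fc Selects only the front-center channel (FC) for processing. (default: False)
--ff_mp3 Audio filter: conversion to MP3 and back. (default: False)
--ff_sync Audio filter: stretch/squeeze samples to the given timestamps, with a maximum compensation of 3600 samples per second. The input file must be a container that supports storing PTS, like mp4, mkv... (default: False)
--ff_rnndn_sh Audio filter: suppress non-speech with GregorR's SH model using recurrent neural networks. Note: more aggressive than the Xiph model, discards singing. (default: False)
--ff_rnndn_xiph Audio filter: suppress non-speech with Xiph's original model using recurrent neural networks. (default: False)
--ff_fftdn [0 - 97] Audio filter: general denoising with Fast Fourier Transform. Note: 12 - normal strength, 0 - disabled. (default: 0)
--ff_tempo [0.5 - 2.0] Audio filter: adjust audio tempo. Values below 1.0 slow the audio down, values above 1.0 speed it up. 1.0 = disabled. (default: 1.0)
--ff_gate Audio filter: reduce the lower (quieter) parts of a signal. (default: False)
--ff_speechnorm Audio filter: extreme and fast speech amplification. (default: False)
--ff_loudnorm Audio filter: EBU R128 loudness normalization. (default: False)
--ff_silence_suppress noise duration Audio filter: suppress quiet parts of the audio. Takes two values. First value - noise tolerance in decibels [-70 - 0] (0 = disabled), second value - minimum silence duration in seconds [0.1 - 10]. (default: [0, 3.0])
--ff_lowhighpass Audio filter: pass the 50Hz - 7800Hz band. sinc + afir. (default: False)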
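Performance tuning
# Use CUDA, set the number of CPU threads, and set the model quantization type
whisper-faster --device cuda --threads 8 --compute_type int8_float16
- Change the length of audio the Whisper model processes in one chunk, in seconds [with the small model, the largest value that worked in my tests was 30]
--chunk_length 20
- Change the model quantization type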
--compute_type int8_float16
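Differences between quantization types
Quantization type | Precision | Memory footprint | Speed | Suitable for |
---|---|---|---|---|
float32 | Highest | Full model size (e.g. medium = 3 GB) | Slowest | Maximum precision, when GPU memory is plentiful (≥8 GB) |
float16 | High | About 50% of float32 | Fast (GPU-accelerated) | NVIDIA GPUs (with Tensor Cores), when balancing precision and speed |
int8 | Medium | About 25% of float32 | Fast (CPU/GPU) | Limited GPU memory (≤4 GB), where a slight precision loss is acceptable (WER up by about 1-3%) |
int8_float16 | Medium-high | About 25-30% of float32 | Fastest | Recommended! Mixed precision: on top of int8, key layers keep float16 precision, balancing memory and precision |
Taking the medium.en model processing 1 hour of audio as an example:
Setting | Peak memory | Processing time (RTX 3060) | WER (word error rate) |
---|---|---|---|
float32 | 3.2GB | 15 min | 4.5% |
float16 | 1.6GB | 10 min | 4.6% |
int8 | 0.8GB | 8 min | 5.0% |
int8_float16 | 0.9GB | 7 min | 4.7% |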
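To make the memory column easier to apply, here is a minimal Python sketch that turns the approximate ratios from the first table into a quick estimate. The names MEMORY_RATIO, estimate_memory_gb and types_that_fit are made up for this illustration, and the ratios are only the rough figures quoted above, not measurements.

```python
# Approximate memory use relative to float32, copied from the table above.
# Each entry is a (low, high) range; these are rough figures, not measurements.
MEMORY_RATIO = {
    "float32": (1.0, 1.0),
    "float16": (0.5, 0.5),
    "int8": (0.25, 0.25),
    "int8_float16": (0.25, 0.30),
}


def estimate_memory_gb(float32_gb, compute_type):
    """Return a (low, high) memory estimate in GB for the given compute type."""
    low, high = MEMORY_RATIO[compute_type]
    return (float32_gb * low, float32_gb * high)


def types_that_fit(float32_gb, budget_gb):
    """List the compute types whose upper estimate stays within the budget."""
    return [
        name
        for name in MEMORY_RATIO
        if estimate_memory_gb(float32_gb, name)[1] <= budget_gb
    ]


if __name__ == "__main__":
    # 3.2 GB is the float32 peak quoted for medium.en in the second table.
    for name in MEMORY_RATIO:
        print(name, estimate_memory_gb(3.2, name))
    print(types_that_fit(3.2, budget_gb=2.0))
```

With the 3.2 GB float32 figure from the second table, the int8_float16 range comes out at roughly 0.8 to 1.0 GB, which agrees with the 0.9 GB listed there.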
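- Set the number of threads used for CPU inference
The --threads parameter (CPU inference only)
Purpose: sets the number of threads used for CPU inference, to make better use of multi-core CPUs.
Values: 0 (auto-detect; by default no more than 4 threads), or set it manually to the number of CPU cores (e.g. 4 or 8).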
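Subtitle length
# When several sentences get recognized as one very long line [the console output may still show long lines, but the written file will be split]:
--standard # Standard preset: 42 characters per line, at most 2 lines, --sentence activated automatically
--standard_asia # Tuned for Asian languages: 16 characters per line, at most 2 lines, a higher comma threshold (max_comma_cent=80)
# Optional
--vad_min_silence_duration_ms 500 --vad_threshold 0.5 # Lower the minimum silence duration to 0.5 s
--sentence --max_line_width 30 --max_line_count 2 # Split by sentence, at most 30 characters per line and 2 lines per segment
--max_comma 25 --max_comma_cent 70 # After 25 characters a comma counts as the end of a sentence; at 70% of the line width, prefer breaking after a comma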
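To check whether these options had the intended effect, a small script can scan the generated .srt file for cues that still exceed the limits. This is only a sketch: find_overlong_cues is a hypothetical helper, it assumes the usual SRT layout (index line, timing line, text lines, blank line between cues), and it counts characters with Python's len(), which may not match exactly how the tool measures line width.

```python
def find_overlong_cues(srt_text, max_line_width=16, max_line_count=2):
    """Return the index of every cue that has too many or too wide text lines."""
    problems = []
    # Normalize Windows line endings, then split on the blank lines between cues.
    for block in srt_text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (index, timing, text)
        index, text_lines = lines[0], lines[2:]
        too_wide = any(len(line) > max_line_width for line in text_lines)
        too_many = len(text_lines) > max_line_count
        if too_wide or too_many:
            problems.append(index)
    return problems


# Made-up sample data, only to show the expected input shape.
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:01,600
short line

2
00:00:03,560 --> 00:00:08,460
this line is far longer than sixteen characters
"""

if __name__ == "__main__":
    print(find_overlong_cues(SAMPLE_SRT))
```

The defaults mirror the --standard_asia preset (16 characters, 2 lines); pass 42 and 2 to check output produced with --standard.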
2.2.4 Examples
small
# Test machine:
CPU: I5 8250U
RAM: 16G
GPU: MX150 2G
# Directory layout
D:\Users\Desktop\字幕
├── faster-whisper-large-v2
├── faster-whisper-small
│ ├── config.json
│ ├── model.bin
│ ├── preprocessor_config.json
│ ├── tokenizer.json
│ └── vocabulary.txt
├── faster-whisper-tiny
├── output_audio.srt
└── Whisper-Faster
    ├── cublas64_11.dll
    ├── cublasLt64_11.dll
    ├── cudnn_cnn_infer64_8.dll
    ├── cudnn_ops_infer64_8.dll
    ├── Wav47B5.tmp
    ├── whisper-faster.exe
    └── zlibwapi.dll
.\whisper-faster.exe --model_dir "D:\Users\Desktop\字幕" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\存储数据迁移原理.wav"
With 8 CPU threads
.\whisper-faster.exe --device cuda --threads 8 --compute_type int8 --standard_asia --chunk_length 20 --model small --model_dir "D:\Users\Desktop\字幕" --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\HyperCDP技术.wav"
Output
PS D:\Users\Desktop\字幕\Whisper-Faster> .\whisper-faster.exe --model_dir "D:\Users\Desktop\字幕" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\存储数据迁移原理.wav"
Standalone Faster-Whisper r192.3 running on: CUDA
Starting work on: D:\BaiduNetdiskDownload\存储数据迁移原理.wav
[00:00.580 --> 00:32.830] Hello,Hello,现在可以吗?现在可以吗?有声音吗?有声音吗?
。。。。。
[03:11:16.700 --> 03:11:21.680] 那接下来呢,我们今天上午呢就到这个地方,大家扣6,我们就下课了,下午两点钟我们继续。
Transcription speed: 0.61 audio seconds/s
Subtitles are written to 'D:\Users\Desktop\字幕' directory.
Operation finished in: 18967 seconds
PS D:\Users\Desktop\字幕\Whisper-Faster>
large-v2
# Test machine:
CPU: AMD R7 3700X
RAM: 48G
GPU: RTX2070 8G
Transcribing 3 hours of audio to subtitles: CPU usage around 33%, memory usage under 1 GB, GPU fully loaded
# Directory layout
D:\faster-whisper
├── faster-whisper-large-v2
│ ├── config.json
│ ├── gitattributes
│ ├── model.bin
│ ├── README.md
│ ├── tokenizer.json
│ └── vocabulary.txt
└── Faster-Whisper-XXL
    ├── _xxl_data
    ├── faster-whisper-xxl.exe
    ├── ffmpeg.exe
    └── One Click Transcribe.bat
.\faster-whisper-xxl.exe --device cuda --threads 12 --compute_type int8_float16 --standard_asia --chunk_length 30 --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\faster-whisper" "G:\学习视频\异步远程复制原理.wav"
# Output
PS D:\faster-whisper\Faster-Whisper-XXL> .\faster-whisper-xxl.exe --device cuda --threads 12 --compute_type int8_float16 --standard_asia --chunk_length 30 --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\faster-whisper" "G:\学习视频\异步远程复制原理.wav"
Standalone Faster-Whisper-XXL r245.4 running on: CUDA
Starting to process: G:\学习视频\异步远程复制原理.wav
Starting sequential faster-whisper inference.
[01:18.410 --> 01:49.220] 接下来是昨日。
。。。。。
[03:05:45.530 --> 03:05:46.670] 下午两点钟我们再继续啊。
[03:06:13.500 --> 03:06:14.540] 各位同学我们下课了啊。
Transcription speed: 12.25 audio seconds/s
Subtitles are written to 'D:\faster-whisper' directory.
Operation finished in: 0:16:06.705
PS D:\faster-whisper\Faster-Whisper-XXL>
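The "Transcription speed" figure can be cross-checked against the run time. The sketch below assumes it means seconds of audio processed per second of wall-clock time, which is my reading of the log rather than something the tool documents here; to_seconds and estimated_runtime are names invented for this example.

```python
def to_seconds(timestamp):
    """Convert 'HH:MM:SS.mmm' or 'MM:SS.mmm' to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def estimated_runtime(audio_length, speed):
    """Estimate wall-clock seconds from the audio length and the audio seconds/s figure."""
    return to_seconds(audio_length) / speed


if __name__ == "__main__":
    # Last timestamp and speed from the large-v2 run (RTX2070).
    print(estimated_runtime("03:06:14.540", 12.25))
    # Last timestamp and speed from the small run (MX150).
    print(estimated_runtime("03:11:21.680", 0.61))
```

For the large-v2 run this gives a little over 15 minutes against the reported 0:16:06, and for the small run on the MX150 roughly 18,800 seconds against the reported 18967, so the reading looks plausible; the remainder is presumably start-up and model loading time.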
large-v2 on a whole folder
Transcribe every audio file under D:\HN\桌面\新建文件夹 and write the subtitles to the D:\HN\桌面\字幕 directory
.\faster-whisper-xxl.exe --device cuda --threads 16 --compute_type int8_float16 --standard_asia --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\HN\桌面\字幕" D:\HN\桌面\新建文件夹
PS D:\HN\桌面\新建文件夹> ls
Directory: D:\HN\桌面\新建文件夹
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/5/18 23:20 172193166 周末班-上午.mp3
-a---- 2025/5/18 23:20 166684464 周末班-下午.mp3
PS D:\HN\桌面\新建文件夹>
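Checking progress
While it runs, the cmd window title turns into a progress readout
Default:"47%|5128/10864|32:08<<35:57|2.66 audioseconds/s"
- 47%: the current progress is 47%.
- 5128/10864: the amount of audio processed so far (5128) out of the total (10864). These appear to be seconds of audio rather than frames, since 32:08 (1928 s) × 2.66 ≈ 5128.
- 32:08<<35:57: the time elapsed (32 min 08 s) and the estimated time remaining (35 min 57 s), not the total time: (10864 - 5128) / 2.66 ≈ 2156 s ≈ 35:56.
- 2.66 audioseconds/s: the processing speed, i.e. seconds of audio processed per second.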
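The fields in that title can be pulled apart with a regular expression, which also makes it easy to check how they relate to each other. This is a sketch based only on the single sample string above; the pattern, parse_title and clock_to_seconds are my own names, and the title format may differ between versions of the tool.

```python
import re

# Pattern derived from the one sample title shown above (an assumption).
TITLE_PATTERN = re.compile(
    r"(?P<percent>\d+)%\|(?P<done>\d+)/(?P<total>\d+)\|"
    r"(?P<elapsed>[\d:]+)<<(?P<remaining>[\d:]+)\|"
    r"(?P<speed>[\d.]+) audioseconds/s"
)


def clock_to_seconds(clock):
    """Convert 'MM:SS' or 'HH:MM:SS' to whole seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_title(title):
    """Return the progress fields as a dict, or None if the title does not match."""
    match = TITLE_PATTERN.search(title)
    if match is None:
        return None
    return {
        "percent": int(match["percent"]),
        "done": int(match["done"]),
        "total": int(match["total"]),
        "elapsed_s": clock_to_seconds(match["elapsed"]),
        "remaining_s": clock_to_seconds(match["remaining"]),
        "speed": float(match["speed"]),
    }


if __name__ == "__main__":
    info = parse_title('Default:"47%|5128/10864|32:08<<35:57|2.66 audioseconds/s"')
    print(info)
    # Elapsed time x speed should land close to the "done" counter.
    print(info["elapsed_s"] * info["speed"], info["done"])
```

On the sample string, elapsed time multiplied by speed lands within one unit of the 5128 counter, which suggests the counter is in seconds of audio and that the second clock value is the time remaining.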
2.2.5 Errors
CUDA Out of Memory
# Error: CUDA out of memory
RuntimeError: CUDA failed with error out of memory
[01:51:06.920 --> 01:51:20.480] 那接下来呢,我们来看一下,除了这个migration,在v5的早期啊,它就只有快业物,在v5的后期呢,它就有了文件业物,一直现在呢,
Traceback (most recent call last):
  File "D:\whisper-fast\__main__.py", line 1600, in <module>
  File "D:\whisper-fast\__main__.py", line 1527, in cli
  File "faster_whisper\transcribe.py", line 1373, in restore_speech_timestamps
  File "faster_whisper\transcribe.py", line 722, in generate_segments
  File "faster_whisper\transcribe.py", line 1072, in generate_with_fallback
RuntimeError: CUDA failed with error out of memory
[18364] Failed to execute script '__main__' due to unhandled exception!
PS D:\Users\Desktop\字幕\Whisper-Faster>
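Fix:
- Change the length of audio the Whisper model processes in one chunk, in seconds [with the small model, the largest value that worked in my tests was 30]
--chunk_length 20
- Ample GPU memory (≥8 GB): 30-60 seconds (balances speed and contextual coherence).
- Limited GPU memory (≤4 GB): 10-20 seconds (avoids CUDA Out of Memory).
- Audio with long sentences: 20-40 seconds (so complete sentences are not cut off).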
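If you would rather not pick the value by hand, the same idea can be automated: start with a larger chunk length and step down when a run fails. The sketch below is hypothetical throughout. transcribe_with_fallback and make_runner are names I made up, and it assumes the executable exits with a non-zero code when it crashes with the out-of-memory error, which I have not verified for every version.

```python
import subprocess


def transcribe_with_fallback(run, chunk_lengths=(30, 20, 10)):
    """Try each chunk length in turn; return the first one that succeeds, else None."""
    for chunk_length in chunk_lengths:
        if run(chunk_length):
            return chunk_length
    return None


def make_runner(exe, audio, extra_args=()):
    """Build a runner that calls the transcriber with a given --chunk_length.

    Assumption: a crash (such as CUDA out of memory) yields a non-zero exit code.
    """
    def run(chunk_length):
        command = [exe, *extra_args, "--chunk_length", str(chunk_length), audio]
        return subprocess.run(command).returncode == 0
    return run


if __name__ == "__main__":
    # Stand-in runner for demonstration: pretend anything above 20 s runs out of memory.
    print(transcribe_with_fallback(lambda seconds: seconds <= 20))
```

For a real run, replace the lambda with something like make_runner(r".\whisper-faster.exe", "input.wav", ["--model", "small", "-l", "zh"]), where the paths are placeholders. Note that a failed attempt still costs the time it ran before crashing.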
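3 Hands-on demo
3.1 Windows-only demo
Full workflow
The full workflow:
- Install ffmpeg
- Download faster-whisper
- Download a faster-whisper model
- Use ffmpeg to extract the audio from the video
- Run faster-whisper with the chosen model to recognize the speech and generate subtitles
Test machine environment
# Test machine environment:
CPU: I5 8250U
RAM: 16G
GPU: MX150 2G
# Video location: D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4
# faster-whisper directory: E:\字幕识别工具\Whisper-Faster
# faster-whisper model directory: E:\字幕识别工具\模型存放目录
# Directory layout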
E:\字幕识别工具
├── Whisper-Faster
│ ├── cublas64_11.dll
│ ├── cublasLt64_11.dll
│ ├── cudnn_cnn_infer64_8.dll
│ ├── cudnn_ops_infer64_8.dll
│ ├── Wav47B5.tmp
│ ├── whisper-faster.exe
│ └── zlibwapi.dll
└── 模型存放目录
    ├── faster-whisper-tiny
    ├── faster-whisper-large-v2
    └── faster-whisper-small
        ├── config.json
        ├── model.bin
        ├── preprocessor_config.json
        ├── tokenizer.json
        └── vocabulary.txt
Extract the audio from the video with ffmpeg
ffmpeg -i "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
-vn: disables the video stream so that only the audio is extracted.
-acodec mp3: sets the audio codec to MP3.
# Output
PS C:\Users\h1369> ffmpeg -i "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-lcms2 --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-libdvdnav --enable-libdvdread --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libopenjpeg --enable-libquirc --enable-libuavs3d --enable-libxevd --enable-libzvbi --enable-libqrencode --enable-librav1e --enable-libsvtav1 --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxeve --enable-libxvid --enable-libaom --enable-libjxl --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-liblc3 --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
  libavutil 60. 2.100 / 60. 2.100
  libavcodec 62. 3.101 / 62. 3.101
  libavformat 62. 0.102 / 62. 0.102
  libavdevice 62. 0.100 / 62. 0.100
  libavfilter 11. 0.100 / 11. 0.100
  libswscale 9. 0.100 / 9. 0.100
  libswresample 6. 0.100 / 6. 0.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp4':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
  Duration: 00:01:42.17, start: 0.000000, bitrate: 2680 kb/s
  Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
    Metadata:
      vendor_id : [0][0][0][0]
  Stream #0:1[0x2](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1916x1076 [SAR 1:1 DAR 479:269], 2485 kb/s, 29.93 fps, 30 tbr, 100k tbn (default)
    Metadata:
      vendor_id : [0][0][0][0]
      encoder : JVT/AVC Coding
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
    TSSE : Lavf62.0.102
  Stream #0:0(und): Audio: mp3, 48000 Hz, stereo, fltp (default)
    Metadata:
      encoder : Lavc62.3.101 libmp3lame
      vendor_id : [0][0][0][0]
[out#0/mp3 @ 0000024fc68ac040] video:0KiB audio:1598KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.020112%
size= 1598KiB time=00:01:42.17 bitrate= 128.1kbits/s speed=86.6x elapsed=0:00:01.17
PS C:\Users\h1369>
Generate subtitles with faster-whisper
Run faster-whisper with the chosen model to recognize the speech and generate subtitles
E:\字幕识别工具\Whisper-Faster\whisper-faster.exe --standard_asia --model_dir "E:\字幕识别工具\模型存放目录" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\新建文件夹" "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
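# Parameter notes
--standard_asia # Tuned for Asian languages: 16 characters per line, at most 2 lines, a higher comma threshold (max_comma_cent=80)
--model_dir # Directory that holds the models: E:\字幕识别工具\模型存放目录
--model # Model to use: small
-l zh # Recognition language: Chinese
--chunk_length # Audio chunk length: 20 seconds
--output_format # Output subtitle format: srt
--output_dir # Directory where the subtitle file is saved: D:\Users\Desktop\新建文件夹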
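When driving the tool from Python, the same parameters can be assembled into an argument list. A minimal sketch: build_whisper_command is a helper name I made up, and it only uses the flags listed above. Passing a list to subprocess.run means paths with spaces or brackets need no manual quoting.

```python
import subprocess


def build_whisper_command(exe, audio, model_dir, model="small", language="zh",
                          chunk_length=20, output_format="srt", output_dir=".",
                          standard_asia=True):
    """Assemble the whisper-faster argument list from the flags described above."""
    command = [exe]
    if standard_asia:
        command.append("--standard_asia")
    command += [
        "--model_dir", model_dir,
        "--model", model,
        "-l", language,
        "--chunk_length", str(chunk_length),
        "--output_format", output_format,
        "--output_dir", output_dir,
        audio,
    ]
    return command


def transcribe(*args, **kwargs):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_whisper_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command for the demo file instead of running it.
    print(build_whisper_command(
        r"E:\字幕识别工具\Whisper-Faster\whisper-faster.exe",
        r"D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3",
        model_dir=r"E:\字幕识别工具\模型存放目录",
        output_dir=r"D:\Users\Desktop\新建文件夹",
    ))
```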
# Output
PS C:\Users\h1369> E:\字幕识别工具\Whisper-Faster\whisper-faster.exe --standard_asia --model_dir "E:\字幕识别工具\模型存放目录" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\新建文件夹" "D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3"
Standalone Faster-Whisper r192.3 running on: CUDA
Starting work on: D:\Users\Desktop\新建文件夹\[新闻30分]国内简讯-1.mp3
[00:00.000 --> 00:01.600] 接下来更多消息,我们来看一组简讯。
[00:03.560 --> 00:08.460] 今年以来,消费品已就换新,有力带动消费,持续回升向好。
[00:08.920 --> 00:11.820] 商务部数据显示,截至5月31号,
[00:12.080 --> 00:17.200] 今年消费品已就换新,五大品类合计带动销售额1.1万亿元。
[00:17.200 --> 00:20.960] 发放直达消费者的补贴约1.75亿份。
[00:21.940 --> 00:25.800] 5月份,山东青岛港开闭三条全新航线,
[00:26.120 --> 00:31.500] 覆盖了巴西、阿根廷、智利等南美主要经济体以及中东疏扭港口,
[00:31.860 --> 00:36.620] 预计每年为青岛港新增集装箱吞吐量超过20万飙箱。
[00:37.560 --> 00:40.040] 新加坡与毛球公开赛昨晚结束,
[00:40.340 --> 00:42.160] 中国队一金两银收关。
[00:42.540 --> 00:46.360] 女单决赛中,陈宇飞战胜队友王子仪获得了冠军。
[00:46.700 --> 00:49.460] 南单决赛中,中国选手陆光祖0比2
[00:49.460 --> 00:52.180] 不敌泰国名将昆拉伍特收获亚军。
[00:53.200 --> 00:56.020] 1号,在2025年法国网球公开赛
[00:56.020 --> 01:00.340] 女单第四轮比赛中,赛会8号种子中国选手郑青文
[01:00.340 --> 01:03.260] 经历了2小时47分钟的苦战,
[01:03.540 --> 01:06.620] 2比1击败了俄罗斯选手萨蒙索诺娃,
[01:06.800 --> 01:09.080] 职业生涯首次近期罚网八强。
[01:09.440 --> 01:12.220] 郑青文在四分之一决赛中的对手将会是
[01:12.220 --> 01:13.740] 头号种子萨巴伦卡,
[01:13.740 --> 01:17.420] 后者执落两盘,战胜美国选手阿尼西莫娃。
[01:18.160 --> 01:23.000] 记者昨天从中国气象局国家空间天气监测预警中心获悉,
[01:23.440 --> 01:25.820] 5月31号,太阳爆发药班,
[01:26.180 --> 01:29.040] 地球可能连续三天发生地磁爆,
[01:29.560 --> 01:32.820] 卫星通信、航天气运行等可能会受到干扰。
[01:32.820 --> 01:36.340] 我国北部有机会出现较为明显的极光,
[01:36.500 --> 01:38.860] 但不会对人体健康有影响。
Transcription speed: 2.16 audio seconds/s
Subtitles are written to 'D:\Users\Desktop\新建文件夹' directory.
Operation finished in: 52 seconds
PS C:\Users\h1369>
PS C:\Users\h1369> ls D:\Users\Desktop\新建文件夹
Directory: D:\Users\Desktop\新建文件夹
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/6/19 9:44 1636169 [新闻30分]国内简讯-1.mp3
-a---- 2025/6/18 10:18 34232134 [新闻30分]国内简讯-1.mp4
-a---- 2025/6/19 9:57 2233 [新闻30分]国内简讯-1.srt
-a---- 2025/6/18 10:22 35120886 [新闻30分]国内简讯-2.mp4
PS C:\Users\h1369>
3.2 Extras
Batch-extracting audio with Python
import os
import subprocess
import multiprocessing

import ffmpeg  # provided by the ffmpeg-python package


# Convert a video to audio by calling the ffmpeg command line
def video_to_audio_cmd(_video_file, _audio_file):
    # Build the ffmpeg command
    command = [
        "ffmpeg",
        "-hwaccel", "cuda",
        "-i", _video_file,
        "-vn",
        "-acodec", "libmp3lame",
        _audio_file,
    ]
    # Run the command
    try:
        subprocess.run(command, check=True)
        print("ffmpeg command succeeded")
        print(f"Audio file saved to: {_audio_file}")
    except subprocess.CalledProcessError as e:
        print(f"ffmpeg command failed: {e}")


# Convert a video to audio with ffmpeg-python
def video_to_audio(_video_file, _audio_file):
    input_stream = ffmpeg.input(_video_file)
    # Output parameters: 44.1 kHz sample rate, 2 channels
    output_stream = input_stream.output(_audio_file, ar='44100', ac='2')
    # Run the conversion
    output_stream.run()


if __name__ == "__main__":
    # Directory that holds the videos
    directory = r"D:\Users\Desktop\新建文件夹"
    # Directory for the extracted audio
    output_directory = r"D:\Users\Desktop\新建文件夹"
    # Make sure the output directory exists
    os.makedirs(output_directory, exist_ok=True)
    # Collect all video files in the directory (extension check is case-insensitive)
    video_files = [f for f in os.listdir(directory)
                   if f.lower().endswith(('.mp4', '.ts', '.avi', '.mov'))]
    # Process pool with at most 4 worker processes
    po = multiprocessing.Pool(processes=4)
    # Convert the videos in parallel
    for video_file in video_files:
        filepath = os.path.join(directory, video_file)
        audiopath = os.path.join(output_directory, os.path.splitext(video_file)[0] + '.mp3')
        print(f"Video path: {filepath} \nAudio path: {audiopath}")
        po.apply_async(video_to_audio, args=(filepath, audiopath))
    # Close the pool; after this it accepts no new tasks
    po.close()
    # Wait for the worker processes to finish
    po.join()
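4 Summary
On Faster-Whisper's recognition accuracy:
- The small model is the one I mainly ran on a low-spec laptop
- The large-v2 model still makes some recognition mistakes, but it is more accurate and I am fairly satisfied with it
The main benefit is that recognition works offline, which is convenient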