Speech Recognition with DeepSpeech on Ubuntu (Including Cross-Compilation)

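Table of contents

Preface
I. Building DeepSpeech
II. DeepSpeech usage example
III. Core code analysis
- 1. Core code for creating the model
- 2. Core code for the recognition process
IV. Cross-compilation
- 1. Cross-compilation
- 2. Usage
Summary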

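Preface

My work requires speech recognition on an ARM Linux target, so I wanted to get it running on Ubuntu first. I looked into open-source speech recognition frameworks and shortlisted several (see my earlier post on building vosk), and I am now trying them one by one.

This post builds and uses DeepSpeech on Ubuntu, and then covers cross-compilation.

Copyright notice: 山河君. Reproduction without the author's permission is prohibited.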

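I. Building DeepSpeech

If you would like to try building it yourself, start here. If you just want to use the library files directly, skip this section; prebuilt binaries for the officially supported platforms are pointed out later in this post.

That said, I recommend building it yourself at least once, because the source tree includes the official example file, which is worth reading.

Install the dependencies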

sudo apt-get update
sudo apt-get install -y build-essential libatlas-base-dev libfftw3-dev libgfortran5 sox libsox-dev
sudo apt-get install libmagic-dev

Download the DeepSpeech source

git clone https://github.com/mozilla/DeepSpeech.git
cd DeepSpeech
git submodule sync tensorflow/
git submodule update --init tensorflow/

DeepSpeech is built with Bazel, so install Bazel

sudo apt install curl
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
sudo apt update && sudo apt install bazel

Configure TensorFlow

cd tensorflow
./configure
ln -s ../native_client

If native_client does not yet exist inside the tensorflow directory, the ln -s command above creates it as a symbolic link.

Build

Library only

bazel build --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" --copt=-fvisibility=hidden //native_client:libdeepspeech.so

Library and executable

bazel build --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:generate_scorer_package

The native_client directory contains the deepspeech executable. Note that the header file is deepspeech.h, and client.cc is the C++ example file.
The corresponding library files are under tensorflow/bazel-bin/native_client.

II. DeepSpeech usage example

Model download link

Model file: deepspeech-0.9.3-models-zh-CN.pbmm
Scorer file: deepspeech-0.9.3-models-zh-CN.scorer

./deepspeech --model /home/aaron/workplace/audioread/deepspeech-0.9.3-models-zh-CN.pbmm --scorer /home/aaron/workplace/audioread/deepspeech-0.9.3-models-zh-CN.scorer --audio /home/aaron/workplace/audioread/test.wav
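The command above takes a WAV file as input. As a hedged aside, the released DeepSpeech models are generally trained on 16 kHz, mono, 16-bit PCM audio; this post does not state that, so treat it as an assumption and check it against the model you download. The sketch below reads only the canonical 44-byte WAV header layout (real files may carry extra chunks), and WavInfo, parse_canonical_wav_header and matches_expected_format are hypothetical names, not part of DeepSpeech.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical container for the header fields checked below.
struct WavInfo {
  std::uint16_t channels = 0;
  std::uint32_t sample_rate = 0;
  std::uint16_t bits_per_sample = 0;
};

// Read little-endian integers from a byte pointer.
static std::uint16_t read_u16(const std::uint8_t* p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t read_u32(const std::uint8_t* p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Parses the canonical layout only: "RIFF" at 0, "WAVE" at 8, "fmt " at 12.
// Returns false if the buffer is too short or the tags do not match.
bool parse_canonical_wav_header(const std::vector<std::uint8_t>& bytes,
                                WavInfo* out) {
  assert(out != nullptr);
  if (bytes.size() < 44) return false;
  const std::uint8_t* p = bytes.data();
  if (std::memcmp(p, "RIFF", 4) != 0) return false;
  if (std::memcmp(p + 8, "WAVE", 4) != 0) return false;
  if (std::memcmp(p + 12, "fmt ", 4) != 0) return false;
  out->channels = read_u16(p + 22);
  out->sample_rate = read_u32(p + 24);
  out->bits_per_sample = read_u16(p + 34);
  return true;
}

// Assumption: the model expects 16 kHz, mono, 16-bit samples.
bool matches_expected_format(const WavInfo& info) {
  return info.channels == 1 && info.sample_rate == 16000 &&
         info.bits_per_sample == 16;
}
```

If the check fails, converting the file beforehand (for example with sox, which is installed in the dependency step) is one option.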


III. Core code analysis

The core code is the example code in the client.cc file mentioned above.

1. Core code for creating the model

// Initialise DeepSpeech
ModelState* ctx;
// sphinx-doc: c_ref_model_start
int status = DS_CreateModel(model, &ctx);
if (status != 0) {
  char* error = DS_ErrorCodeToErrorMessage(status);
  fprintf(stderr, "Could not create model: %s\n", error);
  free(error);
  return 1;
}
if (set_beamwidth) {
  status = DS_SetModelBeamWidth(ctx, beam_width);
  if (status != 0) {
    fprintf(stderr, "Could not set model beam width.\n");
    return 1;
  }
}
if (scorer) {
  status = DS_EnableExternalScorer(ctx, scorer);
  if (status != 0) {
    fprintf(stderr, "Could not enable external scorer.\n");
    return 1;
  }
  if (set_alphabeta) {
    status = DS_SetScorerAlphaBeta(ctx, lm_alpha, lm_beta);
    if (status != 0) {
      fprintf(stderr, "Error setting scorer alpha and beta.\n");
      return 1;
    }
  }
}
// sphinx-doc: c_ref_model_stop
status = DS_AddHotWord(ctx, word, boost);
if (status != 0) {
  fprintf(stderr, "Could not enable hot-word.\n");
  return 1;
}

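DS_CreateModel: creates the model.
DS_SetModelBeamWidth: sets the beam width of the search. Larger values are generally more accurate but cost more computation.
DS_EnableExternalScorer: enables the external scorer.
DS_SetScorerAlphaBeta: sets the scorer hyperparameters. Alpha is the weight of the language model; beta is the word insertion weight, which rewards or penalizes candidate paths according to their word count.
DS_AddHotWord: boosts specific words or phrases so that they are preferred during recognition.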
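To make the roles of alpha and beta more concrete, here is a minimal sketch of the combined score as formulated in the Deep Speech paper: the acoustic log-probability plus alpha times the language model log-probability plus beta times the word count. The function combined_score and its parameter names are hypothetical and not part of deepspeech.h; the real decoder applies these weights incrementally inside the beam search, so this is only an illustration of the idea.

```cpp
#include <cassert>

// Hypothetical helper illustrating the scoring idea, not a DeepSpeech API.
//   acoustic_log_prob: log-probability of the candidate from the acoustic model
//   lm_log_prob:       log-probability of the candidate from the language model
//   word_count:        number of words in the candidate
//   alpha:             language model weight
//   beta:              word insertion weight
double combined_score(double acoustic_log_prob, double lm_log_prob,
                      int word_count, double alpha, double beta) {
  assert(word_count >= 0);
  return acoustic_log_prob + alpha * lm_log_prob + beta * word_count;
}
```

With alpha set to 0 the language model has no influence on the ranking; raising beta favors candidates that contain more words.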

2. Core code for the recognition process

if (extended_output) {
  Metadata *result = DS_SpeechToTextWithMetadata(aCtx, aBuffer, aBufferSize, 1);
  res.string = CandidateTranscriptToString(&result->transcripts[0]);
  DS_FreeMetadata(result);
} else if (json_output) {
  Metadata *result = DS_SpeechToTextWithMetadata(aCtx, aBuffer, aBufferSize, json_candidate_transcripts);
  res.string = MetadataToJSON(result);
  DS_FreeMetadata(result);
} else if (stream_size > 0) {
  StreamingState* ctx;
  int status = DS_CreateStream(aCtx, &ctx);
  if (status != DS_ERR_OK) {
    res.string = strdup("");
    return res;
  }
  size_t off = 0;
  const char *last = nullptr;
  const char *prev = nullptr;
  while (off < aBufferSize) {
    size_t cur = aBufferSize - off > stream_size ? stream_size : aBufferSize - off;
    DS_FeedAudioContent(ctx, aBuffer + off, cur);
    off += cur;
    prev = last;
    const char* partial = DS_IntermediateDecode(ctx);
    if (last == nullptr || strcmp(last, partial)) {
      printf("%s\n", partial);
      last = partial;
    } else {
      DS_FreeString((char *) partial);
    }
    if (prev != nullptr && prev != last) {
      DS_FreeString((char *) prev);
    }
  }
  if (last != nullptr) {
    DS_FreeString((char *) last);
  }
  res.string = DS_FinishStream(ctx);
} else if (extended_stream_size > 0) {
  StreamingState* ctx;
  int status = DS_CreateStream(aCtx, &ctx);
  if (status != DS_ERR_OK) {
    res.string = strdup("");
    return res;
  }
  size_t off = 0;
  const char *last = nullptr;
  const char *prev = nullptr;
  while (off < aBufferSize) {
    size_t cur = aBufferSize - off > extended_stream_size ? extended_stream_size : aBufferSize - off;
    DS_FeedAudioContent(ctx, aBuffer + off, cur);
    off += cur;
    prev = last;
    const Metadata* result = DS_IntermediateDecodeWithMetadata(ctx, 1);
    const char* partial = CandidateTranscriptToString(&result->transcripts[0]);
    if (last == nullptr || strcmp(last, partial)) {
      printf("%s\n", partial);
      last = partial;
    } else {
      free((char *) partial);
    }
    if (prev != nullptr && prev != last) {
      free((char *) prev);
    }
    DS_FreeMetadata((Metadata *)result);
  }
  const Metadata* result = DS_FinishStreamWithMetadata(ctx, 1);
  res.string = CandidateTranscriptToString(&result->transcripts[0]);
  DS_FreeMetadata((Metadata *)result);
  free((char *) last);
} else {
  res.string = DS_SpeechToText(aCtx, aBuffer, aBufferSize);
}

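The key DeepSpeech interfaces:

Function	Input	Output	Use case	Pros and cons
DS_SpeechToText	Complete audio data	Complete transcript	One-shot processing of an audio file	Simple and direct, suits batch processing, but cannot handle real-time streams
DS_SpeechToTextWithMetadata	Complete audio data	Complete transcript plus metadata (such as timestamps)	Cases that need per-word or per-token timestamps	More detailed output, suits subtitles and similar uses, at slightly higher complexity
DS_IntermediateDecode	Audio fed chunk by chunk (via DS_FeedAudioContent)	Incrementally updated transcript	Real-time recognition such as voice assistants and live transcription	Low-latency output that suits streaming, but intermediate results may be inaccurate
DS_IntermediateDecodeWithMetadata	Audio fed chunk by chunk (via DS_FeedAudioContent)	Incrementally updated transcript plus metadata	Real-time recognition that also needs details such as word-level timestamps and confidence	More complete output, suits live subtitles and similar uses, at higher complexity

Note: for recognizing speech from a file, use the first two. For streaming, where latency or real-time audio matters, use the last two.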
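The streaming branches in client.cc all rest on the same chunking loop: take at most a fixed number of samples from the buffer, feed them, advance the offset. The sketch below isolates that loop so it can be read and tested without the library. FeedFn is a hypothetical stand-in for a call such as DS_FeedAudioContent(ctx, data, count), and feed_in_chunks is an illustrative name, not a DeepSpeech function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback standing in for DS_FeedAudioContent on a stream.
using FeedFn = std::function<void(const short* data, std::size_t count)>;

// Feeds `buffer` to `feed` in pieces of at most `chunk_size` samples,
// mirroring the offset arithmetic of the while loop in client.cc.
void feed_in_chunks(const short* buffer, std::size_t buffer_size,
                    std::size_t chunk_size, const FeedFn& feed) {
  assert(chunk_size > 0);  // a zero chunk size would never advance
  std::size_t off = 0;
  while (off < buffer_size) {
    const std::size_t remaining = buffer_size - off;
    const std::size_t cur = remaining > chunk_size ? chunk_size : remaining;
    feed(buffer + off, cur);
    off += cur;
  }
}
```

In a real client the callback would feed the stream and then request an intermediate decode after each chunk; a buffer of 10 samples with a chunk size of 4 results in three calls with 4, 4 and 2 samples.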

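IV. Cross-compilation

1. Cross-compilation

I strongly advise against cross-compiling it yourself and recommend using the official builds instead. Cross-compiling requires setting up the cross-compilation environment starting from the TensorFlow side. The Bazel configuration does contain settings for the aarch64 environment (in the hidden file .bazelrc):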

build:elinux_aarch64 --config=elinux
build:elinux_aarch64 --cpu=aarch64

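However, when I tried them, the build failed with all sorts of errors.

Official prebuilt files are available and cover a wide range of platforms; the screenshot below shows the release version.
[Screenshot: release version]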

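Files are provided per platform; download the one you need.
[Screenshot: per-platform files]

After extraction:
[Screenshot: extracted files]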

2. Usage

Note that on Linux aarch64 the model is subject to resource constraints, so the .tflite model should be used rather than the .pbmm one.
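A small sketch of the rule above, for code that must run on both the Ubuntu host and the aarch64 target. The function model_path_for and the architecture strings are hypothetical; the only rule taken from this post is that aarch64 uses the .tflite model while the Ubuntu example uses .pbmm.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: picks the model file for a given architecture name
// (for example the output of `uname -m`). Not part of DeepSpeech.
std::string model_path_for(const std::string& base_name,
                           const std::string& arch) {
  assert(!base_name.empty());
  // aarch64 targets use the TensorFlow Lite model, as noted above.
  if (arch == "aarch64") {
    return base_name + ".tflite";
  }
  // Otherwise fall back to the .pbmm model used in the Ubuntu example.
  return base_name + ".pbmm";
}
```

The returned path can then be passed to the --model option of the deepspeech executable, or to DS_CreateModel.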

Results of running it on an RK chip; in practice it is fairly resource-intensive.
[Screenshot: results on the RK chip]

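Summary

So far I have tried vosk and PocketSphinx; see my earlier posts if you are interested. There are two more that I have not written up yet, Snowboy and Julius, and interested readers are welcome to discuss them with me.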