Load Testing Large Language Models with evalscope

evalscope

evalscope GitHub repository

https://github.com/modelscope/evalscope

Installation

Create a virtual environment with conda (you can also install directly into an existing environment):
```
conda create -n evalscope python=3.10 -y
conda activate evalscope
```

Install the dependencies with pip

pip install evalscope
pip install 'evalscope[perf]'

evalscope[perf] is the part of the evalscope library used for load testing large models.

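Key parameters

Explanation of the evalscope load-testing parameters

Parameter	Description
url	Target address of the API requests
parallel	Number of concurrent requests
model	Model name
number	Number of samples (requests) to send during the test
api	API type to use; this article uses the OpenAI-compatible API format
dataset	Type of dataset used for the test
stream	Whether to use streaming responses; False means non-streaming responses
debug	Whether to enable debug mode; False means disabled
headers	HTTP request headers; can carry the API key for authentication
connect_timeout	Connection timeout, in seconds
read_timeout	Read timeout, in seconds
max_tokens	Maximum number of tokens allowed in the generated reply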

Usage

Take an OpenAI-style API as an example. Here we load test SiliconFlow; the curl command is as follows:

curl --location 'https://api.siliconflow.cn/v1/chat/completions' \
--header 'Authorization: Bearer your_api_key' \
--data '{"model": "Qwen/Qwen2.5-72B-Instruct","messages": [{"role": "system","content": "你是一个助手"},{"role": "user","content": "你好"}],"max_tokens": 512,"stream": true
}'

For the curl command above, the load-test code is as follows (replace your_api_key with your own key):

from evalscope.perf.main import run_perf_benchmark


def run_perf(parallel):
    # Task configuration; see the parameter table above for each key
    task_cfg = {
        'url': 'https://api.siliconflow.cn/v1/chat/completions',
        'parallel': parallel,
        'model': 'Qwen/Qwen2.5-72B-Instruct',
        'number': 20,
        'api': 'openai',
        'dataset': 'openqa',
        'stream': False,
        'debug': False,
        'headers': {'Authorization': 'Bearer your_api_key'},
        'connect_timeout': 6000,
        'read_timeout': 6000,
        'max_tokens': 512,
    }
    run_perf_benchmark(task_cfg)


run_perf(parallel=1)

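One point to note: if you deployed the model yourself and need to measure the time to first token (when the first token appears), set stream to True. Otherwise the non-streaming response arrives all at once, so the first frame is also the last frame.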
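To make the difference concrete, here is a minimal sketch of how time to first token and total latency relate to the arrival times of response chunks. It is not evalscope's implementation; the function name `ttft_and_latency` and the timestamps are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def ttft_and_latency(request_sent: float, chunk_arrivals: List[float]) -> Tuple[float, float]:
    """Return (time to first token, total latency) in seconds.

    request_sent: timestamp at which the request was sent.
    chunk_arrivals: timestamps at which response chunks arrived, in order.
    """
    if not chunk_arrivals:
        raise ValueError("at least one response chunk is required")
    ttft = chunk_arrivals[0] - request_sent      # first chunk received
    latency = chunk_arrivals[-1] - request_sent  # last chunk received
    return ttft, latency


# Streaming: chunks arrive over time, so TTFT is smaller than the latency.
streaming = ttft_and_latency(0.0, [0.5, 1.0, 1.5, 2.0])

# Non-streaming: the whole reply is a single chunk, so TTFT equals the latency.
non_streaming = ttft_and_latency(0.0, [2.0])
```

With a single chunk the two values coincide, which is why a non-streaming run cannot tell you how long the first token took.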

Output explanation

The results are written under the outputs folder, 5 files in total.

The two most important files are benchmark_summary.json and benchmark_percentile.json.

benchmark_summary.json

{
  "Time taken for tests (s)": 203.0674,
  "Number of concurrency": 1,
  "Total requests": 20,
  "Succeed requests": 20,
  "Failed requests": 0,
  "Output token throughput (tok/s)": 31.0685,
  "Total token throughput (tok/s)": 36.0176,
  "Request throughput (req/s)": 0.0985,
  "Average latency (s)": 10.1525,
  "Average time to first token (s)": 10.1525,
  "Average time per output token (s)": 0.0328,
  "Average input tokens per request": 50.25,
  "Average output tokens per request": 315.45,
  "Average package latency (s)": 10.1525,
  "Average package per request": 1.0,
  "Expected number of requests": 20,
  "Result DB path": "./outputs/20250430_230534/Qwen2.5-72B-Instruct/benchmark_data.db"
}
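As a sanity check, the throughput figures in this summary can be recomputed from its other fields. The sketch below assumes that request throughput is the number of successful requests divided by the test duration, and that token throughput is the total token count divided by the test duration. These formulas are an assumption that happens to reproduce the reported numbers, not something taken from evalscope's source.

```python
# Values copied from the benchmark_summary.json shown above
summary = {
    "Time taken for tests (s)": 203.0674,
    "Succeed requests": 20,
    "Average input tokens per request": 50.25,
    "Average output tokens per request": 315.45,
}

duration = summary["Time taken for tests (s)"]
requests = summary["Succeed requests"]
input_tokens = summary["Average input tokens per request"] * requests
output_tokens = summary["Average output tokens per request"] * requests

request_throughput = requests / duration
output_throughput = output_tokens / duration
total_throughput = (input_tokens + output_tokens) / duration

print(round(request_throughput, 4))  # 0.0985
print(round(output_throughput, 4))   # 31.0685
print(round(total_throughput, 4))    # 36.0176
```

The average latency (10.1525 s) times 20 requests is also close to the total test time, which is what you would expect with a concurrency of 1.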

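Field	Meaning
Time taken for tests (s)	Total test duration (seconds)
Number of concurrency	Number of concurrent requests
Total requests	Total number of requests sent during the test
Succeed requests	Number of requests that completed successfully
Failed requests	Number of failed requests
Output token throughput (tok/s)	Output tokens per second
Total token throughput (tok/s)	Total tokens per second
Request throughput (req/s)	Requests per second, i.e. QPS
Average latency (s)	Average time from sending a request to receiving the complete response
Average time to first token (s)	Average time from sending a request to receiving the first token. Because stream=False was set, there is only one frame here, so the value equals the average latency
Average input tokens per request	Average input length per request
Average output tokens per request	Average output length per request
Average time per output token (s)	Average time to generate each token
Average package per request	In streaming mode, the average number of packets received per request
Average package latency (s)	In streaming mode, the average latency of receiving each packet
Result DB path	Request details for every sample; can be opened with sqlite3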
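Since the schema of benchmark_data.db is not described here, a cautious first step is to list its tables and columns rather than assume any names. The sketch below uses only Python's standard sqlite3 module; the helper name `describe_sqlite_db` is hypothetical.

```python
import sqlite3
from typing import Dict, List


def describe_sqlite_db(db_path: str) -> Dict[str, List[str]]:
    """Map each table name in the database to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the column name.
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Call it with the value of Result DB path, for example `describe_sqlite_db("./outputs/20250430_230534/Qwen2.5-72B-Instruct/benchmark_data.db")`. Note that sqlite3.connect creates an empty database file if the path does not exist, so check the path first.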

benchmark_percentile.json

[
  {
    "Percentile": "10%",
    "TTFT (s)": 5.4016,
    "ITL (s)": NaN,
    "TPOT (s)": 0.0298,
    "Latency (s)": 5.4016,
    "Input tokens": 42,
    "Output tokens": 167,
    "Output throughput(tok/s)": 28.0987,
    "Total throughput(tok/s)": 32.7871
  },
  ...... (the remaining entries are omitted because they are too long)
]

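Field	Meaning
`Percentile`	Percentile, the position of a given percentage in the data distribution, for example 10%, 25%, 50%.
`TTFT (s)`	Time to First Token, the time needed to produce the first output token (seconds).
`ITL (s)`	Inter-token Latency, the time between consecutive output tokens (seconds). It is `NaN` in this file, most likely because with stream=False the response arrives as a single packet and there are no gaps between tokens to measure.
`TPOT (s)`	Time per Output Token, the time needed to generate each output token (seconds).
`Latency (s)`	Latency (seconds); equal to `TTFT (s)` when stream=False.
`Input tokens`	Number of input tokens.
`Output tokens`	Number of output tokens.
`Output throughput(tok/s)`	Output throughput, the number of output tokens generated per second.
`Total throughput(tok/s)`	Total throughput, the number of tokens processed per second (input plus output tokens).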
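Note that `NaN` is not valid in strict JSON, but Python's json module accepts it by default and parses it as float('nan'). Below is a minimal sketch of reading a percentile row and dropping the metrics that are NaN. The helper name `valid_metrics` is hypothetical, and the row is the 10% entry shown above.

```python
import json
import math

raw = '''[{"Percentile": "10%", "TTFT (s)": 5.4016, "ITL (s)": NaN,
"TPOT (s)": 0.0298, "Latency (s)": 5.4016, "Input tokens": 42,
"Output tokens": 167, "Output throughput(tok/s)": 28.0987,
"Total throughput(tok/s)": 32.7871}]'''


def valid_metrics(row: dict) -> dict:
    """Drop entries whose value is a float NaN; keep everything else."""
    return {
        key: value
        for key, value in row.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


rows = json.loads(raw)  # NaN is parsed as float('nan')
cleaned = [valid_metrics(row) for row in rows]
print(sorted(set(rows[0]) - set(cleaned[0])))  # ['ITL (s)']
```

Filtering NaN early avoids surprises later, since any comparison with NaN is false and averages that include it become NaN as well.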

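Parsing code

If you need to load test several models, or the same model at several concurrency levels, you can use a for loop. All results are saved in the outputs folder.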
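A minimal sketch of such a loop follows. It only builds one config dict per (model, concurrency) pair, reusing keys from the usage example earlier, and hands each config to a runner function supplied by the caller. The helper names `build_task_cfgs` and `run_sweep` and the concurrency values 1, 2, 4 are illustrative assumptions.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List


def build_task_cfgs(models: Iterable[str], parallels: Iterable[int],
                    base_cfg: Dict) -> List[Dict]:
    """Create one task config per (model, parallel) combination."""
    cfgs = []
    for model, parallel in product(models, parallels):
        cfg = dict(base_cfg)  # shallow copy so base_cfg is not modified
        cfg.update({"model": model, "parallel": parallel})
        cfgs.append(cfg)
    return cfgs


def run_sweep(cfgs: Iterable[Dict], runner: Callable[[Dict], object]) -> None:
    """Run the configs one after another with the given runner."""
    for cfg in cfgs:
        runner(cfg)


# Settings shared by every run (replace your_api_key with your own key)
base_cfg = {
    "url": "https://api.siliconflow.cn/v1/chat/completions",
    "number": 20,
    "api": "openai",
    "dataset": "openqa",
    "stream": False,
    "headers": {"Authorization": "Bearer your_api_key"},
    "max_tokens": 512,
}
cfgs = build_task_cfgs(["Qwen/Qwen2.5-72B-Instruct"], [1, 2, 4], base_cfg)
```

To actually run the sweep, pass evalscope's entry point as the runner: import `run_perf_benchmark` from `evalscope.perf.main` as in the usage example, then call `run_sweep(cfgs, run_perf_benchmark)`.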

The parsing code below can be used to merge the results:

import json
import os

import pandas as pd


def load_benchmark_data(outputs_dir):
    """Load the benchmark data under the outputs directory and return two DataFrames.

    1. benchmark_percentile_df: data from every benchmark_percentile.json,
       with "Number of concurrency" added.
    2. benchmark_summary_df: data from every benchmark_summary.json.

    :param outputs_dir: path to the outputs directory
    :return: (benchmark_percentile_df, benchmark_summary_df)
    """
    percentile_data = []
    summary_data = []

    # Walk the outputs directory: outputs/<time_dir>/<model_name>/
    for time_dir in os.listdir(outputs_dir):
        time_path = os.path.join(outputs_dir, time_dir)
        if not os.path.isdir(time_path):
            continue

        for model_name in os.listdir(time_path):
            model_path = os.path.join(time_path, model_name)
            if not os.path.isdir(model_path):
                continue

            # Locate benchmark_percentile.json and benchmark_summary.json
            percentile_file = os.path.join(model_path, "benchmark_percentile.json")
            summary_file = os.path.join(model_path, "benchmark_summary.json")

            if os.path.exists(percentile_file) and os.path.exists(summary_file):
                # Load benchmark_summary.json
                with open(summary_file, "r", encoding="utf-8") as f:
                    summary = json.load(f)
                summary_data.append({
                    "time": time_dir,
                    "model_name": model_name,
                    **summary,
                })

                # Load benchmark_percentile.json
                with open(percentile_file, "r", encoding="utf-8") as f:
                    percentiles = json.load(f)
                for percentile in percentiles:
                    percentile_data.append({
                        "time": time_dir,
                        "model_name": model_name,
                        **percentile,
                        "Number of concurrency": summary.get("Number of concurrency", None),
                    })

    # Convert to DataFrames
    benchmark_percentile_df = pd.DataFrame(percentile_data)
    benchmark_summary_df = pd.DataFrame(summary_data)
    return benchmark_percentile_df, benchmark_summary_df


# Example call
outputs_dir = "c:/Users/孟智超/Desktop/eval_scope解析/outputs"
percentile_df, summary_df = load_benchmark_data(outputs_dir)

# Print the results
print(percentile_df.head())
print(summary_df.head())
