
Evaluation Commands and Records

Huang Xi · About 1 minute · Model Evaluation

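📊 Model Evaluation Commands and Records

Evaluation commands

The commands and configuration parameters used to evaluate the models.

Evaluation records

Result records and data analysis from the model evaluations.

🎯 Evaluation Commands

Evaluation command for the vulnerability detection and repair model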

evalscope eval \
  --work-dir /home/test/llm-eval/eval-outputs/99一次性过/漏洞检测与修复甲乙丙 \
  --model XingYun-Vul-72B \
  --api-url http://192.168.63.100:8586/v1/chat/completions \
  --api-key 1 \
  --datasets vuln_evaluate_third vuln_evaluate_accept vuln_repair_evaluate_third vuln_repair_evaluate_accept vuln_evaluate vuln_repair_evaluate \
  --eval-type openai_api \
  --generation-config '{"temperature": 0, "max_new_tokens": 8192}'

Evaluation command for the defect detection and repair model

evalscope eval \
  --work-dir /home/test/llm-eval/eval-outputs/99一次性过/缺陷检测与修复甲乙丙 \
  --model XingYun-72B \
  --api-url http://192.168.68.75:8080/generation/v1 \
  --api-key 1 \
  --datasets defect_evaluate defect_evaluate_third defect_evaluate_accept defect_repair_evaluate defect_repair_evaluate_third defect_repair_evaluate_accept \
  --eval-type openai_api \
  --generation-config '{"temperature": 0, "max_new_tokens": 8192}'
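
The two commands above differ only in the work directory, model name, API URL, and dataset list. A minimal Python sketch of assembling such a command from shared settings is given here. The helper name `build_eval_command` and the `COMMON` dictionary are illustrative assumptions, not part of evalscope; only the flags that appear in the commands above are used.

```python
import json
import shlex

# Settings shared by both runs, copied from the commands above.
COMMON = {
    "api_key": "1",
    "eval_type": "openai_api",
    "generation_config": {"temperature": 0, "max_new_tokens": 8192},
}


def build_eval_command(work_dir, model, api_url, datasets):
    """Return the evalscope argv list for one run (hypothetical helper)."""
    return [
        "evalscope", "eval",
        "--work-dir", work_dir,
        "--model", model,
        "--api-url", api_url,
        "--api-key", COMMON["api_key"],
        "--datasets", *datasets,
        "--eval-type", COMMON["eval_type"],
        "--generation-config", json.dumps(COMMON["generation_config"]),
    ]


vuln_cmd = build_eval_command(
    work_dir="/home/test/llm-eval/eval-outputs/99一次性过/漏洞检测与修复甲乙丙",
    model="XingYun-Vul-72B",
    api_url="http://192.168.63.100:8586/v1/chat/completions",
    datasets=[
        "vuln_evaluate_third", "vuln_evaluate_accept",
        "vuln_repair_evaluate_third", "vuln_repair_evaluate_accept",
        "vuln_evaluate", "vuln_repair_evaluate",
    ],
)

# Print a shell-quoted line for inspection; the list itself could be
# handed to subprocess.run(vuln_cmd, check=True) to launch the run.
print(shlex.join(vuln_cmd))
```

Keeping the command as a list avoids quoting problems with the JSON value of `--generation-config` and with the non-ASCII work directory.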

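📓 Study Notes

Fundamentals

Basic Python syntax and concepts

Advanced topics

Object-oriented programming, decorators, and other advanced features

Hands-on projects

Practical project experience and code examples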

📈 Evaluation Records

December 10, 2025, morning record (vLLM version: v0.10.0)

Vulnerability detection evaluation results

vuln_evaluate_third report
+-----------------+---------------------+---------------+----------+-------+---------+---------+
| Model           | Dataset             | Metric        | Subset   |   Num |   Score | Cat.0   |
+=================+=====================+===============+==========+=======+=========+=========+
| XingYun-Vul-72B | vuln_evaluate_third | mean_accuracy | default  |   600 |  0.9317 | default |
+-----------------+---------------------+---------------+----------+-------+---------+---------+
vuln_evaluate_accept report
+-----------------+----------------------+---------------+----------+-------+---------+---------+
| Model           | Dataset              | Metric        | Subset   |   Num |   Score | Cat.0   |
+=================+======================+===============+==========+=======+=========+=========+
| XingYun-Vul-72B | vuln_evaluate_accept | mean_accuracy | default  |   600 |    0.93 | default |
+-----------------+----------------------+---------------+----------+-------+---------+---------+
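
To compare runs programmatically, the grid-style reports can be read back into records. The following is a rough sketch that assumes the report text has been saved or pasted as a string; `parse_report` is a hypothetical helper written for this page, not an evalscope function.

```python
def parse_report(text):
    """Parse one ASCII grid report into a list of row dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ and +===+ border lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, cells)) for cells in data]


# Report text copied from the vuln_evaluate_third table above.
REPORT = """\
+-----------------+---------------------+---------------+----------+-------+---------+---------+
| Model           | Dataset             | Metric        | Subset   |   Num |   Score | Cat.0   |
+=================+=====================+===============+==========+=======+=========+=========+
| XingYun-Vul-72B | vuln_evaluate_third | mean_accuracy | default  |   600 |  0.9317 | default |
+-----------------+---------------------+---------------+----------+-------+---------+---------+
"""

for row in parse_report(REPORT):
    print(row["Dataset"], int(row["Num"]), float(row["Score"]))
```

All values come back as strings, so numeric columns such as `Num` and `Score` are converted at the point of use.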

Vulnerability repair evaluation results

vuln_repair_evaluate_third report
+-----------------+----------------------------+---------------+----------+-------+---------+---------+
| Model           | Dataset                    | Metric        | Subset   |   Num |   Score | Cat.0   |
+=================+============================+===============+==========+=======+=========+=========+
| XingYun-Vul-72B | vuln_repair_evaluate_third | mean_accuracy | default  |   500 |   0.912 | default |
+-----------------+----------------------------+---------------+----------+-------+---------+---------+
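
Each report gives a sample count (Num) and a mean accuracy (Score), so the approximate number of correct samples is Num × Score. The sketch below assumes that `mean_accuracy` is the plain average of per-sample 0/1 scores, which the reports do not state, and that the printed score is rounded, so the counts are estimates only. The helper name `estimated_correct` is illustrative.

```python
# (dataset, Num, Score) copied from the reports above.
REPORTS = [
    ("vuln_evaluate_third", 600, 0.9317),
    ("vuln_evaluate_accept", 600, 0.93),
    ("vuln_repair_evaluate_third", 500, 0.912),
]


def estimated_correct(num, score):
    """Estimate the count of correct samples, assuming 0/1 per-sample scoring."""
    return round(num * score)


for name, num, score in REPORTS:
    correct = estimated_correct(num, score)
    print(f"{name}: about {correct}/{num} correct, about {num - correct} incorrect")
```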