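Testing and Evaluating Large Language Models (LLMs): Key Metrics and Best Practices (Part 1)

Large language models (LLMs) have transformed natural language processing (NLP), driving applications such as text generation, summarization, and translation. Evaluating them remains difficult, however, because quality depends on several factors at once, including coherence, fluency, accuracy, and fairness.

This post walks through LLM evaluation metrics: why they matter, best practices, and practical applications.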

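Why Do Evaluation Metrics Matter?

LLM evaluation is essential for:

Ensuring model accuracy: verifying that responses meet user expectations.
Detecting bias: identifying and mitigating unintended bias.
Improving performance: optimizing the model against measurable results.
Comparing models: benchmarking different LLMs to support better decisions.
Real-world applicability: confirming that the model performs well in practical scenarios, not just on benchmark datasets.

Without proper evaluation, a model may produce misleading or low-quality output, which erodes user trust and limits practical use. Metrics provide the signal needed to improve and fine-tune model responses.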

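Key Metrics for Evaluating LLMs

Accuracy Metrics

1. Perplexity

Perplexity measures how well an LLM predicts a sequence of tokens: it is the exponential of the average negative log-likelihood per token. Lower values indicate better performance. Let's compute the perplexity of GPT-2 with the Hugging Face transformers library: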

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load the GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Sample text
input_text = "Evaluate the performance of this model."
inputs = tokenizer(input_text, return_tensors="pt")

# Compute the loss without tracking gradients (no training)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss  # mean cross-entropy per predicted token
    perplexity = torch.exp(loss)  # PPL = exp(loss)

print(f"Perplexity: {perplexity.item()}")

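Interpreting Perplexity

Low perplexity (roughly 10–20): the model is confident and tends to generate fluent text.
High perplexity (100+): the model is uncertain, which tends to produce disjointed or less meaningful text.

These ranges are rough guides only; absolute values depend on the tokenizer, the vocabulary, and the evaluation text.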
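To make the definition concrete, here is a minimal sketch that computes perplexity directly from per-token probabilities, without any model. The function name `perplexity` and the probability lists are made up for illustration; in practice the probabilities come from the model's output distribution.

```python
import math

def perplexity(token_probs):
    """Perplexity given the probability the model assigned to each actual token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (cross-entropy in nats)
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities: a confident model vs. an uncertain one
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.01, 0.01, 0.01, 0.01])
print(f"confident: {confident:.2f}, uncertain: {uncertain:.2f}")
```

A model that assigns probability 0.5 to every correct token has perplexity 2, which reads as "as unsure as a choice between two equally likely options".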

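When to Use Perplexity

Evaluating language-model fluency: if a fine-tuned model has lower perplexity than the base model on the same held-out text, it has improved.
Comparing models: for example, GPT-4 versus GPT-3; the model with lower perplexity on the same text is more fluent, provided both are scored with comparable tokenization.
Checking a fine-tuned model for overfitting: if perplexity is low on the training data but high on the validation data, the model is overfitting.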

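2. Cross-Entropy Loss

Cross-entropy loss measures the difference between the predicted probability distribution and the actual labels. It is the standard loss for classification tasks; for a language model, each next-token prediction is a classification over the vocabulary, and perplexity is the exponential of the average cross-entropy.

Interpretation: a higher cross-entropy loss means a larger gap between the predicted and actual distributions, that is, less accurate predictions.

When to use cross-entropy loss: during training, to update model weights and steer the model toward more accurate predictions over time.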
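As a small worked example, the sketch below computes cross-entropy for one-hot targets over a hypothetical three-word vocabulary and shows its link to perplexity. The helper names and the probabilities are assumptions made for illustration, not part of any library.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy (in nats) of one prediction against a one-hot target."""
    return -math.log(predicted_probs[target_index])

def mean_cross_entropy(batch_probs, targets):
    """Average cross-entropy over a batch of predictions."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical predicted distributions, one row per prediction step
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [0, 1, 2]  # index of the correct word at each step

loss = mean_cross_entropy(batch_probs, targets)
print(f"loss={loss:.4f}, perplexity={math.exp(loss):.4f}")
```

The loss only looks at the probability given to the correct class, so pushing that probability toward 1 drives the loss toward 0.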

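Similarity Metrics

1. BLEU (Bilingual Evaluation Understudy)

BLEU compares generated text with reference text using n-gram overlap, precision, and a brevity penalty (BP). It is used for tasks such as machine translation and text summarization. The response is compared against one or more references, and the score is based on overlapping n-grams (word sequences). The higher the score, the closer the response is to the references.

When to use:

Evaluating customer-support chatbots
Checking machine-translated text

Application: an AI-powered customer-service chatbot must give precise answers. BLEU helps check how closely chatbot responses match human-written responses.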

from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

bleu_score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score}")
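To show what happens inside a BLEU score, here is a simplified, unsmoothed sketch of its two ingredients: clipped (modified) n-gram precision and the brevity penalty. The function names are my own, and this is not NLTK's implementation; in particular it returns 0 whenever any n-gram order has no match, which is why short sentences usually need smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, candidate, n):
    """Candidate n-gram precision, clipping each count by its max count in any reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(references, candidate):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def simple_bleu(references, candidate, max_n=4):
    """Geometric mean of modified precisions (n = 1..max_n) times the brevity penalty."""
    precisions = [modified_precision(references, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, candidate) * math.exp(log_mean)

ref_tokens = "the quick brown fox jumps over the lazy dog".split()
print(simple_bleu([ref_tokens], ref_tokens))      # identical candidate
print(simple_bleu([ref_tokens], ref_tokens[:5]))  # truncated candidate, hit by the brevity penalty
```

Clipping is what stops a degenerate candidate such as "the the the" from scoring perfect unigram precision against a reference that contains "the" once.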

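2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE focuses on recall: it measures how much of the key information in the reference text the generated text captures. It is used mainly for summarization. Unlike precision-based BLEU, ROUGE is recall-oriented, although implementations often report an F-measure that combines precision and recall.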

from evaluate import load

# Load the ROUGE metric
rouge = load('rouge')

# Candidate summaries
candidates = [
    "Artificial intelligence is transforming industries.",
    "Deep learning models require a lot of data.",
    "Self-driving cars are the future of transportation."
]

# Reference summaries (multiple references per candidate make the evaluation more robust)
references = [
    ["AI is changing the way businesses operate.", "Industries are evolving due to artificial intelligence."],
    ["Deep learning systems depend on large datasets.", "Training deep models requires significant data."],
    ["Autonomous vehicles will revolutionize travel.", "The future of transportation includes self-driving technology."]
]

# Compute ROUGE scores
results = rouge.compute(predictions=candidates, references=references)

# Print the results
print(results)

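Example output (illustrative; actual values depend on the inputs and library version)

{
  'rouge1': 0.5645, # unigram (single-word) overlap between candidate and reference
  'rouge2': 0.3214, # bigram overlap, which checks phrase-level agreement
  'rougeL': 0.5484, # longest common subsequence (LCS), which captures sentence structure
  'rougeLsum': 0.5484
}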

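When to Use ROUGE

Summarization tasks: comparing generated summaries with human-written summaries.
Text-generation evaluation: assessing how similar machine-generated text is to reference text.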
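For intuition about what ROUGE-1, ROUGE-2, and ROUGE-L measure, the following sketch computes them on whitespace tokens for a single reference. It is a simplified illustration with made-up function names and sentences; real implementations also handle stemming, multiple references, and aggregation.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Precision, recall, and F1 of n-gram overlap between two token lists."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    overlap = sum((ref & cand).values())  # multiset intersection
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """Share of the reference covered by the longest common subsequence."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n(reference, candidate, 1))
print(rouge_n(reference, candidate, 2))
print(rouge_l_recall(reference, candidate))
```

One changed word costs a single unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 on the same pair.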

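3. METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR improves on BLEU by accounting for synonyms and word order, and it combines precision and recall.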

from evaluate import load

# Load the METEOR metric
meteor = load("meteor")

# Model-generated candidate texts
candidates = [
    "The quick brown fox jumps over a lazy dog.",
    "AI is revolutionizing technology and industry.",
    "Natural language processing is a subfield of AI."
]

# Reference texts (multiple references per candidate make the evaluation more robust)
references = [
    ["A fast brown fox leaps over a sleeping dog.", "The fox quickly jumps over the lazy dog."],
    ["Artificial intelligence is transforming businesses.", "Technology and industry are being reshaped by AI."],
    ["NLP is a branch of artificial intelligence.", "Natural language processing is part of AI."]
]

# Compute the METEOR score
results = meteor.compute(predictions=candidates, references=references)

# Print the results
print(results)

Example output (illustrative; actual values depend on the inputs and library version)

{
  'meteor': 0.6723
}

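Interpreting the METEOR Score

METEOR rewards matches beyond exact words:

Synonyms (for example, "quick" and "fast" are treated as similar).
Stemming (for example, "jumping" and "jumps" reduce to the same stem).
Word order (a fragmentation penalty lowers the score when matched words appear in a different order, which captures sentence structure more directly than n-gram overlap).

When to Use METEOR

Machine translation evaluation: METEOR is widely used to compare translated sentences with human-written ones.
Summarization and paraphrasing tasks: unlike BLEU and ROUGE, METEOR takes synonyms into account, so it suits cases where small wording changes are acceptable.
Conversational AI and chatbots: checking that AI-generated responses are semantically close to the expected replies.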
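The way METEOR combines precision, recall, and word order can be sketched as follows. This is a deliberately reduced version under stated assumptions: it matches exact tokens only (no stemming or synonyms), aligns greedily from left to right, and uses alpha=0.9, beta=3, gamma=0.5, which to my knowledge are the defaults in NLTK's implementation. The function name `simple_meteor` is my own.

```python
def simple_meteor(reference, candidate, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR sketch: recall-weighted F-mean times a fragmentation penalty."""
    used = set()
    matches = []  # (candidate_index, reference_index) pairs
    for ci, token in enumerate(candidate):
        for ri, ref_token in enumerate(reference):
            if ri not in used and token == ref_token:
                used.add(ri)
                matches.append((ci, ri))
                break
    m = len(matches)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    # Harmonic mean weighted heavily toward recall
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # A chunk is a run of matches that is contiguous in both texts
    chunks = 1
    for (c0, r0), (c1, r1) in zip(matches, matches[1:]):
        if not (c1 == c0 + 1 and r1 == r0 + 1):
            chunks += 1
    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)

in_order = simple_meteor("a b c d".split(), "a b c d".split())
reordered = simple_meteor("a b c d".split(), "c d a b".split())
print(in_order, reordered)
```

Both candidates contain exactly the reference words, so precision and recall are identical; only the fragmentation penalty separates them, which is the word-order effect described above.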

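4. BERTScore

BERTScore is a powerful NLP metric that evaluates text similarity with deep contextual embeddings rather than exact word matches. Unlike ROUGE, BLEU, and METEOR, it uses BERT-based embeddings to capture the semantic meaning of a sentence.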

from evaluate import load

# Load the BERTScore metric
bertscore = load("bertscore")

# Model-generated candidate texts
candidates = [
    "The economy is improving with new policies.",
    "Artificial intelligence is shaping the future of technology.",
    "Deep learning models require a lot of data to perform well."
]

# Reference texts (human-written sentences)
references = [
    ["The financial situation is getting better due to recent reforms.", "Economic growth is driven by policy changes."],
    ["AI is transforming the way we interact with technology.", "The future of technology is being shaped by AI advancements."],
    ["Large datasets are necessary for training deep learning systems.", "Deep learning models rely on massive amounts of training data."]
]

# Compute BERTScore
results = bertscore.compute(predictions=candidates, references=references, lang="en")

# Print the results
print(f"Precision: {results['precision']}")
print(f"Recall: {results['recall']}")
print(f"F1 Score: {results['f1']}")

Example output scores (illustrative; actual values depend on the model and library version)

Precision: [0.92, 0.89, 0.91]
Recall: [0.91, 0.88, 0.90]
F1 Score: [0.91, 0.88, 0.90]

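Interpreting BERTScore

BERTScore uses pretrained BERT embeddings to measure similarity between tokens in a context-aware way.
Precision: how many tokens in the candidate have a close semantic match in the reference.
Recall: how much of the reference content is captured by the candidate.
F1 score: the balance of precision and recall.

When to Use BERTScore

Summarization and paraphrasing: it captures meaning better than ROUGE and METEOR, which makes it a good fit for abstractive summarization.
Conversational AI and chatbots: evaluating whether a response is semantically correct even when its wording differs from the human response.
Context-aware evaluation: unlike traditional metrics, BERTScore takes sentence meaning into account.
Multilingual support: suitable for multilingual NLP tasks.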
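The precision and recall described above come from greedy matching on cosine similarity. The sketch below shows that matching step with hypothetical 2-D vectors standing in for BERT token embeddings; the function names and vectors are assumptions for illustration, and the real metric additionally supports IDF weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(candidate_vecs, reference_vecs):
    """BERTScore-style precision, recall, and F1 from token embeddings."""
    # Precision: each candidate token takes its best match in the reference
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    # Recall: each reference token takes its best match in the candidate
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Hypothetical toy embeddings, one vector per token
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Because every token is matched to its nearest neighbor in embedding space, a paraphrase can score well even when it shares no surface words with the reference.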

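5. MoverScore (MOVEScore)

MoverScore is a semantic similarity metric for NLP tasks such as summarization, translation, and text generation. Unlike traditional methods such as BLEU, ROUGE, and METEOR, it compares generated and reference text in context, combining contextual embeddings from transformer models with a word mover's (earth mover's) distance.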

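# Assumption: as far as I know, the Hugging Face `evaluate` hub has no "movescore" metric.
# This sketch uses the reference `moverscore` package instead (pip install moverscore)
# and the `moverscore_v2` API as shown in its README; verify it against your installed version.
from collections import defaultdict
from moverscore_v2 import word_mover_score

# Model-generated candidate texts
candidates = [
    "The climate is changing rapidly due to human activities.",
    "Advancements in AI are transforming industries worldwide.",
    "Cybersecurity threats are increasing in the digital world."
]

# Reference texts (human-written sentences)
references = [
    ["Human activities are driving rapid climate change.", "Global warming is accelerating due to human intervention."],
    ["AI innovations are revolutionizing various industries.", "Artificial intelligence is reshaping global business operations."],
    ["Online security risks are rising in the digital age.", "Cyber threats are becoming more common in today's internet landscape."]
]

# Uniform IDF weights (every token weighted 1.0)
idf_dict_hyp = defaultdict(lambda: 1.0)
idf_dict_ref = defaultdict(lambda: 1.0)

# Score each candidate against each of its references and average the results
scores = []
for candidate, refs in zip(candidates, references):
    per_ref = word_mover_score(refs, [candidate] * len(refs), idf_dict_ref, idf_dict_hyp,
                               stop_words=[], n_gram=1, remove_subwords=True)
    scores.append(sum(per_ref) / len(per_ref))

# Print the results
print(f"MOVEScore: {scores}")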

Example output scores (illustrative; actual values depend on the model and library version)

MOVEScore: [0.89, 0.91, 0.87]

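Interpreting MoverScore

Scores typically fall between 0 and 1; a higher score means the candidate agrees more closely with the reference.

When to Use MoverScore

Text summarization: well suited to abstractive summarization, since it captures meaning even when the words differ.
Machine translation: unlike BLEU, MoverScore is context-aware, which makes it useful for evaluating translations.
Conversational AI and chatbots: evaluating how natural and coherent AI-generated responses are.
Modern NLP models: applicable to output from GPT-, T5-, and BERT-based models.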
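To give a feel for the "mover" idea, here is a sketch of a relaxed word mover's distance on hypothetical toy embeddings. It is not the exact optimal-transport problem MoverScore solves: each token simply moves all of its weight to its nearest counterpart, which gives a lower bound on the true distance. The function names, the uniform token weights, and the 1 / (1 + distance) mapping to a score are assumptions chosen for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(candidate_vecs, reference_vecs):
    """Relaxed word mover's distance with uniform token weights."""
    def one_way(src, dst):
        # Every source token ships all of its mass to the nearest destination token
        return sum(min(euclidean(s, d) for d in dst) for s in src) / len(src)
    # The larger of the two one-way costs is the tighter lower bound
    return max(one_way(candidate_vecs, reference_vecs),
               one_way(reference_vecs, candidate_vecs))

def mover_style_score(candidate_vecs, reference_vecs):
    """Map distance to a similarity in (0, 1]; identical inputs give 1.0."""
    return 1.0 / (1.0 + relaxed_wmd(candidate_vecs, reference_vecs))

# Hypothetical toy embeddings, one vector per token
close = mover_style_score([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.1], [1.0, 0.9]])
far = mover_style_score([[0.0, 0.0]], [[3.0, 4.0]])
print(f"close={close:.4f}, far={far:.4f}")
```

The score falls as the embeddings drift apart, regardless of whether any surface words match.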

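These metrics are the foundation for understanding how well an LLM generates human-like text. Evaluating an LLM, however, goes beyond numeric scores.

In Part 2 of this series, we look at human evaluation techniques and at fairness and bias evaluation, which are essential to ensuring that LLM-generated text is not only high quality but also ethically sound and unbiased. Stay tuned for how to measure bias, fairness, and human judgment effectively when evaluating LLMs!

📢 Read Part 2 👉 https://medium.com/@sumit.somanchd/testing-evaluating-large-language-models-llms-key-metrics-and-best-practices-part-2-0ac7092c9776

I'd love to hear about your experience evaluating chatbots and model responses.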