1. How to Evaluate Large Language Models for Summarization Using ROUGE #
We evaluate large language models quite differently from traditional machine learning models, which typically rely on metrics such as accuracy, F1 score, or recall.
Metrics for generated language vary: depending on the application, different metrics are chosen to assess a model's performance.
In this notebook, we explore how to use the ROUGE metric to measure the quality of the summaries generated by a language model.
2. What is ROUGE? #
ROUGE is not a single metric; it is a set of metrics that measure the overlap and similarity between a generated summary and a reference summary used as the baseline.
It returns four separate metrics. The metrics provided are:
ROUGE-1: measures the overlap of unigrams, i.e., individual words.
ROUGE-2: measures the overlap of bigrams, i.e., pairs of words.
ROUGE-L: measures the longest common subsequence (LCS), rewarding longer shared sequences between the generated and reference summaries.
ROUGE-LSUM: computed from the length of the LCS relative to the combined length of the generated and reference summaries, applied at summary level (the texts are split into sentences on newlines before scoring).
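To see what these four values look like in practice, here is a minimal sketch on a toy pair of texts. The texts and the expected numbers are purely illustrative; it uses the evaluate and rouge_score packages we install below.
import evaluate

# Load the ROUGE metric from Hugging Face's evaluate library.
rouge = evaluate.load("rouge")

# A toy generated summary and its reference (illustrative only).
predictions = ["the cat sat on the mat"]
references = ["the cat lay on the mat"]

# compute() returns rouge1, rouge2, rougeL and rougeLsum in one call.
print(rouge.compute(predictions=predictions, references=references))
# Expected shape of the output (approximate values):
# {'rouge1': 0.83, 'rouge2': 0.6, 'rougeL': 0.83, 'rougeLsum': 0.83}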
3. What are we going to do? #
We will use two T5 models: the base t5-base model, and a version of t5-base fine-tuned specifically for summarization.
First, we will take a dataset and generate summaries with both models. By comparing the two sets of generated summaries, we can observe whether fine-tuning actually produced different results. In other words, at this stage we can only establish that the two models differ noticeably in how they summarize; we cannot yet tell which one performs better.
To determine which model generates better summaries, we will use the well-known cnn_dailymail dataset, available in the datasets library.
This dataset contains reference summaries we can compare against. We will evaluate the summaries generated by both models against these references.
The model that obtains the higher ROUGE scores will be considered the one that generates better summaries.
4. The Models #
t5-base fine-tuned: https://huggingface.co/flax-community/t5-base-cnn-dm
t5-base: https://huggingface.co/t5-base
!pip install -q evaluate==0.4.2
!pip install -q transformers==4.42.4
!pip install -q rouge_score==0.1.2
!pip install kaggle
#!pip install -q datasets==2.1.0
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
5. Loading the Data #
#Import generic libraries
import numpy as np
import pandas as pd
import torch
The dataset is available on Kaggle and contains a collection of MIT technology news articles published up to 2023. The text of each article is in the "Article Body" column. https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023
5.1 Importing the Dataset from Kaggle #
You only need the articles.csv file from the dataset, so you can simply download it and load it directly. If you prefer to use the Kaggle API, you can use the code below; in that case, you need your kaggle.json key file in the directory /content/drive/MyDrive/kaggle.
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d deepanshudalal09/mit-ai-news-published-till-2023
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.6.17 / client 1.6.14)
Dataset URL: https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023
License(s): unknown
Downloading mit-ai-news-published-till-2023.zip to /content
0% 0.00/1.90M [00:00<?, ?B/s]
100% 1.90M/1.90M [00:00<00:00, 128MB/s]
import zipfile

file_path = '/content/mit-ai-news-published-till-2023.zip'
#Extract the downloaded archive into the Kaggle working directory
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/kaggle')
5.2 Loading the Dataset #
news = pd.read_csv('/content/drive/MyDrive/kaggle/articles.csv')
DOCUMENT="Article Body"
#Since this is just a course, we select only a small sample of the news.
MAX_NEWS = 3
subset_news = news.head(MAX_NEWS)
subset_news.head()
| | Unnamed: 0 | Published Date | Author | Source | Article Header | Sub_Headings | Article Body | Url |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | July 7, 2023 | Adam Zewe | MIT News Office | Learning the language of molecules to predict … | This AI system only needs a small amount of da… | [‘Discovering new materials and drugs typicall… | https://news.mit.edu/2023/learning-language-mo… |
| 1 | 1 | July 6, 2023 | Alex Ouyang | Abdul Latif Jameel Clinic for Machine Learning… | MIT scientists build a system that can generat… | BioAutoMATED, an open-source, automated machin… | [‘Is it possible to build machine-learning mod… | https://news.mit.edu/2023/bioautomated-open-so… |
| 2 | 2 | June 30, 2023 | Jennifer Michalowski | McGovern Institute for Brain Research | When computer vision works more like a brain, … | Training artificial neural networks with data … | [‘From cameras to self-driving cars, many of t… | https://news.mit.edu/2023/when-computer-visi |
articles = subset_news[DOCUMENT].tolist()
6. Loading the Models and Creating the Summaries #
Both models are available on Hugging Face, so we will work with the Transformers library.
model_name_base = "t5-base"
model_name_finetuned = "flax-community/t5-base-cnn-dm"
#This function returns the tokenizer and the Model.
def get_model(model_id):
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
return tokenizer, model
tokenizer_base, model_base = get_model(model_name_base)
tokenizer_finetuned, model_finetuned = get_model(model_name_finetuned)
With both models downloaded and ready, we create a function that performs the summarization.
The function takes four parameters:
- the list of texts to summarize.
- the tokenizer.
- the model.
- the maximum length for the generated summary.
import time
def create_summaries(texts_list, tokenizer, model, max_l=125):
# We are going to add a prefix to each article to be summarized
# so that the model knows what it should do
prefix = "Summarize this news: "
summaries_list = [] #Will contain all summaries
texts_list = [prefix + text for text in texts_list]
for text in texts_list:
summary=""
#calculate the encodings
input_encodings = tokenizer(text,
max_length=1024,
return_tensors='pt',
padding=True,
truncation=True)
# Generate summaries
start = time.time()
output = model.generate(
input_ids=input_encodings.input_ids,
attention_mask=input_encodings.attention_mask,
max_length=max_l, # Set the maximum length of the generated summary
num_beams=2, # Set the number of beams for beam search
early_stopping=True
)
#Decode to get the text
summary = tokenizer.batch_decode(output, skip_special_tokens=True)
end = time.time()
#Add the summary to summaries list
elapsed_time = end - start
print(f"Time taken: {elapsed_time:.3f} seconds")
summaries_list += summary
return summaries_list
To create the summaries, we call the create_summaries function, passing the news articles along with the corresponding tokenizer and model.
# Creating the summaries for both models.
summaries_base = create_summaries(articles,
tokenizer_base,
model_base)
summaries_finetuned = create_summaries(articles,
tokenizer_finetuned,
model_finetuned)
summaries_base
['MIT and MIT-Watson AI Lab have developed a unified framework. the system can simultaneously predict molecular properties and generate new molecules. it uses this grammar to construct viable molecules and predict their properties.',
'\'BioAutoMATED\' is an automated machine-learning system that can select and build an appropriate model for a given dataset. it can even take care of the laborious task of data preprocessing, whittling down a months-long process to just a few hours. \'"We want to lower these barriers for a lot of folks that want to use machine learning or biology," says first co-author Jacqueline Valeri.',
"MIT and IBM research scientists have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. 'we asked the artificial neural network to make the function of one of your inside simulated “neural” layers as similar as possible to the corresponding biological neural layer,' says MIT professor."]
summaries_finetuned
['Researchers created a machine-learning system that automatically learns the "language" of molecules using only a small, domain-specific dataset. The system learns to construct viable molecules and predict their properties. Computational design and Fabrication Group will be presented at the International Conference for Machine Learning.',
"Automated machine-learning system can select and build an appropriate model for a given dataset. 'BioAutoMATED' is an automated machine-learning system. The tool includes binary classification models, multi-class classification models, and more complex neural networks.",
"MIT and IBM researchers have found that artificial neural networks resemble the multilayered brain circuits that process visual information in humans and other primates. 'We asked it to do both of those things as well as the standard, computer vision approach,' said one expert. The network found to be more robust by training it to work like a part of the brain that humans rely on for object recognition."]
At first glance, it is clear that the summaries are different.
However, it is difficult to determine which one is better.
It is even hard to tell whether they are significantly different or merely differ in small details.
This is what we will now verify with ROUGE. When comparing one model's summaries against the other's, we do not learn which one is better; rather, we learn how much the summaries changed after fine-tuning the model.
7. ROUGE Evaluation #
Let's load the ROUGE evaluator.
#With the function load of the library evaluate
#we create a rouge_score object
rouge_score = evaluate.load("rouge")
Computing ROUGE is as simple as calling the compute function of the rouge_score object we created earlier. This function takes the texts to compare as arguments, plus a third value, use_stemmer, which indicates whether the comparison should be made using stems or whole words.
A stemmer reduces the different forms of a word to a common base. Some examples of stemming are:
- Jumping -> Jump.
- Running -> Run.
- Cats -> Cat.
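As a quick illustration (not part of the original pipeline), here is a minimal sketch of stemming with NLTK's PorterStemmer, which is available through the nltk import above:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumping", "running", "cats"]:
    #stem() reduces each word form to its base
    print(word, "->", stemmer.stem(word))
# jumping -> jump
# running -> run
# cats -> cat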
def compute_rouge_score(generated, reference):
    #rougeLsum is computed at summary level and expects the sentences of
    #each text separated by '\n', so we sentence-tokenize each text and
    #join the sentences with newlines before sending them to ROUGE.
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )
compute_rouge_score(summaries_base, summaries_finetuned)
{'rouge1': 0.47018752391886715,
'rouge2': 0.3209013209013209,
'rougeL': 0.34330271718331423,
'rougeLsum': 0.44692881745120555}
We can see that there are differences between the two models when summarizing.
For example, the ROUGE-1 similarity is 47%, while the ROUGE-2 similarity is 32%. This tells us the results differ: they share some content, but the differences are considerable.
However, we still do not know which model is better, since we compared them against each other rather than against a reference text. But at the very least, we know that the fine-tuning applied to the second model has significantly altered its output.
8. Comparing Against a Dataset with Real Summaries #
We will load the cnn_dailymail dataset, a well-known dataset from the datasets library that fits our purpose perfectly.
In addition to the news articles, it contains pre-existing reference summaries.
We will compare the summaries generated by the two models against the summaries in the dataset to determine which model creates summaries closer to the references.
from datasets import load_dataset
cnn_dataset = load_dataset("ccdv/cnn_dailymail", "3.0.0")
The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
#Get just a few news items to test
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS))
sample_cnn
Dataset({
features: ['article', 'highlights', 'id'],
num_rows: 3
})
We retrieve the maximum length of the reference summaries so that the models can, if they choose to, generate summaries of the same length. Note that len() here counts characters, while the max_length parameter of generate() counts tokens, so this value is only a rough budget.
max_length = max(len(item['highlights']) for item in sample_cnn)
max_length = max_length + 10
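As an aside, if you wanted a token-based budget instead of a character-based one, a sketch along these lines would do it (it reuses the already-loaded tokenizer_base; the variable name max_tokens is my own):
#Hypothetical alternative: measure the reference summaries in tokens,
#since generate()'s max_length is a token count, not a character count.
max_tokens = max(
    len(tokenizer_base(item["highlights"]).input_ids)
    for item in sample_cnn
)
print(max_tokens)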
summaries_t5_base = create_summaries(sample_cnn["article"],
tokenizer_base,
model_base,
max_l=max_length)
# Time taken: 16.551 seconds
# Time taken: 13.505 seconds
# Time taken: 13.629 seconds
summaries_t5_finetuned = create_summaries(sample_cnn["article"],
tokenizer_finetuned,
model_finetuned,
max_l=max_length)
# Time taken: 17.991 seconds
# Time taken: 9.375 seconds
#Time taken: 12.378 seconds
#Get the real summaries from the cnn_dataset
real_summaries = sample_cnn['highlights']
Let's look at the generated summaries alongside the reference summaries provided by the dataset.
summaries = pd.DataFrame.from_dict(
{
"base": summaries_t5_base,
"finetuned": summaries_t5_finetuned,
"reference": real_summaries,
}
)
summaries.head()
| | base | finetuned | reference |
|---|---|---|---|
| 0 | best died in hospice in Hickory, north Carolin… | Jimmie Best was “the most constantly creative … | James Best, who played the sheriff on “The Duk… |
| 1 | “it doesn’t matter what anyone says, he is pre… | Dr. Anthony Moschetto’s attorney calls the all… | A lawyer for Dr. Anthony Moschetto says the ch… |
| 2 | president Barack Obama took part in a roundtab… | President Obama says climate change is a publi… | “No challenge poses more of a public threat th.. |
Now we can compute the ROUGE scores for both models.
summaries_t5_base
['best died in hospice in Hickory, north Carolina, of complications from pneumonia. he played bumbling sheriff Rosco P. Coltrane on "the Dukes of Hazzard" he was born in Kentucky and raised in rural Indiana.',
'"it doesn\'t matter what anyone says, he is presumed to be innocent," attorney says. cardiologist\'s lawyer says allegations against his client are "completely unsubstantiated" prosecutors say he pleaded not guilty to all charges. he faces charges in connection with a plot to take out a rival doctor.',
'president Barack Obama took part in a roundtable discussion this week on climate change. he refocused on the issue from a public health vantage point. the average american can also do their part to reduce their own carbon footprint.']
real_summaries
['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .',
'A lawyer for Dr. Anthony Moschetto says the charges against him are baseless .\nMoschetto, 54, was arrested for selling drugs and weapons, prosecutors say .\nAuthorities allege Moschetto hired accomplices to burn down the practice of former associate .',
'"No challenge poses more of a public threat than climate change," the President says .\nHe credits the Clean Air Act with making Americans "a lot" healthier .']
compute_rouge_score(summaries_t5_base, real_summaries)
{'rouge1': 0.3050834824090638,
'rouge2': 0.07211128178870115,
'rougeL': 0.2095520274299344,
'rougeLsum': 0.2662418008348241}
compute_rouge_score(summaries_t5_finetuned, real_summaries)
{'rouge1': 0.31659149328289443,
'rouge2': 0.11065084340946411,
'rougeL': 0.22002036956205442,
'rougeLsum': 0.24877540132887144}
Based on these results, I would say the fine-tuned model performs slightly better than the t5-base model. It consistently achieves higher ROUGE scores across all metrics except LSUM, where the difference is minimal.
The ROUGE metrics are also quite interpretable.
LSUM indicates the proportion of the text covered by the longest common subsequence: the words that appear in both texts in the same relative order, though not necessarily contiguously.
This can be a good indicator of the overall similarity between texts. However, both models obtained very similar LSUM scores, while the fine-tuned model scored higher on the other ROUGE metrics.
Personally, I lean toward the fine-tuned model, although the difference may not be very large.
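To make the LCS idea behind ROUGE-L and LSUM concrete, here is a small illustrative sketch (not part of the original notebook) of the classic dynamic-programming computation of the LCS length over word sequences:
def lcs_length(a, b):
    #dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

generated = "the model summarizes the article well".split()
reference = "the model summarizes this news article very well".split()
#The LCS is "the model summarizes article well": 5 words, in order but not contiguous
print(lcs_length(generated, reference))  # 5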
9. Comparing Entities with ROUGE #
ROUGE can also be used to compare short lists of entities. Below we score the same four entities, first in a different order from the reference and then in the same order.
entities=['Paris, Londres, Barcelona, Reus']
entities_ref=['Reus, Paris, Londres, Barcelona']
compute_rouge_score(entities, entities_ref)
{'rouge1': 1.0,
'rouge2': 0.6666666666666666,
'rougeL': 0.75,
'rougeLsum': 0.75}
entities_ref=['Paris, Londres, Barcelona, Reus']
compute_rouge_score(entities, entities_ref)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
With the entities in the same order, every score is 1.0. In the reordered case above, ROUGE-1 remains 1.0 because unigram overlap ignores order, while ROUGE-2, ROUGE-L, and ROUGE-LSUM drop because bigrams and the longest common subsequence are order-sensitive.