1. Summary #
Earlier, in notebook 4_1_rouge_evaluations, we learned how to use ROUGE to evaluate which summary is closest to a human-written one. This time we will use embedding distance and LangSmith to check which model generates summaries most similar to the reference ones.
We will keep using the same dataset and models as in the ROUGE example: a dataset of CNN news articles with their human-written summaries, which serve as the reference, and two T5 models, one of which has been fine-tuned specifically for summarization.
#Loading Necessary Libraries
!pip install -q transformers==4.42.4
!pip install -q langchain==0.2.11
!pip install -q langchain-openai==0.1.19
!pip install -q langchainhub==0.1.20
!pip install -q datasets==2.20.0
!pip install -q huggingface-hub==0.23.5
!pip install -q langchain-community==0.2.10
#You need a LangChain API Key.
from getpass import getpass
import os
if 'LANGCHAIN_API_KEY' not in os.environ:
    os.environ["LANGCHAIN_API_KEY"] = getpass("LangChain API Key: ")
#LangChain API Key: ··········
if 'OPENAI_API_KEY' not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI API Key: ")
#OPENAI API Key: ··········
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
#os.environ["LANGCHAIN_PROJECT"]="langsmith_yttest01"
#Importing Client from Langsmith
from langsmith import Client
client = Client()
2. Creating the Dataset #
from datasets import load_dataset
cnn_dataset = load_dataset(
    "ccdv/cnn_dailymail", version="3.0.0",
    trust_remote_code=True
)
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['article']}"
    }
#cnn_dataset = cnn_dataset.map(add_prefix)
#Get just a few news articles to test
MAX_NEWS=3
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS)).map(add_prefix)
sample_cnn
Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 3
})
The dataset contains three columns: article, highlights, and id. To work with LangSmith, we need to create a dataset in LangSmith's format.
LangSmith expects a prompt and a result. We build the prompt by adding the prefix "Summarize this news:" to each article, and as the expected result we use the content of highlights, which is the human-written summary.
print(sample_cnn[0])
{'article': 'Summarize this news:\n(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.', 'highlights': 'James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .', 'id': '00200e794fa41d3f7ce92cbf43e9fd4cd652bb09'}
Now that we have the dataset with the prompts and the reference summaries, it is time to use this information to create a dataset in LangSmith.
3. Creating the Dataset in LangSmith #
A dataset in LangSmith is made up of inputs (the prompts that are passed to the model being evaluated) and outputs (which should contain what we expect the model to return).
#import uuid
import datetime
input_key=['article']
output_key=['highlights']
NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
#This creates the Dataset in LangSmith with the content of sample_cnn
dataset = client.upload_dataframe(
    df=sample_cnn,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)
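For reference, the same dataset could also be built with the lower-level LangSmith client calls instead of upload_dataframe. The sketch below is not part of the original notebook; it assumes the standard langsmith Client methods create_dataset and create_examples, and the _manual dataset name is just an illustrative placeholder.
#Hypothetical alternative: create the dataset example by example.
alt_dataset = client.create_dataset(
    dataset_name=f"{NAME_DATASET}_manual",
    description="Same content, created with create_dataset/create_examples"
)
client.create_examples(
    inputs=[{"article": row["article"]} for row in sample_cnn],
    outputs=[{"highlights": row["highlights"]} for row in sample_cnn],
    dataset_id=alt_dataset.id
)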
In this image we can see an example of the dataset once it has been registered in LangSmith.
The input column contains the prompt to be sent, while the output column stores the expected output.
When the comparison is run, the prompt is sent to the model and the cosine distance between its response and the response stored in the sample dataset is computed.
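Conceptually, that is all the embedding_distance evaluator does: it embeds the prediction and the reference and measures the distance between the two vectors. A minimal standalone sketch (not part of the original notebook; by default LangChain's evaluator will try to use OpenAI embeddings, so it relies on the OPENAI_API_KEY set earlier, and the prediction string here is just an invented example):
from langchain.evaluation import load_evaluator

#Embeds both strings and returns a distance; lower means closer to the reference.
distance_evaluator = load_evaluator("embedding_distance")
result = distance_evaluator.evaluate_strings(
    prediction="James Best, star of 'The Dukes of Hazzard,' died at 88 after a brief illness.",
    reference=sample_cnn[0]["highlights"]
)
print(result)  #returns a dict like {'score': <distance>}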
4. Retrieving the Models from Hugging Face #
Let's retrieve two models from Hugging Face: the base T5 model, and a T5 model fine-tuned for summarization using the training split of this same dataset.
from langchain import HuggingFaceHub
from getpass import getpass
hf_key = getpass("Hugging Face Key: ")
Hugging Face Key: ··········
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=hf_key
)
/usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py:139: LangChainDeprecationWarning: The class `HuggingFaceHub` was deprecated in LangChain 0.0.21 and will be removed in 0.3.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEndpoint`.
warn_deprecated(
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=hf_key
)
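The deprecation warning above points to the newer langchain-huggingface package. If you prefer to avoid the deprecated class, a roughly equivalent sketch would look like the following (untested here; parameter names such as max_new_tokens are assumptions based on HuggingFaceEndpoint's interface, not something the original notebook uses):
#pip install -U langchain-huggingface
from langchain_huggingface import HuggingFaceEndpoint

summarizer_base_v2 = HuggingFaceEndpoint(
    repo_id="t5-base",
    temperature=0.01,  #the HF Inference API may reject a temperature of exactly 0
    max_new_tokens=180,
    huggingfacehub_api_token=hf_key
)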
5. Defining the Evaluator #
The first step is to define an evaluator in which we specify what we want to measure. In our case, I chose to measure only "embedding_distance".
I left "string_distance" commented out, in case you want to run a test with two evaluations instead of one.
from langchain.smith import run_on_dataset, RunEvalConfig
!pip install -q rapidfuzz==3.6.1
#We are using just one of the multiple evaluators available in LangSmith.
evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        #RunEvalConfig.Criteria("conciseness"),
        #RunEvalConfig.Criteria("harmfulness"),
        #"string_distance"
    ],
)
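If you want explicit control over which embeddings and which distance metric are used, the "embedding_distance" string shortcut can be replaced with a config object. This is a sketch, assuming RunEvalConfig exposes an EmbeddingDistance config class as in LangChain 0.2.x; it is not part of the original notebook.
from langchain_openai import OpenAIEmbeddings
from langchain.evaluation import EmbeddingDistance as DistanceMetric

#Equivalent configuration, but spelling out the embeddings and the metric.
explicit_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.EmbeddingDistance(
            embeddings=OpenAIEmbeddings(),
            distance_metric=DistanceMetric.COSINE,
        ),
    ],
)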
6. Running the Evaluator #
With the same configuration, we can launch two evaluations against the same dataset, one for each of the selected models.
chain_models = [summarizer_base, summarizer_finetuned]
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)
View the evaluation results for project 'T5-BASE 2024-07-28 13:57:36' at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758/compare?selectedSessions=e1d9e763-3025-4324-bbc2-24b041d67f02
View all tests for Dataset Summarize_dataset_2024-07-28 13:57:02 at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758
[------------------------------------------------->] 3/3
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)
View the evaluation results for project 'T5-FineTuned 2024-07-28 13:57:44' at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758/compare?selectedSessions=dac25105-a6c5-48fb-ac47-8d6a0921c2e2
View all tests for Dataset Summarize_dataset_2024-07-28 13:57:02 at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758
[------------------------------------------------->] 3/3
In the image below you can see the comparison between the two tests.
Well, since it is this easy, why don't we try a comparison with an OpenAI model as well?
if 'OPENAI_API_KEY' not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI API Key: ")
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)
View the evaluation results for project 'OpenAI 2024-07-28 13:57:52' at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758/compare?selectedSessions=17493e77-b9f1-4f3d-9d23-79c99bcc83fc
View all tests for Dataset Summarize_dataset_2024-07-28 13:57:02 at:
https://smith.langchain.com/o/c67fe7fc-f385-519d-9a10-9464d2be3d9d/datasets/951ef736-0cb1-4d58-a4e2-d41da4d7b758
[------------------------------------------------->] 3/3
The experiment with the OpenAI model achieved the best result. But be careful! Since we are calling its API, every run has a cost.
Another key point is that we can also review the models' performance data, which is useful as a minimal evaluation of our inference servers.
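As a closing sketch (not in the original notebook, and assuming the standard langsmith Client methods list_runs and list_feedback), the scores and latencies can also be pulled back programmatically instead of being read in the LangSmith UI:
#Inspect latency and evaluator feedback for the last project that was run.
runs = list(client.list_runs(project_name=project_name, is_root=True))
for run in runs:
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    feedback = client.list_feedback(run_ids=[run.id])
    scores = {f.key: f.score for f in feedback}
    print(run.id, latency, scores)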