1. Install the libraries and load the dataset
!pip install -q langchain==0.1.4
!pip install -q langchain-openai==0.0.5
!pip install -q langchainhub==0.1.14
!pip install -q datasets==2.16.1
!pip install -q chromadb==0.4.22
We will download the dataset from the Hugging Face datasets library. It is a dataset containing information about diseases.
from datasets import load_dataset
data = load_dataset("keivalya/MedQuad-MedicalQnADataset", split='train')
data = data.to_pandas()
data.head(10)
 | qtype | Question | Answer |
---|---|---|---|
0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti… | LCMV infections can occur after exposure to fr… |
1 | symptoms | What are the symptoms of Lymphocytic Choriomen… | LCMV is most commonly recognized as causing ne… |
2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti… | Individuals of all ages who come into contact … |
3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (… | During the first phase of the disease, the mos… |
4 | treatment | What are the treatments for Lymphocytic Chorio… | Aseptic meningitis, encephalitis, or meningoen… |
5 | prevention | How to prevent Lymphocytic Choriomeningitis (L… | LCMV infection can be prevented by avoiding co… |
6 | information | What is (are) Parasites – Cysticercosis ? | Cysticercosis is an infection caused by the la… |
7 | susceptibility | Who is at risk for Parasites – Cysticercosis? ? | Cysticercosis is an infection caused by the la… |
8 | exams and tests | How to diagnose Parasites – Cysticercosis ? | If you think that you may have cysticercosis, … |
9 | treatment | What are the treatments for Parasites – Cystic… | Some people with cysticercosis do not need to |
# Limit the dataset to its first 100 rows; comment this line out to use the full dataset.
data = data[0:100]
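As you can see, the medical information in the dataset is well organized, and to someone like me who is not an expert in the field it looks very valuable. It could serve as a useful complement to any general medical reference book in support of primary care physicians.
Import the LangChain classes needed to load the documents.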
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma
The document text lives in the Answer column; the other columns become metadata.
df_loader = DataFrameLoader(data, page_content_column="Answer")
df_document = df_loader.load()
display(df_document[:2])
[Document(page_content='LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents. Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.', metadata={'qtype': 'susceptibility', 'Question': 'Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?'}),
Document(page_content='LCMV is most commonly recognized as causing neurological disease, as its name implies, though infection without symptoms or mild febrile illnesses are more common clinical manifestations. \n \nFor infected persons who do become ill, onset of symptoms usually occurs 8-13 days after exposure to the virus as part of a biphasic febrile illness. This initial phase, which may last as long as a week, typically begins with any or all of the following symptoms: fever, malaise, lack of appetite, muscle aches, headache, nausea, and vomiting. Other symptoms appearing less frequently include sore throat, cough, joint pain, chest pain, testicular pain, and parotid (salivary gland) pain. \n \nFollowing a few days of recovery, a second phase of illness may occur. Symptoms may consist of meningitis (fever, headache, stiff neck, etc.), encephalitis (drowsiness, confusion, sensory disturbances, and/or motor abnormalities, such as paralysis), or meningoencephalitis (inflammation of both the brain and meninges). LCMV has also been known to cause acute hydrocephalus (increased fluid on the brain), which often requires surgical shunting to relieve increased intracranial pressure. In rare instances, infection results in myelitis (inflammation of the spinal cord) and presents with symptoms such as muscle weakness, paralysis, or changes in body sensation. An association between LCMV infection and myocarditis (inflammation of the heart muscles) has been suggested. \n \nPrevious observations show that most patients who develop aseptic meningitis or encephalitis due to LCMV survive. No chronic infection has been described in humans, and after the acute phase of illness, the virus is cleared from the body. However, as in all infections of the central nervous system, particularly encephalitis, temporary or permanent neurological damage is possible. Nerve deafness and arthritis have been reported. 
\n \nWomen who become infected with LCMV during pregnancy may pass the infection on to the fetus. Infections occurring during the first trimester may result in fetal death and pregnancy termination, while in the second and third trimesters, birth defects can develop. Infants infected In utero can have many serious and permanent birth defects, including vision problems, mental retardation, and hydrocephaly (water on the brain). Pregnant women may recall a flu-like illness during pregnancy, or may not recall any illness. \n \nLCM is usually not fatal. In general, mortality is less than 1%.', metadata={'qtype': 'symptoms', 'Question': 'What are the symptoms of Lymphocytic Choriomeningitis (LCM) ?'})]
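The mapping from table rows to documents can be sketched without LangChain. This is only an illustration of the idea, not the loader's implementation: the `Doc` class and `rows_to_docs` function are hypothetical stand-ins, and the two rows are invented toy data.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows, page_content_column):
    """One column becomes the text; every other column becomes metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(Doc(page_content=row[page_content_column], metadata=metadata))
    return docs


# Invented toy rows with the same column names as the dataset.
rows = [
    {"qtype": "symptoms", "Question": "What are the symptoms of X?",
     "Answer": "Fever and headache."},
    {"qtype": "treatment", "Question": "What are the treatments for X?",
     "Answer": "Rest and fluids."},
]

docs = rows_to_docs(rows, page_content_column="Answer")
print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the question and its type as metadata means they travel with the text and can be used later for filtering or display.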
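We can split the documents into chunks. How large to make each chunk is a design decision: the larger the chunks, the larger the prompt and the slower the model's response.
We also need to keep the maximum prompt size in mind and make sure the documents do not exceed it.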
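A back-of-the-envelope check makes this trade-off concrete. Every number below except the chunk size is an assumption chosen for illustration: four retrieved chunks, roughly four characters per token for English text, a hypothetical 4096-token model limit, and 1000 tokens reserved for the question, instructions, and answer.

```python
CHUNK_SIZE_CHARS = 1250       # chunk size used in this tutorial
RETRIEVED_CHUNKS = 4          # assumption: the retriever returns 4 chunks
CHARS_PER_TOKEN = 4           # assumption: rough heuristic for English text
CONTEXT_LIMIT_TOKENS = 4096   # assumption: hypothetical model limit
RESERVED_TOKENS = 1000        # assumption: question, instructions, answer

# Approximate number of tokens taken up by the retrieved context.
context_tokens = RETRIEVED_CHUNKS * CHUNK_SIZE_CHARS // CHARS_PER_TOKEN

# The prompt fits if context plus the reserved budget stays under the limit.
fits = context_tokens + RESERVED_TOKENS <= CONTEXT_LIMIT_TOKENS

print(context_tokens, fits)
```

Under these assumptions the context takes about 1250 tokens, so it fits comfortably; doubling either the chunk size or the number of retrieved chunks would still fit, but with much less headroom.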
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1250,
separator="\n",
chunk_overlap=100)
texts = text_splitter.split_documents(df_document)
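We see warnings because the splitter cannot always produce chunks of the requested size: it only cuts at the separator (a line break here), and does so wherever it can.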
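The behavior behind those warnings can be reproduced with a simplified splitter. This sketch is not LangChain's implementation (it ignores chunk overlap, for example); `split_on_separator` is a hypothetical helper that cuts only at the separator and merges pieces greedily.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` chars."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece with no separator inside cannot be cut any further,
            # so it may exceed chunk_size on its own.
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short line\n" + "x" * 40 + "\nanother short line"
chunks = split_on_separator(text, separator="\n", chunk_size=20)
oversized = [c for c in chunks if len(c) > 20]

print([len(c) for c in chunks])
print(len(oversized))
```

The 40-character line contains no line break, so it stays in one chunk even though the requested size is 20; a long paragraph in an answer has the same effect in the tutorial.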
first_doc = texts[1]
print(first_doc.page_content)
LCMV is most commonly recognized as causing neurological disease, as its name implies, though infection without symptoms or mild febrile illnesses are more common clinical manifestations.
For infected persons who do become ill, onset of symptoms usually occurs 8-13 days after exposure to the virus as part of a biphasic febrile illness. This initial phase, which may last as long as a week, typically begins with any or all of the following symptoms: fever, malaise, lack of appetite, muscle aches, headache, nausea, and vomiting. Other symptoms appearing less frequently include sore throat, cough, joint pain, chest pain, testicular pain, and parotid (salivary gland) pain
2. Initialize the embedding model and the vector database
We load the text-embedding-ada-002 model from OpenAI.
from getpass import getpass
OPENAI_API_KEY = getpass("OpenAI API Key: ")
## OpenAI API Key: ··········
from langchain_openai import OpenAIEmbeddings
model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(
model=model_name,
openai_api_key=OPENAI_API_KEY
)
Running this cell can take 3 to 5 minutes. If you want it to go faster, reduce the number of records in the dataset.
directory_cdb = '/content/drive/MyDrive/chromadb'
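# Note: this indexes the unsplit documents (df_document); pass `texts`
# instead to index the chunks produced by the text splitter.
chroma_db = Chroma.from_documents(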
df_document, embed, persist_directory=directory_cdb
)
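Conceptually, the vector store keeps one embedding per document and returns the documents whose embeddings are closest to the query's. The sketch below uses cosine similarity over hand-written three-dimensional vectors; the vectors, the labels, and the `top_k` helper are all invented for illustration, and the metric and index Chroma actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": real ones have hundreds or thousands of dimensions.
store = [
    ("LCM symptoms", [0.9, 0.1, 0.0]),
    ("Cysticercosis treatment", [0.1, 0.9, 0.0]),
    ("LCM prevention", [0.7, 0.0, 0.3]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, store, k=2)
print(results)
```

A real store replaces the linear scan with an approximate nearest-neighbor index, which is why building it takes a while but querying it is fast.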
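We will create three objects:
The language model, which can be any of the OpenAI models.
The memory, responsible for keeping the prompts and all the necessary history.
The retriever, used to fetch the information stored in ChromaDB.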
from langchain_openai import OpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
llm=OpenAI(openai_api_key=OPENAI_API_KEY,
temperature=0.0)
conversational_memory = ConversationBufferWindowMemory(
memory_key='chat_history',
k=4, # Number of past exchanges (question and answer) kept in memory
return_messages=True # Return the history as message objects rather than one string
)
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=chroma_db.as_retriever()
)
We can try the retrieval chain on its own to see whether the information it returns is relevant.
qa.run("What is the main symptom of LCM?")
' The main symptom of LCM is a biphasic febrile illness, which includes symptoms such as fever, malaise, lack of appetite, muscle aches, headache, nausea, and vomiting.'
Perfect! The returned information is exactly what we wanted.
3. Create the agent
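The "stuff" chain type used above places all retrieved documents into a single prompt. A minimal sketch of that idea follows; the template wording and the `build_stuff_prompt` helper are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened stand-ins for retrieved passages.
retrieved = [
    "LCMV is most commonly recognized as causing neurological disease.",
    "Onset of symptoms usually occurs 8-13 days after exposure.",
]

stuffed_prompt = build_stuff_prompt("What is the main symptom of LCM?", retrieved)
print(stuffed_prompt)
```

Because everything goes into one prompt, this approach is simple and needs only one model call, but the prompt grows with the number and size of the retrieved documents.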
from langchain.agents import Tool
#Defining the list of tool objects to be used by LangChain.
tools = [
Tool(
name='Medical KB',
func=qa.run,
description=(
"""use this tool when answering medical knowledge queries to get
more information about the topic"""
)
)
]
from langchain.agents import create_react_agent
from langchain import hub
prompt = hub.pull("hwchase17/react-chat")
agent = create_react_agent(
tools=tools,
llm=llm,
prompt=prompt,
)
# Create an agent executor by passing in the agent and tools
from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(agent=agent,
tools=tools,
verbose=True,
memory=conversational_memory,
max_iterations=30,
max_execution_time=600,
handle_parsing_errors=True
)
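The executor's job is to read the model's text at each step and decide whether to call a tool or stop. The sketch below parses the same `Action:`, `Action Input:`, and `Final Answer:` markers that appear in the traces in this article; it is a simplified illustration, not LangChain's output parser, and `parse_step` is a hypothetical helper.

```python
import re

# Matches "Action: <tool>" followed on the next line by "Action Input: <input>".
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_step(llm_text):
    """Classify one model output as a tool call or a final answer."""
    if "Final Answer:" in llm_text:
        answer = llm_text.split("Final Answer:", 1)[1].strip()
        return {"type": "finish", "output": answer}
    match = ACTION_RE.search(llm_text)
    if match:
        return {
            "type": "tool",
            "tool": match.group(1).strip(),
            "input": match.group(2).strip(),
        }
    # Roughly the situation handle_parsing_errors is meant to cover.
    raise ValueError("Could not parse model output")


tool_step = parse_step(
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Medical KB\n"
    "Action Input: Botulism"
)
final_step = parse_step(
    "Thought: Do I need to use a tool? No\n"
    "Final Answer: Clark Kent is the secret identity of the superhero Superman."
)

print(tool_step)
print(final_step)
```

A full loop would run the named tool on a tool step, append the result to the prompt as an observation, and call the model again until it produces a final answer or a limit such as `max_iterations` is reached.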
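3.1 Using the conversational agent
To run a query, we simply call the agent directly. First, I will try a request that has nothing to do with the medical domain.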
agent_executor.invoke({"input": "Give me the area of square of 2x2"})
> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: Medical KB
Action Input: Area of square I don't know.Do I need to use a tool? No
Final Answer: The area of a square with sides of 2 units is 4 square units.
> Finished chain.
{'input': 'Give me the area of square of 2x2',
'chat_history': [],
'output': 'The area of a square with sides of 2 units is 4 square units.'}
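The model gives the right answer. Note, however, that the trace shows the agent did try the Medical KB tool first; the tool returned "I don't know", and the model then answered from its own knowledge.
Now I will try another question that is unrelated to health.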
agent_executor.invoke({"input": "Do you know who is Clark Kent?"})
> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
Final Answer: Clark Kent is the secret identity of the superhero Superman.
> Finished chain.
{'input': 'Do you know who is Clark Kent?',
'chat_history': [HumanMessage(content='Give me the area of square of 2x2'),
AIMessage(content='The area of a square with sides of 2 units is 4 square units.')],
'output': 'Clark Kent is the secret identity of the superhero Superman.'}
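Here the agent did not call the tool, because the model recognized that the question is unrelated to the database made available through LangChain.
Now it is time to try a medical question. Let's see whether the model understands that it should first look for information in the vector database available to it.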
agent_executor.memory.clear()
agent_executor.invoke({"input": """I have a patient that can have Botulism,
how can I confirm the diagnosis?"""})
> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: Medical KB
Action Input: Botulism Botulism is a rare but serious paralytic illness caused by a nerve toxin produced by certain bacteria. It can be contracted through contaminated food, wounds, or ingestion of bacterial spores. Symptoms include muscle paralysis, difficulty swallowing, and respiratory failure. Treatment includes antitoxin, supportive care, and removal of the source of the toxin.Do I need to use a tool? No
Final Answer: To confirm the diagnosis, you can perform a physical exam and order laboratory tests, such as a stool or blood test, to detect the presence of the bacteria or its toxin. It is important to act quickly, as botulism can be life-threatening if left untreated.
> Finished chain.
{'input': 'I have a patient that can have Botulism,\nhow can I confirm the diagnosis?',
'chat_history': [],
'output': 'To confirm the diagnosis, you can perform a physical exam and order laboratory tests, such as a stool or blood test, to detect the presence of the bacteria or its toxin. It is important to act quickly, as botulism can be life-threatening if left untreated.'}
Perfect. The most important thing for us is that it recognized it should search the medical database for information on the illness.
agent_executor.invoke({"input": "Is this an important illness?"})
> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
Final Answer: Yes, botulism is a serious illness that can be life-threatening if left untreated. It is important to seek medical attention and confirm the diagnosis as soon as possible.
> Finished chain.
{'input': 'Is this an important illness?',
'chat_history': [HumanMessage(content='I have a patient that can have Botulism,\nhow can I confirm the diagnosis?'),
AIMessage(content='To confirm the diagnosis, you can perform a physical exam and order laboratory tests, such as a stool or blood test, to detect the presence of the bacteria or its toxin. It is important to act quickly, as botulism can be life-threatening if left untreated.')],
'output': 'Yes, botulism is a serious illness that can be life-threatening if left untreated. It is important to seek medical attention and confirm the diagnosis as soon as possible.'}
The memory also works well. We can continue the conversation, since the model knows the previous questions and answers.
4. Conclusion
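A windowed memory like the one configured earlier can be pictured as a fixed-length queue of exchanges. The `WindowMemory` class below is a hypothetical sketch, not the LangChain class; it assumes `k` counts exchanges (one human message plus one AI message).

```python
from collections import deque


class WindowMemory:
    """Keep only the last k exchanges of a conversation."""

    def __init__(self, k):
        # deque(maxlen=k) silently drops the oldest exchange when full.
        self.exchanges = deque(maxlen=k)

    def save(self, user_input, ai_output):
        self.exchanges.append((user_input, ai_output))

    def history(self):
        """Flatten the stored exchanges into an ordered list of messages."""
        messages = []
        for user_input, ai_output in self.exchanges:
            messages.append(("human", user_input))
            messages.append(("ai", ai_output))
        return messages

    def clear(self):
        self.exchanges.clear()


memory = WindowMemory(k=2)
memory.save("Q1", "A1")
memory.save("Q2", "A2")
memory.save("Q3", "A3")  # pushes the first exchange out of the window

history = memory.history()
print(history)
```

The window keeps the prompt from growing without bound, at the cost of the model forgetting anything older than the last `k` exchanges.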
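The experiment was a small success. The vector database has been configured and populated with information from the dataset. A LangChain agent has been created, and it can decide when to retrieve information from the database. And don't forget that our chatbot has memory.
All of this took only a few lines of code!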