Introduction
Llama Stack is a set of standardized and opinionated interfaces for how canonical toolchain components (fine-tuning, synthetic data generation) and agentic applications should be built. We hope these interfaces are adopted across the ecosystem, which should make interoperability much easier.
Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to invoking AI agents in production. Beyond definition, we are developing open-source versions and partnering with cloud providers, ensuring developers can assemble AI solutions using consistent, interlocking components across platforms. The ultimate goal is to accelerate innovation in the AI space.
Installation
You can install this repository as a package with the following command:
pip install llama-stack
If you want to install from source:
mkdir -p ~/local
cd ~/local
git clone git@github.com:meta-llama/llama-stack.git
conda create -n stack python=3.10
conda activate stack
cd llama-stack
$CONDA_PREFIX/bin/pip install -e .
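If you want a quick sanity check that the install worked, a minimal Python sketch like the one below (standard library only) imports the package and prints its version; the distribution name llama-stack is assumed to match what pip installed above.
# quick sanity check that the llama-stack install succeeded
import importlib.metadata as md
import llama_stack  # raises ImportError if the install did not succeed
try:
    # "llama-stack" is assumed to be the installed distribution name (as used with pip above)
    print("llama-stack version:", md.version("llama-stack"))
except md.PackageNotFoundError:
    print("llama_stack imported, but its package metadata was not found")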
Getting Started
The llama CLI tool helps you set up and use the Llama toolchain and agentic systems. It should be available on your path after installing the llama-stack package.
This guide will help you quickly build and run a Llama Stack server in less than 5 minutes!
You may also check out this notebook to try out the demo scripts.
Quick Cheatsheet
- Quickly build and start a LlamaStack server using our Meta Reference implementation for all API endpoints, with conda as the build type.
llama stack build
- You will be prompted to enter the build information interactively.
llama stack build
> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
> Enter the image type you want your distribution to be built with (docker or conda): conda
Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
> Enter the API provider for the inference API: (default=meta-reference): meta-reference
> Enter the API provider for the safety API: (default=meta-reference): meta-reference
> Enter the API provider for the agents API: (default=meta-reference): meta-reference
> Enter the API provider for the memory API: (default=meta-reference): meta-reference
> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
> (Optional) Enter a short description for your Llama Stack distribution:
Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
You can now run `llama stack configure my-local-stack`
llama stack configure
- Run llama stack configure <name> with the name you previously defined in the build step.
- You will be prompted to enter configurations for your Llama Stack.
$ llama stack configure my-local-stack
Could not find my-local-stack. Trying conda build name instead...
Configuring API `inference`...
=== Configuring provider `meta-reference` for API inference...
Enter value for model (default: Llama3.1-8B-Instruct) (required):
Do you want to configure quantization? (y/n): n
Enter value for torch_seed (optional):
Enter value for max_seq_len (default: 4096) (required):
Enter value for max_batch_size (default: 1) (required):
Configuring API `safety`...
=== Configuring provider `meta-reference` for API safety...
Do you want to configure llama_guard_shield? (y/n): n
Do you want to configure prompt_guard_shield? (y/n): n
Configuring API `agents`...
=== Configuring provider `meta-reference` for API agents...
Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
Configuring SqliteKVStoreConfig:
Enter value for namespace (optional):
Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
Configuring API `memory`...
=== Configuring provider `meta-reference` for API memory...
> Please enter the supported memory bank type your provider has for memory: vector
Configuring API `telemetry`...
=== Configuring provider `meta-reference` for API telemetry...
> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
You can now run `llama stack run my-local-stack --port PORT`
llama stack run
- Run llama stack run <name> with the name you previously defined.
llama stack run my-local-stack
...
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
...
Finished model load YES READY
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /inference/embeddings
Serving POST /memory_banks/create
Serving DELETE /memory_bank/documents/delete
Serving DELETE /memory_banks/drop
Serving GET /memory_bank/documents/get
Serving GET /memory_banks/get
Serving POST /memory_bank/insert
Serving GET /memory_banks/list
Serving POST /memory_bank/query
Serving POST /memory_bank/update
Serving POST /safety/run_shield
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Serving GET /telemetry/get_trace
Serving POST /telemetry/log_event
Listening on :::5000
INFO: Started server process [587053]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
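Once the server log shows it is listening, a quick way to confirm the port is reachable before wiring up a client is a plain TCP check from Python; the host and port below assume the default localhost:5000 shown in the log above.
# confirm the Llama Stack server socket is accepting connections
# assumes the default host/port from the log above (localhost:5000)
import socket
HOST, PORT = "localhost", 5000
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Llama Stack server is reachable at {HOST}:{PORT}")
except OSError as err:
    print(f"Could not reach {HOST}:{PORT}: {err}")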
Step 1. Build
In the following steps, imagine we'll be working with a Meta-Llama3.1-8B-Instruct model. We will name our build 8b-instruct to help us remember the config. We will start building our distribution (in the form of a Conda environment or a Docker image). In this step, we will specify:
- name: the name for our distribution (e.g. 8b-instruct)
- image_type: our build image type (conda | docker)
- distribution_spec: our distribution spec for specifying API providers
  - description: a short description of the configurations for the distribution
  - providers: specifies the underlying implementation serving each API endpoint
  - image_type: conda | docker, to specify whether to build the distribution as a Docker image or a Conda environment
At the end of the build command, a <name>-build.yaml file storing the build configuration will be generated. Once this step is complete, the <name>-build.yaml file will be saved at the output file path specified at the end of the command.
Building from scratch
- For new users, we can start by running llama stack build, which launches an interactive wizard that prompts you for the build configuration.
llama stack build
Running the command above will let you fill in the configuration to build your Llama Stack distribution; you will see output like the following.
> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): 8b-instruct
> Enter the image type you want your distribution to be built with (docker or conda): conda
Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
> Enter the API provider for the inference API: (default=meta-reference): meta-reference
> Enter the API provider for the safety API: (default=meta-reference): meta-reference
> Enter the API provider for the agents API: (default=meta-reference): meta-reference
> Enter the API provider for the memory API: (default=meta-reference): meta-reference
> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
> (Optional) Enter a short description for your Llama Stack distribution:
Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/8b-instruct-build.yaml
Ollama (optional)
If you plan to use Ollama for inference, you will need to install the server by following these instructions.
Building from templates
- To build from alternative API providers, we provide distribution templates so you can get started with a distribution backed by different providers.
The following command lets you see the available templates and their corresponding providers.
llama stack build --list-templates
You may then pick a template to build your distribution with providers fitted to your liking.
llama stack build --template local-tgi --name my-tgi-stack
$ llama stack build --template local-tgi --name my-tgi-stack
...
...
Build spec configuration saved at ~/.conda/envs/llamastack-my-tgi-stack/my-tgi-stack-build.yaml
You may now run `llama stack configure my-tgi-stack` or `llama stack configure ~/.conda/envs/llamastack-my-tgi-stack/my-tgi-stack-build.yaml`
Building from a config file
- In addition to templates, you may customize the build to your liking by editing a config file and building from it with the command below.
- The config file will have contents similar to those in llama_stack/distributions/templates/.
$ cat llama_stack/distribution/templates/local-ollama-build.yaml
name: local-ollama
distribution_spec:
description: Like local, but use ollama for running LLM inference
providers:
inference: remote::ollama
memory: meta-reference
safety: meta-reference
agents: meta-reference
telemetry: meta-reference
image_type: conda
llama stack build --config llama_stack/distribution/templates/local-ollama-build.yaml
How to build a distribution as a Docker image
To build a Docker image, you may start from a template and use the --image-type docker flag to specify docker as the build image type.
llama stack build --template local --image-type docker --name docker-0
Alternatively, you may use a config file and set image_type to docker in your <name>-build.yaml file, then run llama stack build <name>-build.yaml. The <name>-build.yaml will have contents like the following:
name: local-docker-example
distribution_spec:
description: Use code from `llama_stack` itself to serve all llama stack APIs
docker_image: null
providers:
inference: meta-reference
memory: meta-reference-faiss
safety: meta-reference
agentic_system: meta-reference
telemetry: console
image_type: docker
The following command allows you to build a Docker image with the name <name>.
llama stack build --config <name>-build.yaml
Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
FROM python:3.10-slim
WORKDIR /app
...
...
You can run it with: podman run -p 8000:8000 llamastack-docker-local
Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
Step 2. Configure
After our distribution is built (either as a Docker image or a conda environment), we will run the following command to configure it:
llama stack configure [ <name> | <docker-image-name> | <path/to/name.build.yaml>]
- For conda environments: <path/to/name.build.yaml> will be the generated build spec saved from Step 1.
- For docker images downloaded from Dockerhub, you may also use <docker-image-name> as the argument.
  - Run docker images to check the list of available images on your machine.
$ llama stack configure 8b-instruct
Configuring API: inference (meta-reference)
Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (existing: 4096) (required):
Enter value for max_batch_size (existing: 1) (required):
Configuring API: memory (meta-reference-faiss)
Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): y
Entering sub-configuration for llama_guard_shield:
Enter value for model (default: Llama-Guard-3-8B) (required):
Enter value for excluded_categories (default: []) (required):
Enter value for disable_input_check (default: False) (required):
Enter value for disable_output_check (default: False) (required):
Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (default: Prompt-Guard-86M) (required):
Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):
Configuring API: telemetry (console)
YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
After this step is successful, you should be able to find a run configuration spec in ~/.llama/builds/conda/8b-instruct-run.yaml with the following contents. You may edit this file to change the settings.
As you can see, we did basic configuration above and configured:
- inference running on the Meta-Llama3.1-8B-Instruct model (obtained from llama model list)
- the Llama Guard safety shield with the Llama-Guard-3-8B model
- the Prompt Guard safety shield with the Prompt-Guard-86M model
To see how these configurations are stored as YAML, check the file printed at the end of the configure step.
Note that all configurations as well as models are stored in ~/.llama.
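If you would rather inspect the generated run configuration programmatically than open the file by hand, a small sketch like the one below loads and pretty-prints it; it assumes PyYAML is available in your environment and reuses the path printed at the end of the configure step.
# print the run configuration generated by `llama stack configure`
# assumes PyYAML is installed; the path is the one printed at the end of the configure step
from pathlib import Path
import json
import yaml
run_config_path = Path("~/.llama/builds/conda/8b-instruct-run.yaml").expanduser()
with run_config_path.open() as f:
    run_config = yaml.safe_load(f)
# dump as JSON purely for readable, consistent formatting
print(json.dumps(run_config, indent=2, default=str))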
Step 3. Run
Now, let's start the Llama Stack distribution server. You will need the YAML configuration file that was written out at the end of the llama stack configure step.
llama stack run 8b-instruct
You should see the Llama Stack server start and print the APIs that it supports:
$ llama stack run 8b-instruct
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 19.28 seconds
NCCL version 2.20.5+cuda12.4
Finished model load YES READY
Serving POST /inference/batch_chat_completion
Serving POST /inference/batch_completion
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shield
Serving POST /agentic_system/memory_bank/attach
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/memory_bank/detach
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Listening on :::5000
INFO: Started server process [453333]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
The configuration is in ~/.llama/builds/local/conda/8b-instruct-run.yaml. Feel free to increase max_seq_len.
The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
You might need to use the --disable-ipv6 flag to disable IPv6 support.
This server is running a Llama model locally.
Step 4. Test with Client
Once the server is set up, we can test it with a client to see example outputs.
cd /path/to/llama-stack
conda activate <env> # any environment containing the llama-stack pip package will work
python -m llama_stack.apis.inference.client localhost 5000
This will run the chat completion client and query the distribution's /inference/chat_completion API.
Here is an example output:
User>hello world, write me a 2 sentence poem about the moon
Assistant> Here's a 2-sentence poem about the moon:
The moon glows softly in the midnight sky,
A beacon of wonder, as it passes by.
Similarly, you can test safety (if you configured llama-guard and/or prompt-guard shields) with:
python -m llama_stack.apis.safety.client localhost 5000
You can find more example scripts with client SDKs to talk with the Llama Stack server in our llama-stack-apps repo.
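If you prefer to call the REST endpoints directly instead of the bundled client modules, here is a rough sketch that POSTs to the /inference/chat_completion route listed in the server log. The request body fields (model, messages, stream) and the response shape are assumptions that may differ across Llama Stack versions, and the requests package is assumed to be installed.
# call the /inference/chat_completion endpoint directly over HTTP
# NOTE: the request fields below are assumptions and may vary by version;
# the endpoint path itself comes from the server's startup log.
import requests
URL = "http://localhost:5000/inference/chat_completion"
payload = {
    "model": "Meta-Llama3.1-8B-Instruct",  # the model configured in Step 2
    "messages": [
        {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
    ],
    "stream": False,  # assumed flag for a non-streaming reply
}
resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())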