Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Features ✨
- 🆓 Completely free and open-source
- 🚀 Blazing-fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously (see the sketch after this list)
- 🎨 Extracts and returns all media tags (images, audio, and video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from pages
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without an LLM using JsonCssExtractionStrategy
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support for enhanced privacy and access
- 🔄 Session management for complex multi-page crawling scenarios
- 🌐 Asynchronous architecture for improved performance and scalability
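As a quick illustration of two of the features above (concurrent multi-URL crawling and page screenshots), here is a minimal sketch. It assumes an `arun_many()` method and a `screenshot=True` flag that follow the same async API shown in the Quick Start below; adjust names to your installed version.

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Crawl several URLs concurrently in a single call.
        results = await crawler.arun_many(
            urls=[
                "https://www.nbcnews.com/business",
                "https://www.nbcnews.com/politics",
            ],
            bypass_cache=True,
        )
        for result in results:
            print(result.url, "->", len(result.markdown or ""), "markdown chars")

        # Capture a screenshot of a single page (returned as a base64 string).
        shot = await crawler.arun(url="https://www.nbcnews.com", screenshot=True)
        if shot.screenshot:
            with open("nbcnews.png", "wb") as f:
                f.write(base64.b64decode(shot.screenshot))

if __name__ == "__main__":
    asyncio.run(main())
```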
Installation
Crawl4AI offers flexible installation options to suit a variety of use cases. You can install it as a Python package or use Docker.
Using pip 🐍
Choose the installation option that best fits your needs:
Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
```
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can install it manually using one of these methods:
- Through the command line: `playwright install`
- If the above doesn't work, try this more specific command: `python -m playwright install chromium`
The second method has proven more reliable in some cases.
Synchronous Version Installation
If you need the synchronous version, which uses Selenium:
```bash
pip install crawl4ai[sync]
```
Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
Quick Start 🚀
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Advanced Usage 🔬
Executing JavaScript and Using CSS Selectors
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
Using a Proxy
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
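If your proxy requires authentication, the crawler can be pointed at it with a `proxy_config` dictionary instead of the plain `proxy` string. This is a minimal sketch; the `proxy_config` parameter name and its keys are assumptions based on the project's proxy documentation, so verify them against your installed version.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Assumed parameter: proxy_config with Playwright-style server/username/password keys.
    async with AsyncWebCrawler(
        verbose=True,
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "proxy_user",
            "password": "proxy_pass",
        },
    ) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```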
Extracting Structured Data Without an LLM
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
```
For more advanced usage examples, check out the Examples section of our documentation.
Extracting Structured Data with OpenAI
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
Session Management and Dynamic Content Crawling
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages of dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
```python
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
```
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to make sure the data has loaded before continuing.
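Hooks are not limited to waiting for content. The short sketch below uses the same `set_hook` mechanism to attach custom headers before each navigation, which can be handy for authenticated crawls. The `before_goto` hook name is taken from the project's hook documentation and should be treated as an assumption; only `on_execution_started` is confirmed by the example above.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Assumed hook: 'before_goto' receives the Playwright page object before navigation.
    async def before_goto(page):
        # Standard Playwright API: attach extra HTTP headers to subsequent requests.
        await page.set_extra_http_headers({"X-Example-Header": "demo-value"})
        return page

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('before_goto', before_goto)
        result = await crawler.arun(url="https://httpbin.org/headers", bypass_cache=True)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```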
Speed Comparison 🚀
Crawl4AI is designed with speed as a primary focus. Our goal is to deliver the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
- Time taken: 7.02 seconds
- Content length: 42074 characters
- Images found: 49

Crawl4AI (simple crawl):
- Time taken: 1.60 seconds
- Content length: 18238 characters
- Images found: 49

Crawl4AI (with JavaScript execution):
- Time taken: 4.64 seconds
- Content length: 40869 characters
- Images found: 89
As you can see, Crawl4AI significantly outperforms Firecrawl:
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
- With JavaScript execution: even while executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
Documentation 📚
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
Contributing 🤝
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
License 📄
Crawl4AI is released under the Apache 2.0 License.