Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Features ✨
- 🆓 Completely free and open-source
- 🚀 Blazing-fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously (see the sketch after this list)
- 🎨 Extracts and returns all media tags (images, audio, and video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from pages
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without an LLM using JsonCssExtractionStrategy
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support for enhanced privacy and access
- 🔄 Session management for complex multi-page crawling scenarios
- 🌐 Asynchronous architecture for improved performance and scalability
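As a quick illustration of two of the features above (concurrent multi-URL crawling and page screenshots), here is a minimal sketch. It assumes an `arun_many()` method and a `screenshot=True` flag that follow the same async API shown in the Quick Start below; adjust names to your installed version.

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Crawl several URLs concurrently in a single call.
        results = await crawler.arun_many(
            urls=[
                "https://www.nbcnews.com/business",
                "https://www.nbcnews.com/politics",
            ],
            bypass_cache=True,
        )
        for result in results:
            print(result.url, "->", len(result.markdown or ""), "markdown chars")

        # Capture a screenshot of a single page (returned as a base64 string).
        shot = await crawler.arun(url="https://www.nbcnews.com", screenshot=True)
        if shot.screenshot:
            with open("nbcnews.png", "wb") as f:
                f.write(base64.b64decode(shot.screenshot))

if __name__ == "__main__":
    asyncio.run(main())
```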
Installation
Crawl4AI offers flexible installation options to suit a variety of use cases. You can install it as a Python package or use Docker.
Using pip 🐍
Choose the installation option that best fits your needs:
Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
```
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can install it manually using one of these methods:
- Through the command line: `playwright install`
- If the above doesn't work, try this more specific command: `python -m playwright install chromium`
The second method has proven more reliable in some cases.
Synchronous Version Installation
If you need the synchronous version, which uses Selenium:
```bash
pip install crawl4ai[sync]
```
Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
Quick Start 🚀
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Advanced Usage 🔬
Executing JavaScript and Using CSS Selectors
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
Using a Proxy
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
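If your proxy requires authentication, the crawler can be pointed at it with a `proxy_config` dictionary instead of the plain `proxy` string. This is a minimal sketch; the `proxy_config` parameter name and its keys are assumptions based on the project's proxy documentation, so verify them against your installed version.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Assumed parameter: proxy_config with Playwright-style server/username/password keys.
    async with AsyncWebCrawler(
        verbose=True,
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "proxy_user",
            "password": "proxy_pass",
        },
    ) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```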
Extracting Structured Data Without an LLM
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
```
For more advanced usage examples, check out the Examples section of our documentation.
Extracting Structured Data with OpenAI
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
Session Management and Dynamic Content Crawling
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages of dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
```python
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
```
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to make sure the data has loaded before continuing.
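Hooks are not limited to waiting for content. The short sketch below uses the same `set_hook` mechanism to attach custom headers before each navigation, which can be handy for authenticated crawls. The `before_goto` hook name is taken from the project's hook documentation and should be treated as an assumption; only `on_execution_started` is confirmed by the example above.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Assumed hook: 'before_goto' receives the Playwright page object before navigation.
    async def before_goto(page):
        # Standard Playwright API: attach extra HTTP headers to subsequent requests.
        await page.set_extra_http_headers({"X-Example-Header": "demo-value"})
        return page

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('before_goto', before_goto)
        result = await crawler.arun(url="https://httpbin.org/headers", bypass_cache=True)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```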
Speed Comparison 🚀
Crawl4AI is designed with speed as a primary focus. Our goal is to deliver the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
- Time taken: 7.02 seconds
- Content length: 42074 characters
- Images found: 49

Crawl4AI (simple crawl):
- Time taken: 1.60 seconds
- Content length: 18238 characters
- Images found: 49

Crawl4AI (with JavaScript execution):
- Time taken: 4.64 seconds
- Content length: 40869 characters
- Images found: 89
As you can see, Crawl4AI significantly outperforms Firecrawl:
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
- With JavaScript execution: even while executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
Documentation 📚
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
Contributing 🤝
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
License 📄
Crawl4AI is released under the Apache 2.0 License.