Crawl4AI

Crawl4AI is an open-source, LLM-friendly web crawler and data scraper supporting multiple LLM API keys, designed for AI pipelines, RAG, and knowledge base construction.

Deployed 112 times
Publisher: ronglecat
Created: 2025-06-13
Tags
Crawler, AI, Web

Crawl4AI

🚀🤖 Crawl4AI is an open-source, LLM-friendly web crawler and data scraper optimized for AI applications. It supports multiple mainstream LLM APIs and provides data extraction and processing capabilities suited to building AI pipelines, RAG systems, and knowledge bases.

Core Features

šŸ“ Markdown Generation

  • Generate clean, structured Markdown documents
  • AI-friendly content filtering with automatic noise removal
  • Smart citation management, converting links to numbered reference lists
  • Support for custom Markdown generation strategies
  • BM25-based filtering to extract the most relevant content
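
To make the citation feature above concrete, here is a minimal standard-library sketch of turning inline Markdown links into numbered references. It is not Crawl4AI's implementation; the function name `links_to_citations` and the output layout are assumptions chosen for illustration.

```python
import re

# Matches [text](url) but skips image syntax ![alt](url).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_citations(markdown: str) -> str:
    """Replace inline links with numbered markers and append a reference list.

    Hypothetical helper for illustration only.
    """
    refs: dict[str, int] = {}  # URL -> citation number, in first-seen order

    def _swap(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        # Reuse the existing number when the same URL appears again.
        number = refs.setdefault(url, len(refs) + 1)
        return f"{text} [{number}]"

    body = LINK_RE.sub(_swap, markdown)
    if not refs:
        return body
    ref_lines = [f"[{number}] {url}" for url, number in refs.items()]
    return body + "\n\nReferences\n" + "\n".join(ref_lines)
```

Repeated URLs share one number, which keeps the reference list short on link-heavy pages.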

📊 Structured Data Extraction

  • LLM-driven data extraction with open-source or proprietary models
  • Multiple chunking strategies: topic-based, regex-based, sentence-level
  • Semantic content retrieval based on cosine similarity
  • Fast CSS/XPath selector extraction
  • Custom schema definition for extracting structured JSON data
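
As a rough illustration of sentence-level chunking combined with cosine-similarity retrieval, the sketch below ranks chunks against a query using plain term counts. This is a simplification: it uses no embedding model, and the names `sentence_chunks` and `rank_chunks` are hypothetical, not part of Crawl4AI.

```python
import math
import re
from collections import Counter


def sentence_chunks(text: str) -> list[str]:
    """Split text into sentence-level chunks after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _vector(text: str) -> Counter:
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_chunks(text: str, query: str) -> list[tuple[float, str]]:
    """Return (score, chunk) pairs sorted from most to least similar to the query."""
    q = _vector(query)
    scored = [(cosine_similarity(_vector(c), q), c) for c in sentence_chunks(text)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Swapping `_vector` for an embedding function would give semantic rather than lexical matching while leaving the ranking logic unchanged.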

🌐 Browser Integration

  • Use your own browser instance to reduce the chance of bot detection
  • Chrome DevTools Protocol support for remote control
  • Browser profile management with saved authentication states and cookies
  • Session management for multi-step crawling
  • Proxy support with authentication
  • Full browser control: modify headers, cookies, user agents, etc.
  • Compatible with Chromium, Firefox, and WebKit
  • Dynamic viewport adjustment for complete rendering

🔎 Crawling & Scraping

  • Media support: extract images, audio, videos, and responsive image formats
  • Dynamic content crawling: execute JS scripts and wait for async content
  • Page screenshot functionality for debugging and analysis
  • Support for raw HTML and local file processing
  • Comprehensive link extraction: internal, external links, and iframe content
  • Custom hooks for customizing crawling behavior at each step
  • Smart caching mechanism to improve speed and avoid redundant requests
  • Metadata extraction and seamless iframe content extraction
  • Lazy load handling and full-page scanning for infinite scroll pages
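
The internal/external link split mentioned above can be sketched with the standard library alone. This is an illustrative approximation, not Crawl4AI's code; `LinkCollector` and `classify_links` are hypothetical names, and "internal" is assumed here to mean "same host as the page URL".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect <a href> targets and bucket them by host."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.internal: list[str] = []
        self.external: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip empty links, in-page anchors, and non-navigational schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        absolute = urljoin(self.base_url, href)  # resolve relative paths
        same_host = urlparse(absolute).netloc == self.base_host
        bucket = self.internal if same_host else self.external
        if absolute not in bucket:  # de-duplicate, preserving order
            bucket.append(absolute)


def classify_links(html: str, base_url: str) -> dict[str, list[str]]:
    """Return {'internal': [...], 'external': [...]} for the given HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return {"internal": collector.internal, "external": collector.external}
```

A stricter definition of "internal" (for example, treating subdomains as the same site) would only change the `same_host` comparison.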

🚀 Deployment Features

  • Docker-optimized image with built-in FastAPI server
  • JWT token authentication for API security
  • One-click API gateway deployment
  • Scalable architecture for large-scale production environments
  • Cloud deployment ready configurations

Supported LLM APIs

OpenAI, Anthropic, Deepseek, Groq, Together, Mistral, Gemini

Quick Start

  1. Fill in API keys for the LLM providers you plan to use (optional; only needed for AI-driven extraction features)
  2. After deployment, visit /playground for the interactive crawler interface
  3. Check the official documentation for more advanced usage

License

Apache-2.0