Crawl4AI and N8N: Web Scraping for RAG without Code

2025-02-05
ℹ️Note on the source

This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
n8n + Crawl4AI – Scrape ANY Website in Minutes with NO Code – YouTube.

Crawl4AI is an open-source web scraper designed to be LLM-friendly, simplifying the process of crawling websites and formatting the data for Retrieval-Augmented Generation (RAG) knowledge bases. This allows AI agents to access and utilize information from the web. While previous implementations required Python coding, a new approach leverages N8N, enabling web scraping and RAG implementation without writing any code.

Deploying Crawl4AI with Docker

To use Crawl4AI with N8N, it is essential to deploy it as an API endpoint using Docker. This allows N8N workflows to interact with Crawl4AI by sending requests to crawl specific websites. Crawl4AI then returns the scraped content as Markdown, along with extracted links, images, and the raw HTML, ready for processing and integration into a knowledge base.

Ethical Web Scraping

It's important to remember to scrape ethically. Many websites explicitly disallow scraping, and respecting these rules is crucial. Check the /robots.txt file of a website to understand its scraping policies and always adhere to the terms of use.

Using Sitemap.xml for Efficient URL Extraction

For websites with extensive documentation or e-commerce stores, a sitemap.xml file often provides a comprehensive list of all available web pages. By parsing this file within the N8N workflow, all relevant URLs can be extracted, allowing for targeted scraping with Crawl4AI.
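The sitemap-parsing step can also be sketched outside N8N in a few lines of Python. The sitemap content below is a made-up example; a real workflow would fetch it from the target site first:

```python
import xml.etree.ElementTree as ET

# Minimal example sitemap (in the N8N workflow this content would be
# fetched from the target website with an HTTP Request node).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

print(extract_urls(SITEMAP_XML))
```

Note the sitemap namespace: without the `sm` prefix mapping, `findall` would not match the `<loc>` elements.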

N8N Workflow

The N8N workflow involves several key steps:

  1. Fetching the sitemap.xml: The workflow starts by fetching the sitemap.xml file from the target website.
  2. Converting XML to JSON: The XML content is converted into JSON format for easier manipulation within N8N.
  3. Splitting into Individual Items: The list of URLs is split into individual items, allowing N8N to process each URL separately.
  4. Looping through URLs: N8N loops through each URL to scrape the content using Crawl4AI.
  5. Making API Requests to Crawl4AI: This involves configuring the HTTP request node with the Crawl4AI endpoint URL, authentication credentials (if required), and the target URL.
  6. Checking Task Status: Crawl4AI returns a task ID, requiring the workflow to check the task status periodically until it's completed.
  7. Inserting into Supabase: Once the scraping task is complete, the extracted Markdown content is inserted into a Supabase vector store for RAG.

Docker Deployment Options

There are several ways to deploy Crawl4AI with Docker:

  • Local AI Starter Kit: Integrate the Crawl4AI Docker image into a local AI starter kit.
  • Local or Cloud Instance: Run the Crawl4AI container locally or on the same cloud instance as N8N.
  • Dedicated Cloud Instance: Host Crawl4AI on a separate cloud instance to avoid resource contention with N8N.

Configuring the HTTP Request Node

The HTTP Request node in N8N needs to be configured with the correct URL for the Crawl4AI endpoint. If Crawl4AI is hosted locally, the URL should be http://localhost:11235/crawl. If it's deployed on a separate instance, use the appropriate external URL.

For secured endpoints, configure a generic header-auth credential with the name Authorization and the value Bearer <your_api_token>. The token must match the CRAWL4AI_API_TOKEN environment variable set during Docker deployment.
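The request that the HTTP Request node ends up sending can be reproduced in Python for debugging. The token value is a placeholder, and the `{"urls": ...}` body shape is an assumption about the Crawl4AI API; check the deployed version's docs for the exact schema:

```python
import json
import urllib.request

CRAWL4AI_URL = "http://localhost:11235/crawl"  # adjust for a remote deployment
API_TOKEN = "your_api_token"                   # placeholder; must match the token set at deploy time

def build_crawl_request(target_url: str) -> urllib.request.Request:
    """Build the POST request that N8N's HTTP Request node sends to /crawl."""
    body = json.dumps({"urls": target_url}).encode("utf-8")
    return urllib.request.Request(
        CRAWL4AI_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_crawl_request("https://example.com/docs/intro")
print(req.get_header("Authorization"))
```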

Interacting with the /crawl and /task Endpoints

When sending a request to the /crawl endpoint, Crawl4AI returns a task ID. This ID is used to query the /task endpoint to check the status of the scraping task. The workflow should include a loop that periodically checks the status until it's completed.
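A more defensive version of that status check caps the number of attempts and handles failed tasks. The status-fetching call is injected so the logic can be shown without a live instance; the "completed"/"failed" status strings are assumptions about the task lifecycle, and any other value is treated as still running:

```python
import time

def poll_task(get_status, task_id: str, max_attempts: int = 20,
              base_delay: float = 1.0) -> dict:
    """Poll the /task endpoint until the crawl completes, with linear backoff.

    get_status(task_id) stands in for a GET on /task/<task_id> and should
    return the decoded JSON status object.
    """
    for attempt in range(max_attempts):
        status = get_status(task_id)
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(f"crawl task {task_id} failed")
        time.sleep(base_delay * (attempt + 1))  # wait a bit longer each round
    raise TimeoutError(f"task {task_id} did not complete in {max_attempts} polls")
```

In N8N the same effect is achieved with a Wait node inside the status-checking loop.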

Improving RAG Performance

Several factors can impact the performance of the RAG system, including the chunk size used for splitting the text and the choice of embedding model. Experimenting with different settings can help optimize the system for specific use cases.
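Chunking can be experimented with outside N8N as well. A minimal character-based splitter with overlap looks like the sketch below; the default sizes are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some context
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side; larger chunks preserve more context per hit, smaller ones give more precise matches.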

The Significance

By integrating Crawl4AI with N8N, a powerful, no-code solution emerges for web scraping and RAG. This approach simplifies the process of building AI agents capable of leveraging vast amounts of online information, opening up new possibilities for automation and knowledge extraction. Which avenues will this open up in the future?
