"Mathematics is not a careful march down a well-cleared highway, but a journey into a strange wilderness, where the explorers often get lost." – W.S. Anglin

This script implements an asynchronous web crawler. It leverages asyncio for concurrent operations and aiohttp for high-performance HTTP requests. The core functionality revolves around a breadth-first search (BFS) algorithm that systematically explores internal links, ensuring comprehensive coverage of the target domain.
Pages deemed “relevant” are those where a pre-defined PHRASE (matched case-insensitively and whitespace-agnostically) is found within their title tag or their meta name="description" / meta property="og:description" attributes.
For each relevant page, the script extracts the title and description and converts the entire HTML body into clean Markdown, providing structured output for further analysis or display.
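The case-insensitive, whitespace-agnostic comparison relies on a small normalize_string helper that the script imports from a local util module, which is not shown in this post. A minimal, hypothetical sketch (assuming it simply lowercases text and collapses whitespace) could look like this:

import re

def normalize_string(text: str) -> str:
    # Hypothetical util.normalize_string: lowercase and collapse whitespace so that
    # phrase comparisons ignore case and spacing differences.
    return re.sub(r"\s+", " ", text).strip().lower()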
# Create a virtual environment
python -m venv .venv
# Activate it
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows PowerShell
# Install dependencies
pip install aiohttp bs4 python-dotenv markdownify rapidfuzz crawl4ai colorama
BASE_URL=http://localhost:1313
REQUEST_TIMEOUT=15
python web_crawler.py "search phrase"
#!/usr/bin/env python3
"""
web_crawler.py
Asynchronous BFS crawler that finds pages matching a phrase in their title/description and outputs full Markdown content.
Author: Máximo Núñez Alarcón
Copyright (c) 2023 Máximo Núñez Alarcón. This work is licensed under the terms of the MIT License. See LICENSE file for details.
Dependencies:
- `aiohttp`: For making asynchronous HTTP requests, enabling concurrent fetching of multiple pages.
- `BeautifulSoup` (bs4): For robust and flexible parsing of HTML and XML documents.
- `python-dotenv`: For securely loading configuration variables (like `BASE_URL` and `REQUEST_TIMEOUT`) from a `.env` file.
- `markdownify`: A utility for converting HTML documents into Markdown format.
- `crawl4ai`: An asynchronous web crawling library that provides advanced capabilities for content extraction, including Markdown conversion.
- `rapidfuzz`: For fast fuzzy string matching (token-set ratio) between the search phrase and page titles/descriptions.
- `colorama`: For coloured terminal status messages.
Usage:
- Set the BASE_URL and REQUEST_TIMEOUT in a .env file.
- Call the main asynchronous function with your desired search phrase to initiate crawling.
Example: python web_crawler.py "search phrase"
import asyncio
from web_crawler import main
if __name__ == "__main__":
asyncio.run(main("Your Search Phrase Here"))
"""
import asyncio # Core library for writing concurrent code using the async/await syntax.
import sys # Provides access to system-specific parameters and functions
from urllib.parse import urljoin, urldefrag, urlparse # For URL manipulation
import aiohttp # Asynchronous HTTP client/server framework, essential for non-blocking web requests.
import os # For interacting with the operating system
from bs4 import BeautifulSoup # For parsing HTML and XML documents
from markdownify import markdownify as md # Function for converting HTML document to Markdown format
from crawl4ai import AsyncWebCrawler # A specialized asynchronous web crawler class from the 'crawl4ai' library.
from rapidfuzz import fuzz # Fuzzy string matching (token_set_ratio) for title/description relevance scoring
from colorama import Fore, Style # Coloured terminal output used in main()
from dotenv import load_dotenv # For loading environment variables from .env files
from util import normalize_string # Importing a function to normalize strings
load_dotenv() # Load environment variables from .env file
base_url = os.getenv("BASE_URL") # Get the base URL from environment
# Set request timeout, default 15 seconds
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", 15))
if not base_url:
# Raise error if not set
raise ValueError("BASE_URL not set in .env file")
if not base_url.startswith(("http://", "https://")):
raise ValueError("BASE_URL must start with http:// or https://") # Validate URL format
async def fetch_html(session: aiohttp.ClientSession, url: str) -> str | None:
"""
Asynchronously fetches the HTML content from a given URL using an aiohttp client session.
It attempts to decode the content using the server-declared charset, falling back to
Latin-1 if decoding errors occur. Returns None on network errors, timeouts, or if
the content is not HTML.
Parameters:
- session: aiohttp.ClientSession - An active aiohttp client session, optimized for making multiple HTTP requests efficiently by reusing connections.
- url: str - the URL of the web page to fetch.
Returns:
- str | None: HTML content of the page as a string if successful or None if there was an error
"""
try:
# The 'async with' statement ensures the HTTP request is made asynchronously
# and the response object is properly closed, even if errors occur.
# 'session.get(url, timeout=REQUEST_TIMEOUT)' initiates the GET request.
# The 'await' keyword here pauses the execution of this 'fetch_html' coroutine
# until the HTTP response headers are received or a timeout occurs.
# While 'fetch_html' is paused, the asyncio event loop can execute other
# pending tasks, making the overall crawling process non-blocking and efficient.
async with session.get(url, timeout=REQUEST_TIMEOUT) as resp:
# Asynchronously get the URL (opens a connection and sends HTTP headers)
if resp.status != 200: # Check if the response status is not OK
return None # Return None if the HTTP status is not 200 OK (not successful)
ctype = resp.headers.get("content-type", "") # Get the content type from headers
if "text/html" not in ctype: # Check if the content type is HTML
return None # Return None if the content is not HTML
# If the content type is HTML, read the response body
# 'await resp.read()' asynchronously reads the full response body into bytes.
# It yields control to the event loop, allowing other tasks
# to run while the data is being downloaded from the network.
raw = await resp.read() # Read the response body as bytes
# Determine the correct character encoding. Prioritize server-declared charset,
# then aiohttp's inferred encoding, finally falling back to UTF-8.
encoding = resp.charset or resp.get_encoding() or "utf-8"
try:
# Attempt to decode the raw bytes using the determined encoding and return the HTML content.
return raw.decode(encoding)
except (LookupError, UnicodeDecodeError):
# If the primary decoding fails (e.g., unknown charset or malformed bytes),
# fall back to 'latin-1' and ignore errors to prevent crashes,
# as 'latin-1' can decode any byte sequence.
return raw.decode("latin-1", errors="ignore")
except (aiohttp.ClientError, asyncio.TimeoutError):
# Catch any network-related errors (e.g., connection issues, DNS failures)
# or explicit timeouts during the request. Return None to indicate failure.
return None
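# Standalone usage sketch for fetch_html (illustrative only; the crawler below calls it
# from crawl_website_for_relevance with a shared session):
#   async def _demo_fetch():
#       async with aiohttp.ClientSession() as session:
#           html = await fetch_html(session, base_url)
#           print("fetched" if html else "fetch failed or non-HTML")
#   asyncio.run(_demo_fetch())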
def extract_links(html: str, base_url: str) -> set[str]:
"""
Parses an HTML string to extract all valid, same-domain absolute links.
It removes URL fragments (e.g., #section) and resolves relative URLs against the provided base URL.
Parameters:
- html: str - the HTML content to parse and extract links.
- base_url: str - the base URL of the page being parsed to resolve relative links and filter for same-domain links.
Returns:
- set[str]: A unique set of absolute URLs found within the HTML that belong to the same domain as the `base_url`.
"""
soup = BeautifulSoup(html, "html.parser") # Parse HTML with BeautifulSoup
domain = urlparse(base_url).netloc # Extract the domain from base URL
links = set() # Initialize a set to store and automatically handle duplicate links
for a in soup.find_all("a", href=True): # Find all anchor tags that have an 'href' attribute.
raw = a["href"] # Get the raw value of the href attribute.
# 'urldefrag' splits a URL into its base part and its fragment identifier (e.g., #section), it returns (url, fragment)
# Strip off any fragment via urldefrag from the URL
cleaned, _ = urldefrag(raw)
# 'urljoin' combines a base URL with a (potentially relative) URL
# to produce or resolve to a full, absolute URL.
abs_url = urljoin(base_url, cleaned)
if urlparse(abs_url).netloc == domain:
# Check if the extracted absolute URL belongs to the same domain as our base_url.
# And keeps only those whose domain matches base_url.
links.add(abs_url)
# Add the valid, same-domain absolute URL to our set.
return links
# Return the set of unique, absolute, same-domain links.
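# Illustrative behaviour of the two stdlib helpers used above:
#   urldefrag("/docs/page#intro")                      -> ("/docs/page", "intro")
#   urljoin("https://example.com/blog/", "/docs/page") -> "https://example.com/docs/page"
# so fragments are dropped and relative paths are resolved before the same-domain check.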
# ---------------- Title / description / body helpers -------------------------------
def extract_metadata_and_body_html(html: str) -> tuple[str, str, str]:
"""
Extracts the page title, meta description (standard or OpenGraph) and the raw HTML content of the body tag from an HTML string.
Returns empty strings if missing.
Parameters:
- html: str - the HTML content to parse
Returns:
- tuple[str, str, str]: A tuple containing the extracted page title, description, and raw HTML content of the body tag or empty strings if not found
"""
soup = BeautifulSoup(html, "html.parser") # Parse HTML with BeautifulSoup
# Uses BeautifulSoup to pull out: title, meta description, and body
# 1. Extract the page title from the TITLE tag
# The .string gets the text content and .strip() removes leading/trailing whitespaces.
title = soup.title.string.strip() if soup.title and soup.title.string else ""
# 2. Extract the meta DESCRIPTION (standard or OpenGraph)
desc = ""
meta_name = soup.find("meta", attrs={"name": "description"}) # Find the standard meta description tag.
meta_og = soup.find("meta", attrs={"property": "og:description"}) # Find the OpenGraph description tag.
if meta_name and meta_name.get("content"):
# Check if meta description exists
desc = meta_name["content"].strip() # Get content from standard meta description if available.
elif meta_og and meta_og.get("content"):
# Check if OpenGraph description exists
desc = meta_og["content"].strip() # Get content from OpenGraph meta description if available
# 3. Extract raw HTML content of the BODY tag
body_element = soup.find('body') # Find the BODY element
# Convert the BeautifulSoup Tag object to its string representation (raw HTML).
body_html_string = str(body_element) if body_element else "" # Get body HTML as a string
return title, desc, body_html_string # Return title, description, and body HTML
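# Illustrative input/output for a minimal, hypothetical snippet:
#   extract_metadata_and_body_html(
#       "<html><head><title> Hi </title>"
#       "<meta name='description' content='Demo'></head>"
#       "<body><p>x</p></body></html>")
#   -> ("Hi", "Demo", "<body><p>x</p></body>")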
def fuzzy_phrase_in_title_or_desc(
html: str,
phrase: str,
threshold: float = 70.0
) -> bool:
"""
Return True if the token-set fuzzy match between `phrase` and the page's title OR meta-description meets or exceeds `threshold` (0–100).
In other words, the function checks whether a given phrase is sufficiently similar to either the title or the meta-description of an HTML page.
Uses RapidFuzz’s token_set_ratio which:
- Splits both strings into tokens
- Ignores the order of the tokens
- Ignores duplicates
- It is robust against differences in word order and count (extra/missing words)
- Handles punctuation and case differences
- Computes a normalized similarity score between 0 and 100
Parameters:
- html: str - A string containing the HTML content of a web page you want to parse
- phrase: str - The phrase you want to check for similarity against the title or description.
- threshold: float - A float value (default is 70.0) that defines the minimum similarity score required for a match.
The score ranges from 0 to 100.
Returns:
- bool: True if the phrase is sufficiently similar to title or description, False otherwise
"""
title, desc, _ = extract_metadata_and_body_html(html)
# Extracts the title and description from the HTML content using the extract_metadata_and_body_html function.
# If both title and description are empty, return False
if not title and not desc:
return False
# An empty phrase can never match, so return False; an invalid threshold is a programming error and raises ValueError
if not phrase:
return False
if not isinstance(threshold, (int, float)) or not (0 <= threshold <= 100):
raise ValueError("threshold must be a float between 0 and 100")
# Validate that title, description, and phrase are strings
# If not, raise ValueError
if not isinstance(title, str) or not isinstance(desc, str):
raise ValueError("title and description must be strings")
if not isinstance(phrase, str):
raise ValueError("phrase must be a string")
# Normalize the strings (title, description, and the phrase) to ensure consistent comparison
norm_title = normalize_string(title)
norm_desc = normalize_string(desc)
norm_phrase = normalize_string(phrase)
# Use RapidFuzz to compute the token-set ratio
# between the normalized phrase and both the title and description
score_title = fuzz.token_set_ratio(norm_phrase, norm_title)
score_desc = fuzz.token_set_ratio(norm_phrase, norm_desc)
# Finally, the function returns True if the highest score (from either the title or description)
# meets or exceeds the specified threshold; otherwise, it returns False.
return max(score_title, score_desc) >= threshold
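# Illustrative use of the token-set ratio (hedged example; exact scores depend on the inputs):
#   fuzz.token_set_ratio("learn to type fast", "how to learn to type fast | blog")  # high score
#   fuzz.token_set_ratio("learn to type fast", "contact us")                        # low score
# Word order and repeated/extra tokens are ignored, so a title containing the phrase plus
# additional words still clears a threshold in the 70-80 range used by this script.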
# ---------------- Main crawling loop ----------------------------------------
async def crawl_website_for_relevance(base_url: str, phrase: str):
"""
Performs a breadth-first crawl of the website starting from base_url.
It identifies pages where the `phrase` appears in the title or description and collects their metadata and raw HTML body.
Parameters:
- base_url: str - the starting URL for crawling
- phrase: str - the keyword(s) or phrase to search for in page titles and descriptions
Returns:
- list[dict]: A list of dictionaries,
where each dictionary contains the 'url', 'title', 'description', and 'body_html_string' for each relevant page found.
"""
queue: list[str] = [base_url] # Initialize the queue for BFS with the base URL
seen: set[str] = set() # A set to keep track of visited URLs to avoid infinite loops and redundant fetches
relevant_pages_info: list[dict] = [] # A list to store the extracted info about pages that match the criteria
# This creates an asynchronous HTTP client session for persistent HTTP connection.
# Using a single session for multiple requests is crucial for performance in asynchronous web crawling.
# It allows for connection pooling and reuse (connections stay open by default), reducing the overhead of
# establishing new TCP connections for every request. The 'async with' ensures
# the session is properly opened and closed.
async with aiohttp.ClientSession() as session: # Create an asynchronous session
while queue:
# The main simple breadth-first crawling loop continues as long as there are URLs in the queue
url = queue.pop(0) # Get the next URL from the queue (breadth-first).
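# Note: list.pop(0) is O(n); for large crawls, collections.deque with popleft() gives O(1)
# FIFO behaviour, but a plain list keeps this example simple and the semantics identical.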
if url in seen: # Skip this URL if already visited
continue
seen.add(url) # Mark the current URL as seen or visited
print(f"Crawling: {url}") # Provide real-time feedback on the crawling process
html = await fetch_html(session, url)
# Here’s where the asynchronous I/O happens.
# The 'await' keyword tells the Python interpreter to pause the execution of this 'crawl_website_for_relevance'
# coroutine at this point.
# While 'fetch_html(session, url)' is waiting for the network response (which can take time)
# The coroutine yields control while waiting for the network.
# The asyncio event loop is free to switch to and execute other tasks (e.g., processing other URLs in the queue)
# When the response arrives, the execution of 'crawl_website_for_relevance' resumes, reads bytes,
# decodes, and returns the HTML string.
if html is None: # Skip if fetching failed (e.g., network error, non-HTML content)
continue
if fuzzy_phrase_in_title_or_desc(html, phrase, 80.0): # Check if the phrase is sufficiently similar to the title or description
# Extract title, description, and body HTML content from the fetched page.
title, desc, body_html_string = extract_metadata_and_body_html(html)
relevant_pages_info.append({
"url": url,
"title": title,
"description": desc,
"body_html_string": body_html_string,
})
# If the page is relevant, store its details (add it to the list).
# Enqueue next links found on the current page
for link in extract_links(html, url): # Extract links from the current page
if link not in seen: # Only add unseen links to the queue
queue.append(link) # Add to the queue
return relevant_pages_info # Return the list of relevant pages
async def is_url_accessible(url: str, timeout: float = 5.0) -> bool:
"""
Quickly check if a URL responds with a 2xx or 3xx status.
Falls back to GET if HEAD isn’t allowed.
"""
try:
async with aiohttp.ClientSession() as session:
# First try a HEAD (lightweight). Some servers reject it.
resp = await session.head(url, timeout=timeout)
if 200 <= resp.status < 400:
return True
# If HEAD gave 405 Method Not Allowed or similar, try GET.
if resp.status in (405, 501):
resp = await session.get(url, timeout=timeout)
return 200 <= resp.status < 400
return False
except asyncio.TimeoutError:
return False
except aiohttp.ClientError:
return False
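# Design note: this pre-flight helper opens its own short-lived ClientSession so it stays
# self-contained; reusing the session from crawl_website_for_relevance would save one
# connection but couple the two functions together.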
async def main(target_query: str):
"""
Main function to initiate the crawling process.
Parameters:
- target_query: str - the phrase to search for in titles/descriptions
"""
print(f"Starting full website crawl from: {base_url}") # Log the starting point
# Pre-flight check
print(f"{Fore.YELLOW}Checking accessibility for {base_url}…{Style.RESET_ALL}")
accessible = await is_url_accessible(base_url)
if not accessible:
print(f"{Fore.RED}� {base_url} is not reachable. Aborting crawl.{Style.RESET_ALL}")
return
# Log and crawl
print(f"{Fore.GREEN} {base_url} is up — starting full website crawl.{Style.RESET_ALL}")
print(f"{Fore.CYAN}Starting full website crawl from: {base_url}{Style.RESET_ALL}")
# Step 1: Crawl the website to identify all relevant URLs and extract their titles/descriptions
relevant_pages_data = await crawl_website_for_relevance(base_url, target_query)
print(f"\nFound {len(relevant_pages_data)} relevant pages for '{target_query}'") # Log the count of relevant pages
# Step 2: Get full Markdown content for each relevant page using AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
# AsyncWebCrawler is the core class for asynchronous web crawling in Crawl4AI.
# It may set up its own HTTP session pool or other resources,
# ensuring efficient management of network connections during crawling.
# Initialize the web crawler.
for i, page_info in enumerate(relevant_pages_data): # Iterate over relevant pages
url = page_info["url"] # Extract URL
title = page_info["title"] # Extract title
description = page_info["description"] # Extract description
content = page_info["body_html_string"] # Extract body HTML content
print(f"\n{'='*90}") # Print separator
print(f"## MATCH: {url}") # Log the matched URL
print("="*90)
print(f"**Title:** {title or '[none]'}\n") # Log the title
print(f"**Description:** {description or '[none]'}\n") # Log the description
print(f"**Content:** {content or '[none]'}\n") # Log the body content
try:
# Use AsyncWebCrawler to get the markdown content.
# await crawler.arun(...) fetches the page (again), extracts clean Markdown, and returns it in result.markdown.
result = await crawler.arun(
url=url,
extraction_strategy="markdown" # Specify extraction strategy as markdown
)
print(result.markdown) # Print the extracted markdown
except Exception as e:
print(f"Error crawling {url} for markdown: {str(e)}") # Log any errors
print("="*90 + "\n") # Print separator
asyncio.run creates a fresh asyncio event loop (the object that manages and schedules the execution of asynchronous code in a single thread), sets it as the current event loop for the thread, runs the top-level main(…) coroutine to completion, and then closes and cleans up the loop, whether the coroutine finishes normally or raises an exception. Pressing Ctrl+C raises KeyboardInterrupt, which the guard below catches for a clean shutdown.
if __name__ == "__main__": # Entry point for the script
try:
# 'asyncio.run()' is the function that starts the asyncio event loop.
# It takes the top-level asynchronous function ('main' with the search phrase in this case)
# and runs it until it completes. This is the bridge from synchronous
# execution to the asynchronous world.
# Take the search phrase from the command line (see Usage above), defaulting to an example phrase.
asyncio.run(main(sys.argv[1] if len(sys.argv) > 1 else "How to learn to type fast"))
except KeyboardInterrupt:
# This block catches a KeyboardInterrupt (typically Ctrl+C), allowing
# the program to exit gracefully instead of crashing.
sys.exit() # Exit the program gracefully or cleanly