WebCrawler v2.0

🛠️ Technical Architecture (v2.0)

Overview

WebCrawler v2.0 is a full-stack Flask application designed for high-fidelity content extraction. It adheres to a Service-Oriented Architecture (SOA), separating core crawling logic from the web presentation layer.

📂 Project Structure

/workspaces/crawler
├── app/
│   ├── services/
│   │   └── crawler.py       # Core extraction logic & heuristic detection
│   ├── static/
│   │   ├── style.css        # Tailwind CSS customizations (Nexus theme)
│   │   └── script.js        # Frontend logic (QuillJS, file saving)
│   ├── templates/
│   │   ├── index.html       # Main application interface
│   │   └── doc.html         # Documentation renderer
│   ├── __init__.py          # Flask app factory
│   ├── routes.py            # HTTP route definitions
│   └── utils.py             # Helper functions (Tech Icon mapping)
├── docs/                    # Documentation files
│   ├── changelog.md
│   └── tech.md
├── run.py                   # Application entry point
├── .gitignore               # Git ignore rules
├── vercel.json              # Vercel deployment config
└── requirements.txt         # Python dependencies
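
The layout follows Flask's app-factory pattern: app/__init__.py builds the application object and run.py launches it. A minimal sketch of how the two typically fit together (the factory body and the blueprint name are assumptions, not the actual code):

# app/__init__.py -- minimal app-factory sketch (assumed; the real factory may differ)
from flask import Flask


def create_app() -> Flask:
    app = Flask(__name__)

    # Register the HTTP routes defined in app/routes.py
    # (assumes routes.py exposes a Blueprint named `bp`)
    from . import routes
    app.register_blueprint(routes.bp)

    return app


# run.py -- application entry point
from app import create_app

app = create_app()

if __name__ == '__main__':
    app.run(debug=True)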

🏗️ Technology Stack

Backend

- Python 3 with Flask (app factory in app/__init__.py)
- requests for fetching pages
- BeautifulSoup (html.parser) for HTML parsing and cleanup

Frontend

- Tailwind CSS with the custom Nexus theme (app/static/style.css)
- QuillJS and vanilla JavaScript for editing and file saving (app/static/script.js)

🧱 Key Components

1. Crawler Service (app/services/crawler.py)

This module does the heavy lifting for the application: it fetches the page, detects the technology stack, strips clutter, and converts the main content to Markdown.

from typing import Any, Dict

import requests
from bs4 import BeautifulSoup

# Browser-like request headers (the exact values used in production are an implementation detail)
headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebCrawler/2.0)'}


def crawl_url(url: str) -> Dict[str, Any]:
    """
    Crawls the given URL, extracts its main content as Markdown,
    and detects the technology stack used.
    """
    # 1. Detect tech stack from server-side signals
    tech_stack = _detect_tech_stack(url)

    # 2. Fetch & parse the page
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # 3. Analyze client-side heuristics (framework signatures in the HTML)
    more_tech = _analyze_client_side(url, soup)

    # 4. Strip clutter and convert the remaining content to Markdown
    _remove_clutter(soup)
    markdown = _extract_markdown(soup)

    return { 'success': True, 'markdown': markdown, ... }
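
The returned dictionary carries at minimum the success flag and the extracted Markdown (other fields are elided above). A quick usage sketch:

result = crawl_url('https://example.com')
if result['success']:
    print(result['markdown'][:200])  # preview of the extracted Markdown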

2. Heuristic Detection

We don't just rely on headers; we inspect the HTML for framework signatures.
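
A simplified sketch of what such a check can look like; the specific selectors below are illustrative assumptions, not the exact signatures used in crawler.py:

from bs4 import BeautifulSoup


def _analyze_client_side(url: str, soup: BeautifulSoup) -> list:
    """Inspect the parsed HTML for common framework fingerprints."""
    detected = []

    # React: root containers rendered by ReactDOM
    if soup.select_one('[data-reactroot]'):
        detected.append('React')

    # Next.js: the serialized page payload script
    if soup.find('script', id='__NEXT_DATA__'):
        detected.append('Next.js')

    # Vue: reactive attribute markers
    if soup.select_one('[data-v-app], [v-cloak]'):
        detected.append('Vue.js')

    # Angular: version attribute stamped on the root element
    if soup.select_one('[ng-version]'):
        detected.append('Angular')

    return detected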

3. Routes (app/routes.py)

Defines the HTTP endpoints: the main application interface and the documentation pages rendered from the Markdown files in docs/.
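
A hedged sketch of the two kinds of routes implied by the templates above: the main interface plus a crawl endpoint, and the documentation renderer backed by docs/. The route names, blueprint, and JSON contract are assumptions, not the actual code.

# app/routes.py -- illustrative sketch; the real endpoints may differ
from flask import Blueprint, jsonify, render_template, request

from .services.crawler import crawl_url

bp = Blueprint('main', __name__)


@bp.route('/')
def index():
    # Main application interface (templates/index.html)
    return render_template('index.html')


@bp.route('/crawl', methods=['POST'])
def crawl():
    # Accepts a URL and returns the extraction result as JSON
    url = request.get_json().get('url', '')
    return jsonify(crawl_url(url))


@bp.route('/docs/<name>')
def docs(name: str):
    # Documentation renderer (templates/doc.html) for files under docs/
    return render_template('doc.html', doc_name=name)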