Open Source · Python · CLI · MCP

HTTP-first web content
for LLM pipelines.

pulldown fetches URLs and converts them to clean, level-controlled Markdown — no browser, no heavy dependencies, no surprise tokens.

[Hero demo: raw HTML (nav, scripts, wrapper divs around the article) passes through pulldown's extract · convert · level pipeline and comes out as clean Markdown: title, body text, section headings.]
HTTP-first no browser required
4 detail levels minimal → full
SSRF guard safe in server contexts
MCP ready tool surface included

Everything a pipeline needs. Nothing it doesn't.

01
HTTP-First Fetching

Requests + BeautifulSoup. No browser, no Playwright, no heavyweight dependency chain. Fast cold starts, predictable memory.

requests · beautifulsoup4
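The HTTP-first pattern is simple to sketch. The following is an illustrative outline of the requests + BeautifulSoup approach, not pulldown's actual internals; `extract_text` and `fetch_text` are hypothetical names:

```python
import requests
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Drop boilerplate tags, then return the readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove the element and everything inside it
    return soup.get_text(separator="\n", strip=True)

def fetch_text(url: str, timeout: float = 10.0) -> str:
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return extract_text(resp.text)
```

Because there is no browser process, cold start is just an HTTP round trip plus a parse.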
02
Detail Levels

Four output modes — minimal, readable, full, raw — so you control how many tokens reach the model.

minimal · readable · full · raw
03
Batch Fetching

Pass a list of URLs and get back a list of results. Concurrency controlled, errors surfaced per-item, not globally.

concurrent · per-item errors
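Per-item error surfacing falls naturally out of `asyncio.gather(..., return_exceptions=True)`. A sketch of the pattern; the `fetch_one` stub stands in for a real fetch, and none of these names are pulldown's API:

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Stub fetch: fails for one URL to demonstrate per-item error surfacing.
    if "bad" in url:
        raise ValueError(f"unreachable: {url}")
    return f"content of {url}"

async def fetch_all(urls: list[str], concurrency: int = 5) -> list:
    sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight fetches

    async def bounded(url: str):
        async with sem:
            return await fetch_one(url)

    # return_exceptions=True: a failure becomes a value in its own slot
    # instead of cancelling the whole batch.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)

results = asyncio.run(fetch_all(["https://a.com", "https://bad.example"]))
```

Each slot in `results` is either a result or the exception for that URL, so one dead link never costs you the other nine pages.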
04
Bounded Crawl

Follow links from a seed URL up to a configurable depth and page limit. Same-domain by default. No runaway scrapes.

depth · page-limit · same-domain
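The bounding logic itself is a small breadth-first traversal. A sketch under stated assumptions: `get_links` is a caller-supplied function, and nothing here is pulldown's internal API:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seed: str, get_links, max_depth: int = 2, max_pages: int = 20) -> list[str]:
    """Breadth-first crawl bounded by depth, page count, and the seed's domain."""
    domain = urlparse(seed).netloc
    seen, visited = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # at the depth limit: record the page but don't expand it
        for link in get_links(url):
            # same-domain by default; never enqueue a URL twice
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages:
graph = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://other.com/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
}
pages = crawl("https://docs.example.com/", lambda u: graph.get(u, []), max_depth=1)
```

The two caps and the domain filter together are what make a crawl terminate predictably.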
05
Validator Caching

Uses ETag and Last-Modified headers to skip re-fetching unchanged pages. Correct conditional GET, not a naïve TTL.

ETag · Last-Modified · 304
06
SSRF Guard

Resolves hostnames to IP addresses before connecting and blocks private, loopback, and link-local ranges. Safe in server contexts.

SSRF · pre-connect · IP check

One fetch. Four resolutions.

Every fetch accepts a detail_level parameter. Pick the resolution that matches your pipeline's token budget.

# Why HTTP-First Fetching Works

Main article text only. No nav, no sidebars, no ads.
Lowest token count. Best for classification or triage.

Title + core body paragraphs. Navigation, headers, footers, and repeated boilerplate stripped.

# Why HTTP-First Fetching Works

**Published:** 2025-11-04 · **Author:** Anthony Maio

Main article text with structure preserved. Section headings, lists,
and inline code blocks included. Images referenced as alt text.

## Section Heading

Content continues here with headings and lists intact.

Title, metadata, body paragraphs, headings, lists, inline code. Images become alt-text references.

# Why HTTP-First Fetching Works

**Published:** 2025-11-04 · **Author:** Anthony Maio · **Tags:** python, llm

Main article text. All structure preserved including code blocks,
tables, blockquotes, and footnotes.

```python
result = await client.fetch("https://example.com")
```

| Column A | Column B |
|----------|----------|
| value    | value    |

> Blockquote text preserved verbatim.

Full content including code blocks, tables, blockquotes, footnotes. No boilerplate stripped. Highest fidelity.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Why HTTP-First Fetching Works</title>
    <meta name="description" content="...">
  </head>
  <body>
    <article>...</article>
  </body>
</html>
```

Raw HTML returned as-is. No conversion. For pipelines that need to do their own extraction or diffing.

| Level    | Headings   | Lists | Code blocks | Tables | Images   | Nav / footer |
|----------|------------|-------|-------------|--------|----------|--------------|
| minimal  | title only | ✗     | ✗           | ✗      | ✗        | stripped     |
| readable | ✓          | ✓     | inline only | ✗      | alt text | stripped     |
| full     | ✓          | ✓     | ✓           | ✓      | alt text | stripped     |
| raw      | raw HTML returned, no conversion applied | | | | | |

Install. Fetch. Done.

Install
pip install pulldown
pip install 'pulldown[render]'   # + Playwright
pip install 'pulldown[mcp]'      # + MCP server
CLI
pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render
pulldown crawl https://docs.example.com --max-pages 20
Python API
import asyncio
from pulldown import fetch, fetch_many, Detail

async def main():
    # Single fetch
    result = await fetch(
        "https://example.com/article",
        detail=Detail.readable,
    )
    print(result.title)
    print(result.content)

    # Batch
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        detail=Detail.minimal,
        concurrency=5,
    )
    for r in results:
        print(r.url, len(r.content))

asyncio.run(main())

Safe to run in server contexts.

pulldown is designed for use inside MCP servers and other networked services where arbitrary user-supplied URLs are a real threat surface.

SSRF Guard

Resolves hostnames to IP before opening the connection. Blocks private (RFC 1918), loopback, link-local, and reserved ranges. Every redirect hop is validated — a public URL that 302s to 127.0.0.1 is blocked before the redirect is followed.

pre-connect IP resolution · redirect chain validated · configurable allow-list
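The per-hop check can be sketched in a few lines. This illustrates the technique, not pulldown's implementation; `assert_public` and `safe_get` are hypothetical names:

```python
import ipaddress
import socket
from urllib.parse import urljoin, urlparse

import requests

def assert_public(url: str) -> None:
    """Resolve the host and refuse private, loopback, link-local, reserved IPs."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host in {url!r}")
    for info in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise ValueError(f"blocked address {ip} for {url}")

def safe_get(url: str, max_redirects: int = 5) -> requests.Response:
    for _ in range(max_redirects + 1):
        assert_public(url)  # validate EVERY hop, including the first
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.is_redirect:
            url = urljoin(url, resp.headers["Location"])  # re-check before following
            continue
        return resp
    raise ValueError("too many redirects")
```

Disabling automatic redirects and looping manually is what makes the redirect chain checkable: each `Location` target is re-resolved before it is ever contacted.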
Size Caps

Response bodies are capped before they reach memory. Configurable per-request maximum. A malicious or unexpectedly large page cannot exhaust the server's heap.

max_bytes · streaming truncation · no OOM risk
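The streaming truncation pattern, sketched with requests; `cap_stream` and `fetch_capped` are illustrative names, not pulldown's API:

```python
from typing import Iterable

import requests

def cap_stream(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks until the cap, then truncate and stop reading."""
    out, total = [], 0
    for chunk in chunks:
        if total + len(chunk) >= max_bytes:
            out.append(chunk[: max_bytes - total])  # keep only what still fits
            break
        out.append(chunk)
        total += len(chunk)
    return b"".join(out)

def fetch_capped(url: str, max_bytes: int = 1_000_000) -> bytes:
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        # iter_content reads lazily, so bytes past the cap are never buffered
        return cap_stream(resp.iter_content(chunk_size=65536), max_bytes)
```

The key design point is `stream=True`: the body is consumed chunk by chunk, so breaking out of the loop abandons the remainder instead of holding it in memory.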
Scheme Validation

Only http:// and https:// URLs are accepted. file://, gopher://, ftp://, and other schemes are rejected before a connection is attempted.

scheme allowlist · early rejection · redirect chain validated
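Early scheme rejection is only a few lines with `urllib.parse`. A sketch; `check_scheme` is an illustrative name:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def check_scheme(url: str) -> str:
    """Reject non-HTTP(S) URLs before any socket is opened."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme {scheme!r} not allowed for {url}")
    return url
```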

CLI, Python, or MCP — pick your surface.

CLI
terminal
pulldown get https://example.com \
  --detail readable

pulldown crawl https://docs.example.com \
  --max-depth 2 \
  --max-pages 20 \
  --detail minimal
Python API
async
from pulldown import fetch, Detail

result = await fetch(
    url,
    detail=Detail.readable,
    max_bytes=1_000_000,
)
MCP Server
Claude · Codex · agents
{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}

Why not just use requests?

httpx and requests get you the bytes. pulldown gets you Markdown with structure intact, detail control, and the safety layer a server context requires.

| Tool | Markdown output | Detail levels | SSRF guard | Batch API | MCP surface | No browser needed |
|------|-----------------|---------------|------------|-----------|-------------|-------------------|
| pulldown | ✓ | 4 levels | ✓ | ✓ | ✓ | ✓ |
| httpx / requests | manual | ✗ | manual | ✗ | ✗ | ✓ |
| requests + BS4 | manual | manual | manual | ✗ | ✗ | ✓ |
| Playwright / Puppeteer | manual | limited | ✗ | ✗ | ✗ | browser required |