Open Source · Python · CLI · MCP

HTTP-first web content
for LLM pipelines.

pulldown fetches URLs and converts them to clean, level-controlled Markdown — no browser, no heavy dependencies, no surprise tokens.

[Hero demo: raw HTML (nav, scripts, wrapper divs around the article) passes through pulldown's extract · convert · level pipeline and comes out as clean Markdown: title, body text, section headings.]
HTTP-first no browser required
4 detail levels minimal → full
SSRF guard safe in server contexts
MCP ready tool surface included

Everything a pipeline needs. Nothing it doesn't.

01
HTTP-First Fetching

Requests + BeautifulSoup. No browser, no Playwright, no heavyweight dependency chain. Fast cold starts, predictable memory.

requests · beautifulsoup4
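The HTTP-first pattern is simple to sketch. The following is an illustrative outline of the requests + BeautifulSoup approach, not pulldown's actual internals; `extract_text` and `fetch_text` are hypothetical names:

```python
import requests
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Drop boilerplate tags, then return the readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove the element and everything inside it
    return soup.get_text(separator="\n", strip=True)

def fetch_text(url: str, timeout: float = 10.0) -> str:
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return extract_text(resp.text)
```

Because there is no browser process, cold start is just an HTTP round trip plus a parse.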
02
Detail Levels

Four output modes — minimal, readable, full, raw — so you control how many tokens reach the model.

minimal · readable · full · raw
03
Batch Fetching

Pass a list of URLs and get back a list of results. Concurrency controlled, errors surfaced per-item, not globally.

concurrent · per-item errors
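Per-item error surfacing falls naturally out of `asyncio.gather(..., return_exceptions=True)`. A sketch of the pattern; the `fetch_one` stub stands in for a real fetch, and none of these names are pulldown's API:

```python
import asyncio

async def fetch_one(url: str) -> str:
    # Stub fetch: fails for one URL to demonstrate per-item error surfacing.
    if "bad" in url:
        raise ValueError(f"unreachable: {url}")
    return f"content of {url}"

async def fetch_all(urls: list[str], concurrency: int = 5) -> list:
    sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight fetches

    async def bounded(url: str):
        async with sem:
            return await fetch_one(url)

    # return_exceptions=True: a failure becomes a value in its own slot
    # instead of cancelling the whole batch.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)

results = asyncio.run(fetch_all(["https://a.com", "https://bad.example"]))
```

Each slot in `results` is either a result or the exception for that URL, so one dead link never costs you the other nine pages.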
04
Bounded Crawl

Follow links from a seed URL up to a configurable depth and page limit. Same-domain by default. No runaway scrapes.

depth · page-limit · same-domain
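The bounding logic itself is a small breadth-first traversal. A sketch under stated assumptions: `get_links` is a caller-supplied function, and nothing here is pulldown's internal API:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seed: str, get_links, max_depth: int = 2, max_pages: int = 20) -> list[str]:
    """Breadth-first crawl bounded by depth, page count, and the seed's domain."""
    domain = urlparse(seed).netloc
    seen, visited = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # at the depth limit: record the page but don't expand it
        for link in get_links(url):
            # same-domain by default; never enqueue a URL twice
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages:
graph = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://other.com/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
}
pages = crawl("https://docs.example.com/", lambda u: graph.get(u, []), max_depth=1)
```

The two caps and the domain filter together are what make a crawl terminate predictably.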
05
Validator Caching

Uses ETag and Last-Modified headers to skip re-fetching unchanged pages. Correct conditional GET, not a naïve TTL.

ETag · Last-Modified · 304
06
SSRF Guard

Resolves hostnames to IP addresses before connecting and blocks private, loopback, and link-local ranges. Safe in server contexts.

SSRF · pre-connect · IP check

One fetch. Four resolutions.

Every fetch accepts a detail_level parameter. Pick the resolution that matches your pipeline's token budget.

# Why HTTP-First Fetching Works

Main article text only. No nav, no sidebars, no ads.
Lowest token count. Best for classification or triage.

Title + core body paragraphs. Navigation, headers, footers, and repeated boilerplate stripped.

# Why HTTP-First Fetching Works

**Published:** 2025-11-04 · **Author:** Anthony Maio

Main article text with structure preserved. Section headings, lists,
and inline code blocks included. Images referenced as alt text.

## Section Heading

Content continues here with headings and lists intact.

Title, metadata, body paragraphs, headings, lists, inline code. Images become alt-text references.

# Why HTTP-First Fetching Works

**Published:** 2025-11-04 · **Author:** Anthony Maio · **Tags:** python, llm

Main article text. All structure preserved including code blocks,
tables, blockquotes, and footnotes.

```python
result = await client.fetch("https://example.com")
```

| Column A | Column B |
|----------|----------|
| value    | value    |

> Blockquote text preserved verbatim.

Full content including code blocks, tables, blockquotes, footnotes. No boilerplate stripped. Highest fidelity.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Why HTTP-First Fetching Works</title>
    <meta name="description" content="...">
  </head>
  <body>
    <article>...</article>
  </body>
</html>
```

Raw HTML returned as-is. No conversion. For pipelines that need to do their own extraction or diffing.

| Level    | Headings   | Lists | Code blocks | Tables | Images   | Nav / footer |
|----------|------------|-------|-------------|--------|----------|--------------|
| minimal  | title only | ✗     | ✗           | ✗      | ✗        | stripped     |
| readable | ✓          | ✓     | inline only | ✗      | alt text | stripped     |
| full     | ✓          | ✓     | ✓           | ✓      | alt text | stripped     |
| raw      | raw HTML returned, no conversion applied | | | | | |

Install. Fetch. Done.

Install
pip install pulldown
pip install 'pulldown[render]'   # + Playwright
pip install 'pulldown[mcp]'      # + MCP server
CLI
pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render
pulldown crawl https://docs.example.com --max-pages 20
Python API
import asyncio
from pulldown import fetch, fetch_many, Detail

async def main():
    # Single fetch
    result = await fetch(
        "https://example.com/article",
        detail=Detail.readable,
    )
    print(result.title)
    print(result.content)

    # Batch
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        detail=Detail.minimal,
        concurrency=5,
    )
    for r in results:
        print(r.url, len(r.content))

asyncio.run(main())

Safe to run in server contexts.

pulldown is designed for use inside MCP servers and other networked services where arbitrary user-supplied URLs are a real threat surface.

SSRF Guard

Resolves hostnames to IP before opening the connection. Blocks private (RFC 1918), loopback, link-local, and reserved ranges. Every redirect hop is validated — a public URL that 302s to 127.0.0.1 is blocked before the redirect is followed.

pre-connect IP resolution · redirect chain validated · configurable allow-list
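The per-hop check can be sketched in a few lines. This illustrates the technique, not pulldown's implementation; `assert_public` and `safe_get` are hypothetical names:

```python
import ipaddress
import socket
from urllib.parse import urljoin, urlparse

import requests

def assert_public(url: str) -> None:
    """Resolve the host and refuse private, loopback, link-local, reserved IPs."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host in {url!r}")
    for info in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise ValueError(f"blocked address {ip} for {url}")

def safe_get(url: str, max_redirects: int = 5) -> requests.Response:
    for _ in range(max_redirects + 1):
        assert_public(url)  # validate EVERY hop, including the first
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.is_redirect:
            url = urljoin(url, resp.headers["Location"])  # re-check before following
            continue
        return resp
    raise ValueError("too many redirects")
```

Disabling automatic redirects and looping manually is what makes the redirect chain checkable: each `Location` target is re-resolved before it is ever contacted.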
Size Caps

Response bodies are capped before they reach memory. Configurable per-request maximum. A malicious or unexpectedly large page cannot exhaust the server's heap.

max_bytes · streaming truncation · no OOM risk
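The streaming truncation pattern, sketched with requests; `cap_stream` and `fetch_capped` are illustrative names, not pulldown's API:

```python
from typing import Iterable

import requests

def cap_stream(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks until the cap, then truncate and stop reading."""
    out, total = [], 0
    for chunk in chunks:
        if total + len(chunk) >= max_bytes:
            out.append(chunk[: max_bytes - total])  # keep only what still fits
            break
        out.append(chunk)
        total += len(chunk)
    return b"".join(out)

def fetch_capped(url: str, max_bytes: int = 1_000_000) -> bytes:
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        # iter_content reads lazily, so bytes past the cap are never buffered
        return cap_stream(resp.iter_content(chunk_size=65536), max_bytes)
```

The key design point is `stream=True`: the body is consumed chunk by chunk, so breaking out of the loop abandons the remainder instead of holding it in memory.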
Scheme Validation

Only http:// and https:// URLs are accepted. file://, gopher://, ftp://, and other schemes are rejected before a connection is attempted.

scheme allowlist · early rejection · redirect chain validated
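Early scheme rejection is only a few lines with `urllib.parse`. A sketch; `check_scheme` is an illustrative name:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def check_scheme(url: str) -> str:
    """Reject non-HTTP(S) URLs before any socket is opened."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme {scheme!r} not allowed for {url}")
    return url
```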

CLI, Python, or MCP — pick your surface.

CLI
terminal
pulldown get https://example.com \
  --detail readable

pulldown crawl https://docs.example.com \
  --max-depth 2 \
  --max-pages 20 \
  --detail minimal
Python API
async
from pulldown import fetch, Detail

result = await fetch(
    url,
    detail=Detail.readable,
    max_bytes=1_000_000,
)
MCP Server
Claude · Codex · agents
{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}

Why not just use requests?

httpx and requests get you the bytes. pulldown gets you Markdown with structure intact, detail control, and the safety layer a server context requires.

| Tool | Markdown output | Detail levels | SSRF guard | Batch API | MCP surface | No browser needed |
|------|-----------------|---------------|------------|-----------|-------------|-------------------|
| pulldown | ✓ | 4 levels | ✓ | ✓ | ✓ | ✓ |
| httpx / requests | manual | ✗ | manual | ✗ | ✗ | ✓ |
| requests + BS4 | manual | manual | manual | ✗ | ✗ | ✓ |
| Playwright / Puppeteer | manual | limited | ✗ | ✗ | ✗ | browser required |