HTTP-first web content for LLM pipelines.

pulldown fetches URLs and converts them to clean, level-controlled Markdown — no browser, no heavy dependencies, no surprise tokens.

Under the hood: Requests + BeautifulSoup. No browser, no Playwright, no heavyweight dependency chain. Fast cold starts, predictable memory.
Everything a pipeline needs. Nothing it doesn't.
Four output modes — minimal, readable, full, raw — so you control how many tokens reach the model.
Pass a list of URLs and get back a list of results. Concurrency controlled, errors surfaced per-item, not globally.
Follow links from a seed URL up to a configurable depth and page limit. Same-domain by default. No runaway scrapes.
Uses ETag and Last-Modified headers to skip re-fetching unchanged pages. Correct conditional GET, not a naïve TTL.
Resolves hostnames to IP addresses before connecting and blocks private, loopback, and link-local ranges. Safe in server contexts.
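The conditional-GET behavior described above can be modeled roughly as follows. This is an illustrative sketch, not pulldown's implementation: the `transport` callable is injected so the revalidation logic is visible without a network, and `RevalidatingCache` and `conditional_headers` are names invented here.

```python
def conditional_headers(validators):
    """Build conditional-GET headers from stored validators.
    `validators` is an (etag, last_modified) tuple or None."""
    headers = {}
    if validators:
        etag, last_modified = validators
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
    return headers


class RevalidatingCache:
    """Sketch of an ETag/Last-Modified revalidating cache.
    `transport(url, headers)` must return (status, response_headers,
    body); injecting it keeps the logic testable without a network."""

    def __init__(self, transport):
        self.transport = transport
        self._store = {}  # url -> ((etag, last_modified), body)

    def get(self, url):
        entry = self._store.get(url)
        headers = conditional_headers(entry[0] if entry else None)
        status, resp_headers, body = self.transport(url, headers)
        if status == 304 and entry:
            return entry[1]  # unchanged: serve the cached body
        validators = (resp_headers.get("ETag"),
                      resp_headers.get("Last-Modified"))
        self._store[url] = (validators, body)
        return body
```

The second request for a URL carries the stored validators; a 304 means the cached body is served without re-downloading.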
One fetch. Four resolutions.
Every fetch accepts a `detail` parameter. Pick the resolution that matches your pipeline's token budget.
# Why HTTP-First Fetching Works
Main article text only. No nav, no sidebars, no ads.
Lowest token count. Best for classification or triage.
Title + core body paragraphs. Navigation, headers, footers, and repeated boilerplate stripped.
# Why HTTP-First Fetching Works
**Published:** 2025-11-04 · **Author:** Anthony Maio
Main article text with structure preserved. Section headings, lists,
and inline code blocks included. Images referenced as alt text.
## Section Heading
Content continues here with headings and lists intact.
Title, metadata, body paragraphs, headings, lists, inline code. Images become alt-text references.
# Why HTTP-First Fetching Works
**Published:** 2025-11-04 · **Author:** Anthony Maio · **Tags:** python, llm
Main article text. All structure preserved including code blocks,
tables, blockquotes, and footnotes.
```python
result = await client.fetch("https://example.com")
```
| Column A | Column B |
|----------|----------|
| value | value |
> Blockquote text preserved verbatim.
Full content including code blocks, tables, blockquotes, footnotes. No boilerplate stripped. Highest fidelity.
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Why HTTP-First Fetching Works</title>
    <meta name="description" content="...">
  </head>
  <body>
    <article>...</article>
  </body>
</html>
Raw HTML returned as-is. No conversion. For pipelines that need to do their own extraction or diffing.
| Level | Headings | Lists | Code blocks | Tables | Images | Nav / footer |
|---|---|---|---|---|---|---|
| minimal | title only | — | — | — | — | stripped |
| readable | ✓ | ✓ | inline | — | alt text | stripped |
| full | ✓ | ✓ | ✓ | ✓ | ✓ | stripped |
| raw | raw HTML returned, no conversion applied | | | | | |
Install. Fetch. Done.
pip install pulldown
pip install 'pulldown[render]' # + Playwright
pip install 'pulldown[mcp]' # + MCP server
pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render
pulldown crawl https://docs.example.com --max-pages 20
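The crawl limits shown above (same-domain by default, bounded depth and page count) can be sketched as a breadth-first traversal. This is an illustration of the semantics, not pulldown's crawler; `crawl_plan` and the injected `get_links` callable are invented here.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_plan(seed, get_links, max_depth=2, max_pages=20):
    """Same-domain, breadth-first crawl with depth and page limits.
    `get_links(url) -> list[str]` is injected so the traversal logic
    is testable without fetching anything."""
    domain = urlparse(seed).netloc
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # never expand links past the depth limit
        for link in get_links(url):
            if urlparse(link).netloc != domain:
                continue  # same-domain by default: no runaway scrapes
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Both limits are enforced before work happens: the page cap stops the loop, and the depth cap stops link expansion, so the frontier can never grow unboundedly.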
import asyncio

from pulldown import fetch, fetch_many, Detail


async def main():
    # Single fetch
    result = await fetch(
        "https://example.com/article",
        detail=Detail.readable,
    )
    print(result.title)
    print(result.content)

    # Batch
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        detail=Detail.minimal,
        concurrency=5,
    )
    for r in results:
        print(r.url, len(r.content))


asyncio.run(main())
Safe to run in server contexts.
pulldown is designed for use inside MCP servers and other networked services where arbitrary user-supplied URLs are a real threat surface.
Resolves hostnames to IP addresses before opening the connection. Blocks private (RFC 1918), loopback, link-local, and reserved ranges. Every redirect hop is validated — a public URL that 302s to 127.0.0.1 is blocked before the redirect is followed.
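The range checks can be sketched with the standard library's `ipaddress` module. This is an illustration of the technique, not pulldown's actual code; `is_blocked` and `resolve_checked` are names invented here.

```python
import ipaddress
import socket


def is_blocked(ip_str):
    """True if the address falls in a range a server should never
    fetch from: private (RFC 1918), loopback, link-local, reserved,
    or multicast."""
    ip = ipaddress.ip_address(ip_str)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast)


def resolve_checked(host):
    """Resolve a hostname and reject it if any resolved address is
    blocked, so DNS cannot smuggle in an internal target."""
    addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    for addr in addrs:
        if is_blocked(addr):
            raise ValueError(f"blocked address for {host}: {addr}")
    return addrs
```

Checking every resolved address (not just the first) matters: an attacker-controlled DNS name can return a mix of public and internal records.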
Response bodies are capped before they reach memory. Configurable per-request maximum. A malicious or unexpectedly large page cannot exhaust the server's heap.
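A body cap of this kind can be sketched over any stream of chunks; the point is that the check runs per chunk, so an oversized tail is never fully buffered. `read_capped` and `BodyTooLarge` are names invented for this sketch.

```python
class BodyTooLarge(Exception):
    pass


def read_capped(chunks, max_bytes=1_000_000):
    """Accumulate a streamed response body, aborting as soon as the
    running total exceeds the cap. `chunks` is any iterable of bytes
    (for example, a streaming HTTP response)."""
    received = 0
    parts = []
    for chunk in chunks:
        received += len(chunk)
        if received > max_bytes:
            raise BodyTooLarge(f"body exceeded {max_bytes} bytes")
        parts.append(chunk)
    return b"".join(parts)
```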
Only http:// and https:// URLs are accepted. file://, gopher://, ftp://, and other schemes are rejected before a connection is attempted.
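A scheme allowlist of this shape is a one-liner with `urllib.parse`; the sketch below is illustrative, and `check_scheme` is a name invented here.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def check_scheme(url):
    """Reject any URL whose scheme is not plain HTTP(S) before a
    connection is ever attempted."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {scheme or '(none)'}")
    return url
```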
CLI, Python, or MCP — pick your surface.
pulldown get https://example.com \
  --detail readable

pulldown crawl https://docs.example.com \
  --max-depth 2 \
  --max-pages 20 \
  --detail minimal
from pulldown import fetch, Detail

result = await fetch(
    url,
    detail=Detail.readable,
    max_bytes=1_000_000,
)
{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}
Why not just use requests?
httpx and requests get you the bytes. pulldown gets you Markdown with structure intact, detail control, and the safety layer a server context requires.
| Tool | Markdown output | Detail levels | SSRF guard | Batch API | MCP surface | No browser needed |
|---|---|---|---|---|---|---|
| pulldown | ✓ | ✓ 4 levels | ✓ | ✓ | ✓ | ✓ |
| httpx / requests | — | — | manual | manual | — | ✓ |
| requests + BS4 | manual | — | manual | manual | — | ✓ |
| Playwright / Puppeteer | manual | — | — | limited | — | browser required |
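For contrast, the "manual" cells in the table look roughly like this in practice. The sketch below uses only the stdlib `html.parser` (a stand-in for BeautifulSoup); `TextExtractor` and `extract_text` are names invented here, and everything pulldown layers on top (detail levels, SSRF checks, caching, batch) would also be yours to build.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hand-rolled boilerplate stripping: skip script/style/nav
    chrome and collect the remaining text nodes."""

    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```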