Metadata-Version: 2.1
Name: abx-dl
Version: 1.11.142
Summary: All-in-one CLI tool to download and extract content from URLs
Keywords: scraping,crawling,downloading,internet archiving,web archiving,digipres,warc,preservation,backups,archiving,web,bookmarks,puppeteer,browser,download
Author: Nick Sweeting, ArchiveBox
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Legal Industry
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Internet :: WWW/HTTP :: WSGI :: Application
Classifier: Topic :: Sociology :: History
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Archiving
Classifier: Topic :: System :: Archiving :: Backup
Classifier: Topic :: System :: Recovery Tools
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Project-URL: Homepage, https://github.com/ArchiveBox/abx-dl
Project-URL: Source, https://github.com/ArchiveBox/abx-dl
Project-URL: Documentation, https://github.com/ArchiveBox/abx-dl
Project-URL: Bug Tracker, https://github.com/ArchiveBox/abx-dl/issues
Project-URL: Changelog, https://github.com/ArchiveBox/abx-dl/releases
Project-URL: Community, https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community
Project-URL: Donate, https://github.com/ArchiveBox/ArchiveBox/wiki/Donations
Requires-Python: >=3.12
Requires-Dist: rich-click>=1.8.0
Requires-Dist: rich>=13.0.0
Requires-Dist: abxbus==2.5.8
Requires-Dist: abxpkg>=1.11.139
Requires-Dist: abx-plugins>=1.11.142
Requires-Dist: jambo>=0.1.7
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: requests>=2.28.0
Requires-Dist: psutil>=7.2.1
Description-Content-Type: text/markdown

# ⬇️ `abx-dl`

> A simple all-in-one CLI tool to auto-detect and download *everything* available from a URL.

```bash
uvx abx-dl 'https://example.com'

# OR

docker run -it -v "$PWD:/out" -v abxdl-personas:/data/personas archivebox/abxdl 'https://example.com'
```
---

✨ *Ever wish you could `yt-dlp`, `gallery-dl`, `wget`, `curl`, `puppeteer`, etc. all in one command?*

`abx-dl` is an all-in-one CLI tool for downloading URLs "by any means necessary".

It's useful for scraping, downloading, OSINT, digital preservation, and more.
`abx-dl` provides a simpler one-shot CLI interface to the [ArchiveBox plugin ecosystem](https://archivebox.github.io/abx-plugins/).

<img width="1000" height="1082" alt="Screenshot 2026-03-11 at 6 53 03 AM" src="https://github.com/user-attachments/assets/4e19d985-1a93-4f65-9970-2565be16b718" />


---

<br/>

#### 🍜 What does it save?

```bash
abx-dl --plugins=wget,title,screenshot,pdf,readability,git 'https://example.com'
```

`abx-dl` runs all plugins by default (and auto installs dependencies). You can specify `--plugins=wget,favicon,title` or filters like `--output=html,pdf,ico,text/` to limit plugin selection.
- HTML, JS, CSS, images, etc. rendered with a headless browser
- title, favicon, headers, outlinks, and other metadata
- audio, video, subtitles, playlists, comments
- snapshot of the page as a PDF, screenshot, and [Singlefile](https://github.com/gildas-lormeau/single-file-cli) HTML
- article text, `git` source code
- [and much more](https://archivebox.github.io/abx-plugins/)...

<br/>

#### 🧩 How does it work?

`abx-dl` uses the **[Plugin Library](https://archivebox.github.io/abx-plugins/)** (shared with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox)) to run a collection of downloading and scraping tools.

Plugins are loaded from the installed [`abx-plugins`](https://pypi.org/project/abx-plugins/) package (or from `ABX_PLUGINS_DIR` if you override it) and execute in distinct phases:
1. **Install phase** runner reads plugins `config.json`: `required_binaries` and emits `BinaryRequestEvent`s for `abxpkg.binary_service.BinaryService`, which resolves or installs binaries using built-in providers such as env, pip, npm, brew, apt, cargo, and browser-specific providers. `BinaryCacheService` and the `abx-dl` cache backend then project resolved state into `derived.env`.
2. **CrawlSetup hooks** (`on_CrawlSetup__*`) launch/configure expensive crawl-scoped processes like chrome, or trigger side effects. they emit no stdout JSONL records.
4. **Snapshot hooks** (`on_Snapshot__*`) run per URL to extract content and emit only `ArchiveResult`, `Snapshot`, and `Tag` records


<br/>

#### ⚙️ Configuration

Configuration is handled via environment variables plus a user config file (`~/.config/abx/config.env`). Runtime-derived cache entries such as resolved binary paths are stored separately in `~/.config/abx/derived.env`:

```bash
abx-dl config                        # show all config (global + per-plugin)
abx-dl config --get WGET_TIMEOUT     # get a specific value
abx-dl config --set TIMEOUT=120      # set persistently (resolves aliases)
```

Output is grouped by section:
```bash
# GLOBAL
TIMEOUT=60
USER_AGENT="Mozilla/5.0 ..."
...

# plugins/wget
WGET_BINARY="wget"
WGET_TIMEOUT=60
...

# plugins/chrome
CHROME_BINARY="chromium"
...
```

Common options:
- `TIMEOUT=60` - default timeout for hooks
- `USER_AGENT` - default user agent string
- `{PLUGIN}_BINARY` - path or name of the binary to use (e.g. `WGET_BINARY=wget` or `CHROME_BINARY=/usr/bin/chromium`)
- `{PLUGIN}_ENABLED=True/False` - enable/disable specific plugins
- `{PLUGIN}_TIMEOUT=120` - per-plugin timeout overrides

Aliases are automatically resolved (e.g. `--set USE_WGET=false` saves as `WGET_ENABLED=false`).

One-off config is easy via env vars or CLI args:

```bash
env \
  CHROME_BINARY=/usr/bin/chromium \
  TIMEOUT=120 \
  abx-dl 'https://example.com'

# CLI args
abx-dl \
  --dir=/tmp/test \
  --plugins=wget,title,favicon \
  --output=html,txt,css,js,ico,png,pdf \
  --timeout=90 \
  'https://example.com'
```

<br/>

---

<br/>

### 📦 Install

```bash
pip install abx-dl
abx-dl 'https://example.com'

# Or run the published CLI without installing it globally
uvx abx-dl 'https://example.com'

# Check / Install any missing plugin dependencies
abx-dl version
abx-dl install
```

<br/>

### 🔠 Usage

```bash
# Default command - a bare URL archives with all enabled plugins:
abx-dl 'https://example.com'

# Select plugins by output type (mimetypes, categories, or file extensions):
abx-dl --output=html,pdf,video/ 'https://example.com'
abx-dl -o text -o image -o mp4 'https://example.com'

# Limit work to a subset of plugins by name:
abx-dl --plugins=wget,title,screenshot,pdf 'https://example.com'

# Skip auto-installing missing dependencies (emit warnings instead):
abx-dl --no-install 'https://example.com'

# Specify output directory (default is current working dir):
abx-dl --dir=./downloads 'https://example.com'

# Set timeout:
abx-dl --timeout=120 'https://example.com'
```

#### Commands

```bash
abx-dl <url>                              # Download URL (default shorthand)
abx-dl plugins                            # Check + show info for all plugins
abx-dl plugins wget ytdlp git             # Check + show info for specific plugins
abx-dl install wget ytdlp git             # Pre-install plugin dependencies
abx-dl config                             # Show all config values
abx-dl config --get TIMEOUT               # Get a specific config value
abx-dl config --set TIMEOUT=120           # Set a config value persistently
```

#### Installing Dependencies

Many plugins require external binaries (e.g., `wget`, `chrome`, `yt-dlp`, `single-file`).

By default, `abx-dl` lazily installs missing dependencies as needed when you download a URL.
Use `--no-install` to skip plugins with missing dependencies instead. `install` runs only the pre-run dependency pipeline (`required_binaries` → `BinaryRequestEvent` → `BinaryEvent`) without starting crawl setup or snapshot extraction:

```bash
# Auto-installs missing deps on-the-fly (default behavior)
abx-dl 'https://example.com'

# Skip plugins with missing deps, emit warnings instead
abx-dl --no-install 'https://example.com'

# Install dependencies for specific plugins only
abx-dl install wget singlefile ytdlp

# Check which dependencies are available/missing
abx-dl plugins
```

Successful preflight installs are cached for 24 hours in `derived.env` under `ABX_INSTALL_CACHE`, keyed by binary name. If a binary was installed successfully recently, `abx-dl` skips re-running the install preflight for that binary. Cached abspaths are still validated at use time, and stale cache entries fall back to `abxpkg` provider resolution.

The normal runtime flow after dependency preflight is:
- `CrawlEvent` (internal lifecycle root)
- `CrawlSetupEvent` → plugin `on_CrawlSetup__*` hooks
- `CrawlStartEvent` → `SnapshotEvent`
- `SnapshotEvent` → plugin `on_Snapshot__*` hooks
- `SnapshotCleanupEvent` / `CrawlCleanupEvent`

Hook output contract:
- binary preflight is driven by plugin `required_binaries` and handled by `abxpkg`, not by plugin hooks
- `on_CrawlSetup__*` hooks emit no stdout JSONL records
- `on_Snapshot__*` hooks emit only `ArchiveResult`, `Snapshot`, and `Tag`
- the TUI and services consume structured events derived from those hook records

Dependencies are installed to `~/.config/abx/lib/{arch}/` using the appropriate package manager:
- **pip packages** → `~/.config/abx/lib/{arch}/pip/venv/`
- **npm packages** → `~/.config/abx/lib/{arch}/npm/`
- **brew/apt packages** → system locations

You can override the install location with `LIB_DIR=/path/to/lib abx-dl install wget`.

<br/>

---

<br/>

### Output Structure

By default, `abx-dl` writes results into the current working directory. Each run creates an `index.jsonl` manifest plus one subdirectory per plugin that produced output. If you want to keep runs isolated, `cd` into a scratch directory first or pass `--dir=/path/to/run`.

```bash
mkdir -p /tmp/abx-run && cd /tmp/abx-run
uvx --from abx-dl abx-dl --plugins=title,wget 'https://example.com'
```

```
./
├── index.jsonl             # Snapshot metadata and results (JSONL format)
├── title/
│   └── title.txt
├── favicon/
│   └── favicon.ico
├── screenshot/
│   └── screenshot.png
├── pdf/
│   └── output.pdf
├── dom/
│   └── output.html
├── wget/
│   └── example.com/
│       └── index.html
├── singlefile/
│   └── output.html
└── ...
```

<br/>

### All Outputs

- `index.jsonl` - snapshot metadata and plugin results (JSONL format, ArchiveBox-compatible)
- `title/title.txt` - page title
- `favicon/favicon.ico` - site favicon
- `screenshot/screenshot.png` - full page screenshot (Chrome)
- `pdf/output.pdf` - page as PDF (Chrome)
- `dom/output.html` - rendered DOM (Chrome)
- `wget/example.com/...` - mirrored site files
- `singlefile/output.html` - single-file HTML snapshot
- ... and more via plugin library ...

---

### Available Plugins

See the [`abx-plugins` marketplace](https://github.com/ArchiveBox/abx-plugins).

#### Snapshot / Extraction Plugins

- `ytdlp` - downloads media plus sidecars: audio, video, images/thumbnails, subtitles (`.srt`, `.vtt`), JSON metadata, and text descriptions.
- `gallerydl` - downloads gallery/media sets as images, videos, JSON sidecars, text sidecars, and ZIP archives.
- `forumdl` - exports forum/thread archives as JSONL, WARC, and mailbox-style message archives.
- `git` - clones repository contents including text, binaries, images, audio, video, fonts, and other tracked files.
- `wget` - mirrors pages and requisites as HTML, WARC, images, CSS, JavaScript, fonts, audio, and video.
- `archivedotorg` - saves a Wayback Machine archive link as plain text.
- `favicon` - saves site favicons and touch icons as image files.
- `modalcloser` - setup helper only; no direct archive files.
- `consolelog` - saves browser console events as JSONL.
- `dns` - saves observed DNS activity as JSONL.
- `ssl` - saves TLS certificate/connection metadata as JSONL.
- `responses` - saves HTTP response metadata as JSONL and can record referenced text, images, audio, video, apps, and fonts.
- `redirects` - saves redirect chains as JSONL.
- `staticfile` - saves non-HTML direct file responses such as PDF, EPUB, images, audio, video, JSON, XML, CSV, ZIP, and generic binary files.
- `headers` - saves main-document HTTP headers as JSON.
- `chrome` - manages shared browser state and emits plain-text and JSON runtime metadata.
- `seo` - saves SEO metadata such as meta tags and Open Graph fields as JSON.
- `accessibility` - saves the browser accessibility tree as JSON.
- `infiniscroll` - page-expansion helper only; no direct archive files.
- `claudechrome` - saves Claude-computer-use interaction results as JSON plus PNG screenshots.
- `singlefile` - saves a full self-contained page snapshot as HTML.
- `screenshot` - saves rendered page screenshots as PNG.
- `pdf` - saves rendered pages as PDF.
- `dom` - saves fully rendered DOM output as HTML.
- `title` - saves the final page title as plain text.
- `readability` - extracts article HTML, plain text, and JSON metadata.
- `defuddle` - extracts cleaned article HTML, plain text, and JSON metadata.
- `mercury` - extracts article HTML, plain text, and JSON metadata.
- `claudecodeextract` - generates cleaned Markdown from other extractor outputs.
- `htmltotext` - converts archived HTML into plain text.
- `trafilatura` - extracts article content as plain text, Markdown, HTML, CSV, JSON, and XML/TEI.
- `papersdl` - downloads academic papers as PDF.
- `parse_html_urls` - emits discovered links from HTML as JSONL records.
- `parse_txt_urls` - emits discovered links from text files as JSONL records.
- `parse_rss_urls` - emits discovered feed entry URLs from RSS/Atom as JSONL records.
- `parse_netscape_urls` - emits discovered bookmark URLs from Netscape bookmark exports as JSONL records.
- `parse_jsonl_urls` - emits discovered bookmark URLs from JSONL exports as JSONL records.
- `parse_dom_outlinks` - emits crawlable rendered-DOM outlinks as JSONL records.
- `search_backend_sqlite` - writes a searchable SQLite FTS index database.
- `search_backend_sonic` - pushes content into Sonic search; no local archive files declared.
- `claudecodecleanup` - writes cleanup/deduplication results as plain text.
- `hashes` - writes file hash manifests as JSON.
- and more via the [`abx-plugins` marketplace](https://github.com/ArchiveBox/abx-plugins)...

---

### AI Skill

This repo includes an `abx-dl` skill for coding agents that need to run the standalone ArchiveBox extractor pipeline without a full ArchiveBox install.

- Skill source: [`skills/abx-dl/SKILL.md`](./skills/abx-dl/SKILL.md)
- skills.sh page: https://skills.sh/archivebox/abx-dl/abx-dl

---

### Architecture

`abx-dl` is built on these components:

- **`abx_dl/plugins.py`** - Plugin discovery from `abx-plugins` or `ABX_PLUGINS_DIR`
- **`abx_dl/executor.py`** - Hook execution engine with config propagation
- **`abx_dl/config.py`** - Environment variable configuration
- **`abx_dl/cli.py`** - Rich CLI with live progress display

### Related Projects

- `abxbus` https://abxbus.archivebox.io https://github.com/archiveBox/abxbus
- `abxpkg` https://abxpkg.archivebox.io https://github.com/archiveBox/abxpkg
- `abx-plugins` https://abx-plugins.archivebox.io https://github.com/ArchiveBox/abx-plugins
- `archivebox` https://archivebox.io https://github.com/ArchiveBox/ArchiveBox
- And lots more...
  - https://github.com/stars/pirate/lists/internet-archiving
  - https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community

---

For more advanced use with collections, parallel downloading, a Web UI + REST API, etc.
See: [`ArchiveBox/ArchiveBox`](https://github.com/ArchiveBox/ArchiveBox)
