Asset Grabber Playbook

Tools for pulling images / videos / sounds from anywhere on the web, optimizing them, and cataloging Justin's asset library.

What's installed

| Tool | Where | Purpose |
|---|---|---|
| Playwright (python) + Chromium | venv ~/.forge-venvs/assets/ | Headless browser for scraping + search |
| cwebp | apt (webp) | WebP encoding |
| avifenc | apt (libavif-bin) | AVIF encoding |
| yt-dlp | apt | Video downloads (YouTube, generic) |
| gallery-dl | pip | Gallery/profile bulk downloads (IG, Twitter, etc.) |
| Pillow | pip | Image resize / format conversion |
| ffmpeg, imagemagick, exiftool | apt (pre-existing) | Media handling + EXIF strip |

Venv interpreter: /home/justinwieb/.forge-venvs/assets/bin/python. The scripts resolve it from the FORGE_VENV environment variable and fall back to a built-in default when the variable is unset.
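
The lookup described above could be sketched roughly as follows. This is an illustration, not the scripts' actual code: it assumes FORGE_VENV names the venv root (not the interpreter), that the default is the ~/.forge-venvs/assets/ directory listed in the table, and the function name resolve_venv_python is hypothetical.

```python
import os
from pathlib import Path

# Assumed default: the venv root listed under "What's installed".
DEFAULT_VENV = Path.home() / ".forge-venvs" / "assets"


def resolve_venv_python(env=None):
    """Return the interpreter path inside the venv.

    Assumption: FORGE_VENV, when set and non-empty, names the venv root
    directory rather than the interpreter itself.
    """
    env = os.environ if env is None else env
    root = Path(env.get("FORGE_VENV") or DEFAULT_VENV).expanduser()
    return root / "bin" / "python"
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.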

Scripts (scripts/assets/)

| Script | What it does |
|---|---|
| forge_assets_grab.py <url> | Playwright-powered page scraper. Extracts <img>/<video>/<source> plus direct image-URL anchors. Downloads through the browser context so cookies and Referer are honored. Writes provenance.json. |
| forge_assets_search.py <query> | Image search via Google / Bing / DuckDuckGo (Playwright, no API key). Also supports the Unsplash / Pexels APIs if keys are in the environment. |
| forge_assets_optimize.py <path> | Writes WebP + AVIF copies at responsive widths (default 320/640/1280/1920), strips EXIF by default, and emits optimize.json. |
| forge_assets_catalog.py | Scans /mnt/workspace/Assets + forge/assets and emits forge/data/assets-catalog.json with dimensions, sha256, tags, and source URLs. |
| run | Thin bash wrapper (scripts/forge_assets_run.sh) that dispatches the grab, search, optimize, and catalog subcommands to the venv. |
| lib/provenance.py | Helper that writes provenance.json and merges across calls, so you can append to an existing grab dir. |
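
To make the extraction rule for forge_assets_grab.py concrete, here is a rough stdlib-only approximation. The real script uses Playwright against the rendered page; this sketch parses static HTML instead, and the names MediaLinkParser, extract_media_urls, and the IMAGE_EXTS list are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions that count as a "direct image URL".
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".avif", ".svg")


class MediaLinkParser(HTMLParser):
    """Collect absolute URLs from <img>, <video>, <source>, and image-URL <a> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "video", "source"):
            candidate = attrs.get("src")
        elif tag == "a":
            candidate = attrs.get("href")
            # Keep anchors only when they point straight at an image file.
            if candidate and not urlparse(candidate).path.lower().endswith(IMAGE_EXTS):
                candidate = None
        else:
            candidate = None
        if candidate:
            absolute = urljoin(self.page_url, candidate)
            if absolute not in self.urls:  # drop duplicates, keep first-seen order
                self.urls.append(absolute)


def extract_media_urls(html, page_url):
    parser = MediaLinkParser(page_url)
    parser.feed(html)
    return parser.urls
```

Resolving every candidate against the page URL means relative paths and absolute CDN links end up in the same form before download.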

Provenance (not a license gate, just a trace)

Every script that downloads files writes a provenance.json in the output dir:

{
  "generated_at": "2026-04-22T01:30:40-0500",
  "items": [
    {
      "filename": "000_downtown-austin.jpg",
      "source_url": "https://upload.wikimedia.org/.../Downtown_Austin.jpg",
      "page_url": "https://en.wikipedia.org/wiki/Austin,_Texas",
      "engine": "grab",
      "query": "",
      "referer": "https://en.wikipedia.org/wiki/Austin,_Texas",
      "content_type": "image/jpeg",
      "bytes": 103514,
      "sha256": "…",
      "saved_at": "2026-04-22T01:30:40-0500",
      "extra": {"alt": "…", "kind": "image"}
    }
  ]
}

This is purely an organizational record of where each file came from and when. The scripts perform no license check and no gating, so nothing is blocked from being pulled.
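
A minimal sketch of how a helper like lib/provenance.py might merge new items into an existing provenance.json. The function name merge_provenance and the rule that a repeated filename replaces the earlier entry are assumptions; the actual helper may deduplicate differently or not at all.

```python
import json
from datetime import datetime
from pathlib import Path


def merge_provenance(out_dir, new_items):
    """Append items to out_dir/provenance.json, creating the file if needed.

    Assumption: an item whose filename already exists replaces the old entry.
    """
    path = Path(out_dir) / "provenance.json"
    existing = []
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8"))["items"]
    by_name = {item["filename"]: item for item in existing}
    for item in new_items:
        by_name[item["filename"]] = item
    doc = {
        # Local time with a numeric UTC offset, like the sample record above.
        "generated_at": datetime.now().astimezone().strftime("%Y-%m-%dT%H:%M:%S%z"),
        "items": list(by_name.values()),
    }
    path.write_text(json.dumps(doc, indent=2, ensure_ascii=False), encoding="utf-8")
    return doc
```

Keying on filename keeps a re-run of the same grab from duplicating entries while leaving earlier items in place.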

Where assets land by default

| Source | Default path |
|---|---|
| forge_assets_grab.py <url> | /mnt/workspace/Assets/Web-Grabs/<date>_<host>/ |
| forge_assets_search.py <query> | /mnt/workspace/Assets/Web-Grabs/<date>_<query>/ |
| Brand assets (manual) | /mnt/workspace/<Brand>/Brand-Assets/ (existing convention) |
| Optimized outputs for a specific page | forge/sites/<site>/<page>/assets/ (deploy-ready) |

Override with --out DIR on any script.
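
The default directory names could be derived along these lines. This is a sketch under assumptions: the date is taken to be ISO formatted (YYYY-MM-DD), queries are assumed to be slugified with hyphens, and slugify and default_out_dir are hypothetical names rather than functions from the scripts.

```python
import re
from datetime import date
from pathlib import Path
from urllib.parse import urlparse

GRAB_ROOT = Path("/mnt/workspace/Assets/Web-Grabs")


def slugify(text):
    """Lowercase and collapse anything outside [a-z0-9.] into single hyphens (assumed rule)."""
    return re.sub(r"[^a-z0-9.]+", "-", text.lower()).strip("-")


def default_out_dir(source, today=None):
    """Build <date>_<host> for a URL, or <date>_<query> for a search string."""
    today = today or date.today()
    host = urlparse(source).hostname if "://" in source else None
    label = slugify(host or source)
    return GRAB_ROOT / f"{today.isoformat()}_{label}"
```

For example, default_out_dir("neon retro grid", date(2026, 4, 22)) yields a directory named 2026-04-22_neon-retro-grid under the Web-Grabs root.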

Common recipes

Grab images from any URL

scripts/forge_assets_run.sh grab "https://example.com/article" --max 30 --min-width 600

Search for images by query

scripts/forge_assets_run.sh search "neon retro grid" --engine google --count 15 \
  --out /mnt/workspace/Assets/Web-Grabs/2026-04-22_neon

Unsplash (clean API if key set)

export UNSPLASH_ACCESS_KEY=...
scripts/forge_assets_run.sh search "coffee beans" --engine unsplash --count 8
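
For reference, a request to Unsplash's public photo search endpoint can be built as in the sketch below. It only constructs the request and does not send it; how forge_assets_search.py actually calls the API is not shown in this playbook, and build_unsplash_search is a hypothetical name.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request


def build_unsplash_search(query, count, access_key=None):
    """Return an unsent Request for Unsplash's /search/photos endpoint."""
    key = access_key or os.environ.get("UNSPLASH_ACCESS_KEY")
    if not key:
        raise RuntimeError("UNSPLASH_ACCESS_KEY is not set")
    params = urlencode({"query": query, "per_page": count})
    return Request(
        f"https://api.unsplash.com/search/photos?{params}",
        # Unsplash authenticates public requests with a Client-ID header.
        headers={"Authorization": f"Client-ID {key}"},
    )
```

Passing the result to urllib.request.urlopen would perform the search and count against the hourly quota.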

Optimize a directory of grabs into deploy-ready assets

scripts/forge_assets_run.sh optimize /mnt/workspace/Assets/Web-Grabs/2026-04-22_neon \
  --out sites/justinsforge.com/neon/assets --widths 640 1280 1920
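
One way to picture what the optimize step produces is a plan of output files per source image. The sketch below is illustrative only: skipping widths larger than the source (no upscaling) and the <stem>-<width>.<format> naming are assumptions, and plan_variants is a hypothetical helper, not part of forge_assets_optimize.py.

```python
from pathlib import PurePosixPath

DEFAULT_WIDTHS = (320, 640, 1280, 1920)


def plan_variants(src_name, src_width, widths=DEFAULT_WIDTHS, formats=("webp", "avif")):
    """List (width, filename) pairs to generate for one source image.

    Assumptions: widths larger than the source are skipped (no upscaling),
    and outputs are named <stem>-<width>.<format>.
    """
    stem = PurePosixPath(src_name).stem
    targets = [w for w in sorted(widths) if w <= src_width]
    if not targets:
        # Source is narrower than every breakpoint: keep it at native width.
        targets = [src_width]
    return [(w, f"{stem}-{w}.{fmt}") for w in targets for fmt in formats]
```

With --widths 640 1280 1920 and a 1500-pixel-wide source, this plan would cover 640 and 1280 in both formats and leave out 1920.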

Bulk social-media grabs

~/.forge-venvs/assets/bin/gallery-dl "https://www.instagram.com/some_account/"
# output goes to ./gallery-dl/ by default; configure via ~/.config/gallery-dl/config.json

Video downloads

yt-dlp "https://youtube.com/watch?v=..." -o "/mnt/workspace/Assets/Video/%(title)s.%(ext)s"

Catalog

scripts/forge_assets_run.sh catalog
# → /home/justinwieb/forge/data/assets-catalog.json
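
The core of a catalog pass is walking the asset tree and hashing each file. The following is a simplified sketch, not the catalog script itself: it records only path, size, and sha256 (dimensions, tags, and source URLs are left out), and MEDIA_EXTS, sha256_of, and scan_assets are assumed names.

```python
import hashlib
from pathlib import Path

# Assumed set of extensions worth cataloging.
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".avif", ".mp4", ".webm", ".mp3", ".wav"}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_assets(root):
    """Return one entry per media file under root, sorted by path."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in MEDIA_EXTS:
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return entries
```

Because the sha256 is content-based, the same image grabbed twice under different filenames shows up with matching hashes, which makes duplicates easy to spot.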

Auth / cookies

For authenticated fetches (Shopify, Adobe Stock, wherever Justin has a login), drop a persistent Chromium profile at ~/.forge-venvs/assets/browser-profile/ and reuse it:

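from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    ctx = p.chromium.launch_persistent_context(
        "/home/justinwieb/.forge-venvs/assets/browser-profile/",
        headless=False,    # first time: log in manually over VNC/X forwarding
    )
    # Keep the window open until the manual login is finished.
    input("Log in in the browser window, then press Enter to save the session... ")
    ctx.close()            # closing the context persists cookies to the profile dir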

After the first login, run headless; the profile keeps the cookies. Never commit the profile directory to git (it already lives outside the repo).

Useful API keys (optional)

All keys are optional; the scripts work without them via Playwright scraping.

| Env var | Service | Notes |
|---|---|---|
| UNSPLASH_ACCESS_KEY | Unsplash | Free tier: 50 req/hr |
| PEXELS_API_KEY | Pexels | Free tier: ~200 req/hr |

Security posture

  • No browser service binds to the public internet. Everything runs local/CLI.
  • Session cookies live in ~/.forge-venvs/assets/browser-profile/ if you set one up, outside the git repo.
  • All downloads go through the browser's request context with the originating page as Referer.
  • Default rate limit: 250ms between downloads inside a single grab run.
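
The last two points (Referer on every download, 250ms between downloads) amount to a paced loop like the sketch below. It is an illustration under assumptions: fetch is a hypothetical callable standing in for the browser context's request method, and download_all is not a function from the scripts.

```python
import time


def download_all(urls, fetch, page_url, delay=0.25, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing `delay` seconds between calls.

    `fetch` is a hypothetical stand-in for the browser context's request
    method; it is injected so the pacing logic can be exercised offline.
    """
    results = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # pause between downloads, not before the first one
        results.append(fetch(url, {"Referer": page_url}))
    return results
```

Injecting sleep as a parameter lets the pacing be checked without actually waiting.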