
I Built a Tool to Migrate 500+ Images to WebP in One Hour

My Lighthouse score was crying because of heavy images. So I wrote a Python ETL to bulk-convert everything to WebP and update all the URLs automatically. Here's how.

I was staring at my Lighthouse report like it owed me money.

Performance: 62.

The culprit? Images. Hundreds of them scattered across markdown files, hosted on Flickr, Imgur, GitHub… all in glorious, unoptimized JPEG and PNG formats.

The manual fix would be:

  1. Download each image
  2. Convert to WebP
  3. Upload to my CDN
  4. Find and replace every URL in every markdown file

For 500 unique images? That’s not a weekend project. That’s a prison sentence.

So I did what any lazy engineer would do: I automated it.

The Problem

My blog uses Hugo with markdown files. Images are referenced everywhere:

# In frontmatter
image: "https://live.staticflickr.com/65535/54519397357_403fc67f4a_k. jpg"

# In gallery shortcodes
< gallery id="example_id">
- https://live.staticflickr.com/65535/54525108024_adbff3cc9b_k. jpg
- https://live.staticflickr.com/65535/54520449879_784f0f24ca_k. jpg
< /gallery >

# Standard markdown
![My photo](https://i.imgur.com/sXyG3GX. jpeg)

Each image had to be:

  • Downloaded from the original source
  • Converted to WebP (smaller, faster)
  • Uploaded to my new CDN
  • URL replaced in the markdown file

Multiply by 500. No thanks.

The Solution: An ETL Pipeline

I built bulk-webp-url-replacer—a Python tool that does exactly what it says:

python -m bulk_webp_url_replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp_images \
  --new-url-prefix "https://cdn.example.com/images" \
  --threads 8

What it does:

  1. Extract — Scans all .md files for image URLs (frontmatter, galleries, inline)
  2. Transform — Downloads each image and converts to WebP
  3. Load — Replaces all old URLs with new CDN paths

One command. 500 images. Done.
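
Under the hood, it's just those three phases chained together. Here's a rough sketch of the flow with illustrative function names (the real internals may be organized differently):

def run_migration(scan_dir, output_dir, new_url_prefix, threads=8):
    urls = extract_urls(scan_dir)                         # Extract: collect every image URL from .md files
    mapping = convert_all(urls, output_dir, threads)      # Transform: download each URL and encode as WebP
    rewrite_markdown(scan_dir, mapping, new_url_prefix)   # Load: swap old URLs for the new CDN paths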

The Technical Bits

Regex Patterns for URL Extraction

Markdown has multiple ways to embed images. My extractor handles them all:

import re

PATTERNS = [
    # YAML frontmatter: image: "https://..."
    re.compile(r'^image:\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    # TOML frontmatter: image = "https://..."
    re.compile(r'^image\s*=\s*["\']?(https?://[^"\'>\s]+)["\']?\s*$'),
    # Gallery shortcodes: - https://...
    re.compile(r'^\s*-\s+(https?://[^\s]+\.(jpg|jpeg|png|gif|webp))\s*$'),
    # Standard markdown: ![alt](https://...)
    re.compile(r'!\[[^\]]*\]\((https?://[^)]+)\)'),
]
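
Applying them is just a line-by-line scan over every markdown file. A minimal sketch, assuming all content lives under one directory (extract_urls is my illustrative helper, not necessarily the tool's actual API):

from pathlib import Path

def extract_urls(scan_dir: str) -> set:
    """Collect every image URL that matches one of the PATTERNS above."""
    urls = set()
    for md_file in Path(scan_dir).rglob("*.md"):
        for line in md_file.read_text(encoding="utf-8").splitlines():
            for pattern in PATTERNS:
                match = pattern.search(line)
                if match:
                    urls.add(match.group(1))
    return urls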

Parallel Downloads

Downloading 500 images sequentially? Slow. With ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(process_url, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()  # process results as they complete

Downloads are network-bound, so 8 threads meant close to 8x faster. Simple math.
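
The process_url worker each thread runs isn't shown above; here's roughly what it might look like, using Pillow for the WebP encoding (the body is my illustration, not the tool's exact code):

import hashlib
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image

def process_url(url: str, output_dir: str = "./webp_images") -> str:
    """Download one image and re-encode it as WebP."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Derive a stable filename from the original URL
    stem = Path(url.split("?")[0]).stem or hashlib.md5(url.encode()).hexdigest()
    out_path = Path(output_dir) / f"{stem}.webp"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    image = Image.open(BytesIO(response.content))
    if image.mode == "P":
        image = image.convert("RGBA")  # WebP wants RGB/RGBA, not palette mode
    image.save(out_path, "WEBP", quality=85)
    return out_path.name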

Rate Limiting & Retries

Imgur wasn’t happy with my enthusiasm. HTTP 429 errors everywhere.

The fix: exponential backoff with browser-like headers.

import time

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
}

for attempt in range(max_retries):
    response = requests.get(url, headers=HEADERS, timeout=30)
    if response.status_code == 429:
        time.sleep(2 ** attempt)  # 1s, 2s, 4s...
        continue
    response.raise_for_status()
    break  # success, stop retrying

Smart Skipping

The tool saves a mapping.json after each run:

{
  "https://old-url.com/image.jpg": "new-filename.webp"
}

Next run? It skips already-processed images. Incremental migrations FTW.
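
The same mapping can drive the final URL rewrite. A sketch of both pieces, assuming the old-URL-to-new-filename structure shown above (the helper names are mine):

import json
from pathlib import Path

def load_mapping(path: str = "mapping.json") -> dict:
    """Return the old-URL -> new-filename mapping from a previous run, if any."""
    mapping_file = Path(path)
    return json.loads(mapping_file.read_text()) if mapping_file.exists() else {}

def rewrite_markdown(scan_dir: str, mapping: dict, new_url_prefix: str) -> None:
    """Replace every known old URL with its new CDN path, in place."""
    for md_file in Path(scan_dir).rglob("*.md"):
        text = md_file.read_text(encoding="utf-8")
        for old_url, new_name in mapping.items():
            text = text.replace(old_url, f"{new_url_prefix}/{new_name}")
        md_file.write_text(text, encoding="utf-8")

mapping = load_mapping()
pending = [url for url in urls if url not in mapping]  # skip what's already done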

The Results

Before:

  • 612 image references across 72 markdown files
  • Images scattered across Flickr, Imgur, GitHub
  • Lighthouse begging for mercy

After:

  • All images converted to WebP
  • Hosted on a single CDN
  • URLs automatically updated
  • One hour of work (mostly watching the progress bar)

Performance improvement:

  • Average image size: 60-80% smaller
  • Lighthouse Performance: 62 → 89

Lessons Learned

  1. Automation scales. What would take days manually took an hour to build and minutes to run.

  2. Rate limiting is real. Always add retries and backoff. Sites like Imgur will throttle you.

  3. Dry-run first. The --dry-run flag saved me from accidentally breaking 72 files.

  4. WebP is worth it. Same quality, fraction of the size. There’s no reason to serve JPEGs in 2026.

Try It Yourself

The tool is open source on GitHub.

# Preview what would change
bulk-webp-url-replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp \
  --dry-run

# Run for real
bulk-webp-url-replacer \
  --scan-dir ./content \
  --download-dir ./downloads \
  --output-dir ./webp \
  --new-url-prefix "https://your-cdn.com/images" \
  --threads 8

Your Lighthouse score will thank you. 🚀

Example Output

After running the migration tool, the URLs are automatically updated to point to the optimized WebP versions:

# In frontmatter
image: "https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54519397357_403fc67f4a_k.webp"

# In gallery shortcodes
<gallery id="example_id">
- https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54525108024_adbff3cc9b_k.webp
- https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/54520449879_784f0f24ca_k.webp
</gallery>

# Standard markdown
![My photo](https://raw.githubusercontent.com/HoangGeek/store/refs/heads/main/webp/sXyG3GX.webp)

Made with laziness and love 🦥
