Part 1: Foundations — The Mental Model
You have probably used browser “Reader View” extensions, read-it-later apps like Pocket, or web clippers. Under the hood, many of them rely on one long-standing, legendary library: Mozilla Readability.js.
While Readability is great, the modern web has evolved. Pages now have complex math formulas (MathJax/KaTeX), elaborate code blocks with syntax highlighting, nested footnotes, and JavaScript-rendered content (like X/Twitter or ChatGPT chats). Readability often strips these or outputs messy HTML that is terrible for converting into Markdown.
Enter Defuddle by kepano (the creator of the Obsidian Web Clipper).
Mental Model: Think of Defuddle as Readability 2.0 specialized for Markdown enthusiasts. It is a content extractor that takes a cluttered web page, isolates the main article, and standardizes complex elements (like math, code, and footnotes) into clean, semantic HTML so that downstream HTML-to-Markdown converters (like Turndown) produce clean, predictable output.
Part 2: The Investigation — Architecture Deep Dive
Defuddle is written in TypeScript and designed to run the same way in the browser, in Node.js, and from a CLI.
The Pipeline Architecture
When you pass a document to Defuddle, it runs the HTML through a multi-stage pipeline:
- Extraction (Site-Specific or Heuristic)
- Standardization (Elements Pipeline)
- Scoring & Cleanup
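One way to picture the three stages is as plain functions applied in order. The sketch below is only a mental aid: `Page`, `Stage`, and `runPipeline` are hypothetical names, and the stage bodies are stand-ins rather than Defuddle's actual internals.

```typescript
// Hypothetical, simplified page model; the real library works on a DOM document.
interface Page {
  url: string;
  blocks: string[]; // candidate content blocks
  log: string[];    // names of the stages that have run
}

type Stage = (page: Page) => Page;

// Stage 1: choose the content (site-specific or heuristic in the real library).
const extract: Stage = (page) => ({
  ...page,
  log: [...page.log, "extraction"],
});

// Stage 2: normalize each block (here: just trim whitespace).
const standardize: Stage = (page) => ({
  ...page,
  blocks: page.blocks.map((b) => b.trim()),
  log: [...page.log, "standardization"],
});

// Stage 3: drop blocks that turned out to be empty.
const cleanup: Stage = (page) => ({
  ...page,
  blocks: page.blocks.filter((b) => b.length > 0),
  log: [...page.log, "scoring-and-cleanup"],
});

// Feed the output of each stage into the next one.
function runPipeline(page: Page, stages: Stage[]): Page {
  return stages.reduce((acc, stage) => stage(acc), page);
}

const demo = runPipeline(
  { url: "https://example.com/post", blocks: ["  Hello  ", "   ", "World"], log: [] },
  [extract, standardize, cleanup],
);
```

Keeping each stage a pure function of the page makes the order explicit and lets a stage be swapped or tested on its own.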
1. The Extractor Registry
Mozilla Readability applies one giant set of heuristic rules to every site. Defuddle, however, maintains an Extractor Registry (src/extractors/).
If you are parsing a generic blog, it uses heuristics. But if you are parsing a known site, it uses a dedicated extractor. Defuddle ships with built-in extractors for:
- AI Chats: chatgpt.ts, claude.ts, gemini.ts, grok.ts
- Social & Forums: reddit.ts, hackernews.ts, x-article.ts, twitter.ts
- Media & Code: github.ts, youtube.ts
These extractors know exactly where the payload is on those specific DOM structures, avoiding the need to guess. There is even a useAsync option that falls back to third-party APIs (like FxTwitter) if the local HTML is a blank SPA frame.
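The registry idea can be sketched as a list of URL patterns checked in order, with a heuristic fallback when nothing matches. Everything here is an illustration under assumptions: the `ExtractorEntry` type, the URL patterns, and the placeholder `extract` bodies are hypothetical and are not taken from Defuddle's source.

```typescript
// Hypothetical types: this mirrors the idea of a registry, not Defuddle's real API.
interface ExtractorEntry {
  name: string;
  patterns: RegExp[];                // URL patterns this extractor handles
  extract: (html: string) => string; // returns the isolated content
}

// Generic fallback standing in for heuristic scoring.
const heuristicExtractor: ExtractorEntry = {
  name: "heuristic",
  patterns: [],
  extract: (html) => html,
};

// Assumed patterns for two of the sites named above.
const registry: ExtractorEntry[] = [
  {
    name: "hackernews",
    patterns: [/^https?:\/\/news\.ycombinator\.com\/item\?id=\d+/],
    extract: (html) => html, // placeholder: a real extractor reads known selectors
  },
  {
    name: "github",
    patterns: [/^https?:\/\/github\.com\/[^/]+\/[^/]+\/(issues|pull)\/\d+/],
    extract: (html) => html, // placeholder
  },
];

// Return the first extractor whose pattern matches the URL, else the fallback.
function findExtractor(url: string): ExtractorEntry {
  for (const entry of registry) {
    if (entry.patterns.some((p) => p.test(url))) return entry;
  }
  return heuristicExtractor;
}
```

For example, `findExtractor("https://example.com/blog/post")` falls through to the heuristic entry, while a Hacker News item URL resolves to the dedicated one.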
2. The Standardization Pipeline
The magic of Defuddle happens in src/elements/. Once the raw content is isolated, Defuddle standardizes it so Markdown converters won’t choke:
- Code Blocks (code.ts): Checks for pre > code. It strips out line numbers and arbitrary syntax highlighting spans, leaving only semantic <code data-lang="js" class="language-js"> tokens.
- Math (math.ts): Detects MathJax, KaTeX, and MathML. It converts them into standardized <math data-latex="..."> elements (using libraries like mathml-to-latex and temml in the “full” bundle).
- Footnotes (footnotes.ts): Detects various footnote reference patterns (superscripts, brackets) and rewrites them into a standard ordered list at the bottom of the DOM, ensuring Markdown converters create strict [^1] syntax.
- Headings (headings.ts): Demotes H1s to H2s, removes anchor links inside headings, and drops the first heading if it perfectly matches the <title>.
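To make the heading rules concrete, here is a minimal sketch that applies them to a plain array instead of DOM nodes. The `Heading` type, the function names, and the treatment of a trailing "#" or "¶" as anchor-link residue are assumptions made for illustration; the real headings.ts operates on the document tree.

```typescript
// Simplified heading model; the real library works on DOM nodes.
interface Heading {
  level: number; // 1 for H1, 2 for H2, ...
  text: string;
}

// Strip a trailing permalink marker such as "#" or "¶" (assumed anchor residue).
function stripAnchor(text: string): string {
  return text.replace(/\s*[#¶]\s*$/, "").trim();
}

function standardizeHeadings(headings: Heading[], pageTitle: string): Heading[] {
  const cleaned = headings.map((h) => ({ level: h.level, text: stripAnchor(h.text) }));

  // Drop the first heading when it exactly repeats the page title.
  const first = cleaned[0];
  const rest =
    first !== undefined && first.text === pageTitle.trim() ? cleaned.slice(1) : cleaned;

  // Demote any remaining H1 to H2.
  return rest.map((h) => (h.level === 1 ? { ...h, level: 2 } : h));
}

const standardized = standardizeHeadings(
  [
    { level: 1, text: "My Post #" },
    { level: 1, text: "Background" },
    { level: 3, text: "Details ¶" },
  ],
  "My Post",
);
// standardized: [{ level: 2, text: "Background" }, { level: 3, text: "Details" }]
```

The duplicate title is removed before demotion so that the comparison with the page title is not affected by the level change.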
3. Tree Shaking Bundles
Defuddle compiles to three targets:
- defuddle/core: Tiny browser bundle. No external dependencies.
- defuddle/full: Browser bundle that includes heavy MathML/LaTeX parsing libraries.
- defuddle/node: Optimized for backend scraping using JSDOM.
Part 3: The Diagnosis — What It Does for Developers
For developers building scrapers, AI ingestion pipelines, or productivity tools, Defuddle solves several long-standing headaches.
Problem 1: Ingesting AI Chat Logs
If you try to scrape a ChatGPT or Claude URL using standard Readability, the heavy DOM nesting confuses the heuristic scorer. Defuddle’s specific chatgpt.ts extractor identifies the user/assistant message bubbles and formats them cleanly, making it trivial to dump your chat history into an Obsidian vault or another LLM context window.
Problem 2: Preserving Code and Math
If you clip a technical blog post containing Python code and LaTeX math, standard extractors often lose the code block's language and boundaries and render math as garbled text. Because Defuddle runs its standardization pipeline first, piping its output to Turndown gives you a ```python fence for the code and $$a \neq 0$$ for the math.
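The sketch below shows why standardized markup is easy to convert: once code and math always arrive in one predictable shape, even a toy converter can handle them. It is a regex-based stand-in written for illustration, not how Turndown works (Turndown walks the DOM), and the `display="block"` attribute and all function names are assumptions.

```typescript
// Toy converter for two standardized shapes only. Real converters such as
// Turndown walk the DOM instead of using regular expressions.
const FENCE = "`".repeat(3);

// Decode the few entities that commonly appear inside code and attributes.
function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" is not decoded twice
}

function standardizedToMarkdown(html: string): string {
  return html
    // <pre><code data-lang="python" ...>...</code></pre> becomes a fenced block
    .replace(
      /<pre><code[^>]*\bdata-lang="([^"]*)"[^>]*>([\s\S]*?)<\/code><\/pre>/g,
      (_m: string, lang: string, body: string) =>
        `${FENCE}${lang}\n${decodeEntities(body).replace(/\n$/, "")}\n${FENCE}`,
    )
    // <math data-latex="..."> becomes $...$, or $$...$$ when display="block" (assumed)
    .replace(/<math\b([^>]*)>[\s\S]*?<\/math>/g, (m: string, attrs: string) => {
      const latex = /\bdata-latex="([^"]*)"/.exec(attrs);
      if (!latex) return m; // leave untouched if no LaTeX source is attached
      const delim = /\bdisplay="block"/.test(attrs) ? "$$" : "$";
      return `${delim}${decodeEntities(latex[1] ?? "")}${delim}`;
    });
}
```

Without the standardization step, the same converter would have to understand every highlighter's span soup and every math renderer's markup, which is where generic extractors tend to fall down.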
Example Output
When you run Defuddle, you don’t just get HTML; you get a rich metadata object.
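As a rough picture of that object, the sketch below models a subset of the result and turns it into a note with YAML front matter, for example for an Obsidian vault. The field names follow my reading of the project's README and should be treated as an assumption to check against the version you install; `ExtractResult` and `toNote` are hypothetical names.

```typescript
// Field names are an assumption based on the project's README; verify them
// against the version you install. This is a subset, not the full result.
interface ExtractResult {
  title: string;
  author: string;
  published: string;
  domain: string;
  wordCount: number;
  content: string; // cleaned HTML of the main article
}

// Build YAML front matter from the metadata, then append the body.
function toNote(result: ExtractResult): string {
  // JSON string literals are valid YAML double-quoted scalars.
  const quote = (v: string): string => JSON.stringify(v);
  const frontMatter = [
    "---",
    `title: ${quote(result.title)}`,
    `author: ${quote(result.author)}`,
    `published: ${quote(result.published)}`,
    `source: ${quote(result.domain)}`,
    `words: ${result.wordCount}`,
    "---",
  ].join("\n");
  return `${frontMatter}\n\n${result.content}\n`;
}
```

Quoting every string value through `JSON.stringify` keeps titles that contain colons or quotes from breaking the front matter.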
Part 4: The Resolution — How to Use Defuddle
You can integrate Defuddle into your stack in minutes.
For Python / CLI Developers (Quick Web Scraping)
If you just want to extract a page quickly from bash, Defuddle can also be run from the command line.
For Node.js (Backend Scrapers & AI Pipelines)
If you are writing a data ingestion pipeline or vector-database loader in Node.js, use the defuddle/node bundle and make sure your package.json has "type": "module".
For Browser Extensions (Frontend)
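If you are building a React app, Chrome extension, or web clipper, use one of the browser bundles: defuddle/core, or defuddle/full if you need the heavier math conversion.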
Final Mental Model
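Cluttered page → extraction (site-specific extractor or heuristics) → standardization (code, math, footnotes, headings) → scoring and cleanup → clean, semantic HTML → Markdown converter such as Turndown.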
GitHub: kepano/defuddle
Playground: Defuddle Playground
