Ghost Upload Content Manipulation Playbook

This document describes how content is transformed when syncing from the Quarto site to Ghost.

Processing Pipeline Overview

Quarto HTML → Extract → Pre-process → Parse → Transform → Clean → Upload

1. Content Extraction (extract_article_content)

  • Reads local HTML file from public/ghost-content/posts/{slug}/index.html
  • Extracts OpenGraph metadata (og:image, og:description, etc.)
  • Detects special content types before parsing:
    • Observable JS (has_observable_js)
    • Plotly charts (has_plotly)
    • Code annotations (has_code_annotations)
    • Code folds (has_code_folds)
    • LaTeX math (has_math)
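
The detection step can be sketched with plain substring checks on the raw HTML. This is a minimal illustration, not the tool's actual code: the ContentFlags struct, the detect function, and every marker string are assumptions about what Quarto emits, and the real detectors may use regexes instead.

```rust
/// Flags for special content found in a rendered Quarto page.
/// (Hypothetical struct, introduced only for this sketch.)
#[derive(Debug, Default, PartialEq)]
pub struct ContentFlags {
    pub has_plotly: bool,
    pub has_code_annotations: bool,
    pub has_code_folds: bool,
    pub has_math: bool,
}

// The marker strings below are assumptions about Quarto's HTML output.
pub fn has_plotly(content: &str) -> bool {
    content.contains("Plotly.newPlot") || content.contains("plotly-graph-div")
}

pub fn has_code_annotations(content: &str) -> bool {
    content.contains("data-code-annotations=")
}

pub fn has_code_folds(content: &str) -> bool {
    content.contains("<details class=\"code-fold\"")
}

pub fn has_math(content: &str) -> bool {
    content.contains("class=\"math inline\"") || content.contains("class=\"math display\"")
}

/// Run every detector against the raw HTML string, before any parsing.
pub fn detect(content: &str) -> ContentFlags {
    ContentFlags {
        has_plotly: has_plotly(content),
        has_code_annotations: has_code_annotations(content),
        has_code_folds: has_code_folds(content),
        has_math: has_math(content),
    }
}
```

Running detection on the raw string matters because later stages rewrite or drop the very markup these checks look for.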

2. Pre-Processing (Before HTML Parsing)

Order matters: these steps run on the raw HTML string, before it is parsed into a DOM.

| Function | Purpose | Why pre-parse? |
|---|---|---|
| embed_plotly_charts | Extract Plotly data and embed as HTML cards | Scripts would be lost during parsing |
| embed_ojs_cells | Wrap OJS placeholder divs | Preserve empty divs Ghost would strip |
| embed_videos | Wrap <video> tags, add autoplay/loop | Ghost strips video tags |
| embed_code_folds | Wrap <details class="code-fold"> | Ghost strips details/summary |
| embed_cell_outputs | Wrap <div class="cell-output"> | Preserve classes for CSS styling |

3. Post-Processing (After HTML Parsing)

  • Extract content using selectors: #quarto-content mainbody
  • style_callout_blocks - Convert Quarto callouts to Ghost’s kg-callout-card format
  • process_code_annotations - Transform annotated code into custom HTML with markers

4. Content Cleaning (clean_content)

| Transformation | Regex/Method | Purpose |
|---|---|---|
| Inline SVGs → Data URI | RE_SVG + convert_inline_svgs | Ghost strips inline SVG |
| Large SVGs → Placeholder | Size > 500KB | Upload separately to Ghost |
| Relative img URLs → Absolute | RE_IMG_SRC | Fix paths for Ghost |
| Relative asset hrefs → Absolute | RE_HREF | Fix asset links |
| Remove Quarto attributes | RE_QUARTO_ATTR ({#id}, {.class}) | Clean up Quarto syntax |
| Remove fig-alt markers | RE_FIG_ALT ({fig-alt="..."}) | Clean up |
| Remove empty paragraphs | RE_EMPTY_P (<p>\s*</p>) | Clean up |
| Ensure img alt text | ensure_image_alt_text | Accessibility |
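
Two of the cleaning steps above can be sketched without the regex crate. This is only an illustration: the function names are hypothetical, the scanners are hand-written stand-ins for RE_EMPTY_P and RE_IMG_SRC, and the URL rewrite assumes every relative path resolves against a single base URL (it does not collapse ../ segments and it touches any src="..." attribute, not only those on img tags).

```rust
/// Drop `<p>` elements that contain only whitespace (stand-in for RE_EMPTY_P).
pub fn remove_empty_paragraphs(content: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(pos) = rest.find("<p>") {
        let after_open = &rest[pos + 3..];
        match after_open.trim_start().strip_prefix("</p>") {
            // Only whitespace between the tags: skip the whole element.
            Some(tail) => {
                out.push_str(&rest[..pos]);
                rest = tail;
            }
            // Real content follows: keep the opening tag and move on.
            None => {
                out.push_str(&rest[..pos + 3]);
                rest = after_open;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Prefix relative `src="..."` values with `base_url` (stand-in for RE_IMG_SRC).
pub fn absolutize_src(content: &str, base_url: &str) -> String {
    const ATTR: &str = "src=\"";
    let base = base_url.trim_end_matches('/');
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(pos) = rest.find(ATTR) {
        let value_start = pos + ATTR.len();
        out.push_str(&rest[..value_start]);
        rest = &rest[value_start..];
        let is_absolute = ["http://", "https://", "data:", "//"]
            .iter()
            .any(|p| rest.starts_with(*p));
        if !is_absolute {
            out.push_str(base);
            out.push('/');
            rest = rest.trim_start_matches("./").trim_start_matches('/');
        }
    }
    out.push_str(rest);
    out
}
```

Already-absolute URLs and data: URIs pass through untouched, so the rewrite is safe to run after inline SVGs have been converted.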

Code Injection System

Per-post code is injected into codeinjection_head based on content detection:

| Condition | Injection Constant | Purpose |
|---|---|---|
| has_code_blocks | PRISM_CODE_INJECTION | Syntax highlighting + language detection |
| has_plotly | PLOTLY_CODE_INJECTION | Plotly.js CDN |
| has_annotations | ANNOTATION_CODE_INJECTION | CSS + click-to-highlight JS |
| has_code_folds | CODE_FOLD_CSS_INJECTION | Toggle styling + dark mode |
| has_math | MATHJAX_CODE_INJECTION | MathJax 3 for LaTeX rendering |
| has_ojs | Custom OJS runtime | Links to quarto-ojs runtime from source site |
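
One way to assemble codeinjection_head from these conditions is sketched below. The constant names come from the table; their contents here are short placeholders, and the InjectionFlags struct and build_codeinjection_head function are hypothetical. The OJS runtime is left out because it is built from the source site URL rather than a constant.

```rust
// Placeholder payloads: the real constants hold full <script>/<style> blocks.
const PRISM_CODE_INJECTION: &str = "<!-- prism -->";
const PLOTLY_CODE_INJECTION: &str = "<!-- plotly -->";
const ANNOTATION_CODE_INJECTION: &str = "<!-- annotations -->";
const CODE_FOLD_CSS_INJECTION: &str = "<!-- code-fold css -->";
const MATHJAX_CODE_INJECTION: &str = "<!-- mathjax -->";

/// Per-post detection results (hypothetical struct for this sketch).
#[derive(Debug, Default)]
pub struct InjectionFlags {
    pub has_code_blocks: bool,
    pub has_plotly: bool,
    pub has_annotations: bool,
    pub has_code_folds: bool,
    pub has_math: bool,
}

/// Concatenate only the snippets a post actually needs.
/// Returns None when nothing applies, so the field can be left unset.
pub fn build_codeinjection_head(flags: &InjectionFlags) -> Option<String> {
    let table = [
        (flags.has_code_blocks, PRISM_CODE_INJECTION),
        (flags.has_plotly, PLOTLY_CODE_INJECTION),
        (flags.has_annotations, ANNOTATION_CODE_INJECTION),
        (flags.has_code_folds, CODE_FOLD_CSS_INJECTION),
        (flags.has_math, MATHJAX_CODE_INJECTION),
    ];
    let parts: Vec<&str> = table
        .iter()
        .filter(|(enabled, _)| *enabled)
        .map(|(_, snippet)| *snippet)
        .collect();
    if parts.is_empty() {
        None
    } else {
        Some(parts.join("\n"))
    }
}
```

Keeping the mapping in one table means a new content type needs one extra row rather than another branch.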

Ghost HTML Card Pattern

Ghost strips many HTML elements. The workaround is to wrap such content in “HTML cards”, which Ghost keeps as raw HTML:

<!--kg-card-begin: html-->
<preserved-content>
<!--kg-card-end: html-->

Used for: videos, code-folds, cell-outputs, OJS cells, Plotly charts.
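
A minimal sketch of the wrapping step, assuming simple non-nested elements. The wrap_in_html_card and wrap_elements names are hypothetical, and the real embed_* functions likely use regexes and do more (embed_videos, for example, also adds autoplay/loop).

```rust
const CARD_BEGIN: &str = "<!--kg-card-begin: html-->";
const CARD_END: &str = "<!--kg-card-end: html-->";

/// Surround a fragment with the HTML card markers.
pub fn wrap_in_html_card(fragment: &str) -> String {
    format!("{}\n{}\n{}", CARD_BEGIN, fragment, CARD_END)
}

/// Wrap every `open ... close` span (e.g. "<video" ... "</video>") in its own card.
/// Nested elements of the same kind are not handled.
pub fn wrap_elements(content: &str, open: &str, close: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(start) = rest.find(open) {
        let end = match rest[start..].find(close) {
            Some(rel) => start + rel + close.len(),
            None => break, // unclosed element: leave the remainder untouched
        };
        out.push_str(&rest[..start]);
        out.push_str(&wrap_in_html_card(&rest[start..end]));
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}
```

For example, wrap_elements(html, "<video", "</video>") leaves surrounding paragraphs alone and cards only the video elements.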

Image Handling

Small SVGs (< 500KB)

  1. Optimize (optimize_svg): remove comments, metadata, reduce decimal precision
  2. Base64 encode
  3. Embed as <img src="data:image/svg+xml;base64,...">
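
Steps 2 and 3 can be sketched as follows. The encoder is hand-rolled only to keep the example dependency-free; the actual tool may well use a crate for this, and the function names are hypothetical. The optimize_svg step is assumed to have run already.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, padding with zeros.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Turn an (already optimized) inline SVG into an <img> tag with a data URI.
pub fn svg_to_img_tag(svg: &str) -> String {
    format!(
        "<img src=\"data:image/svg+xml;base64,{}\">",
        base64_encode(svg.as_bytes())
    )
}
```

Base64 output is roughly a third larger than its input, which is one reason to keep large SVGs out of the post body.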

Large SVGs (> 500KB)

  1. Insert placeholder: __LARGE_SVG_PLACEHOLDER_N__
  2. Upload to Ghost via API (upload_svg_to_ghost)
  3. Replace placeholder with Ghost image URL
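
The placeholder round trip might look like the sketch below. The SvgPlan enum and the function names are hypothetical, the upload itself (upload_svg_to_ghost) is not shown, and two details are assumptions: 500KB is taken as 500 * 1024 bytes, and N is taken to be a zero-based index.

```rust
/// Assumed threshold: 500KB interpreted as 500 * 1024 bytes.
const LARGE_SVG_THRESHOLD: usize = 500 * 1024;

#[derive(Debug, PartialEq)]
pub enum SvgPlan {
    /// Small enough to embed as a data URI.
    Inline,
    /// Too large: leave this placeholder in the HTML and upload separately.
    Upload { placeholder: String },
}

pub fn placeholder_for(index: usize) -> String {
    format!("__LARGE_SVG_PLACEHOLDER_{}__", index)
}

/// Decide how the `index`-th SVG in a post should be handled.
pub fn plan_svg(svg: &str, index: usize) -> SvgPlan {
    if svg.len() > LARGE_SVG_THRESHOLD {
        SvgPlan::Upload { placeholder: placeholder_for(index) }
    } else {
        SvgPlan::Inline
    }
}

/// After uploading, swap each placeholder for the URL Ghost returned.
/// `urls[i]` is the URL for placeholder `i`.
pub fn fill_placeholders(content: &str, urls: &[String]) -> String {
    urls.iter().enumerate().fold(content.to_string(), |acc, (i, url)| {
        acc.replace(&placeholder_for(i), url)
    })
}
```

The trailing double underscore keeps placeholder 1 from matching inside placeholder 10.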

Feature Images

Priority order:

  1. og:image from HTML
  2. media:content from RSS
  3. Local preview image (preview.png, banner.jpg, *-preview.*)
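
The fallback chain maps naturally onto Option combinators. In this sketch the function names and parameters are hypothetical, and the *-preview.* pattern is approximated by checking for the substring "-preview." in the file name.

```rust
/// Does a local file name look like a preview image?
fn is_preview_file(name: &str) -> bool {
    name == "preview.png" || name == "banner.jpg" || name.contains("-preview.")
}

/// Pick the feature image: og:image first, then RSS media:content,
/// then the first matching local file.
pub fn pick_feature_image(
    og_image: Option<&str>,
    rss_media: Option<&str>,
    local_files: &[&str],
) -> Option<String> {
    og_image
        .filter(|s| !s.is_empty())
        .or(rss_media.filter(|s| !s.is_empty()))
        .or_else(|| local_files.iter().copied().find(|f| is_preview_file(f)))
        .map(str::to_string)
}
```

Empty strings are treated as missing so that a blank og:image tag does not block the later fallbacks.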

Mobiledoc Format

Ghost uses mobiledoc internally. The tool generates:

{
  "version": "0.3.1",
  "markups": [],
  "atoms": [],
  "cards": [["html", {"html": "..."}]],
  "sections": [[10, 0]]
}

For posts with uploaded images, native image cards are interspersed with HTML cards.
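
For the single-card case, the document above can be produced with a small amount of string handling. This is a sketch under assumptions: json_escape and html_to_mobiledoc are hypothetical names, and the escaping is hand-written only to avoid a dependency; a JSON library would be the more robust choice.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters need a \uXXXX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Wrap cleaned HTML in a mobiledoc document with one HTML card.
/// The section `[10, 0]` is a card section pointing at card index 0.
pub fn html_to_mobiledoc(html: &str) -> String {
    format!(
        "{{\"version\":\"0.3.1\",\"markups\":[],\"atoms\":[],\"cards\":[[\"html\",{{\"html\":\"{}\"}}]],\"sections\":[[10,0]]}}",
        json_escape(html)
    )
}
```

With several cards, each one gets its own entry in "cards" and a matching [10, index] entry in "sections".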

Known Limitations / Missing Features

Not Currently Handled

  1. Tables - Ghost may strip complex tables; may need HTML card wrapping

  2. Footnotes - Quarto footnote markup may not survive Ghost processing

Adding New Content Type Support

Pattern to follow:

  1. Detection: Add has_<feature>(content: &str) -> bool function
  2. Regex: Add static RE_<FEATURE>: LazyLock<Regex> if needed
  3. Embedding: Add embed_<feature>(content: &str) -> String if Ghost strips it
  4. Code Injection: Add const <FEATURE>_CODE_INJECTION: &str for CSS/JS
  5. Wiring: Call the new functions from extract_article_content and create_post_from_entry
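
As an illustration of the pattern, here is what steps 1, 3, and 4 might look like for tables, which are listed above as not currently handled. Everything in this sketch is hypothetical: the names follow the convention but do not exist in the tool, the CSS is a placeholder, and nested tables are not handled.

```rust
/// Step 1, detection.
pub fn has_tables(content: &str) -> bool {
    content.contains("<table")
}

/// Step 3, embedding: put each <table>...</table> in its own HTML card.
pub fn embed_tables(content: &str) -> String {
    const OPEN: &str = "<table";
    const CLOSE: &str = "</table>";
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(start) = rest.find(OPEN) {
        let end = match rest[start..].find(CLOSE) {
            Some(rel) => start + rel + CLOSE.len(),
            None => break, // unclosed table: leave the remainder untouched
        };
        out.push_str(&rest[..start]);
        out.push_str("<!--kg-card-begin: html-->\n");
        out.push_str(&rest[start..end]);
        out.push_str("\n<!--kg-card-end: html-->");
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}

/// Step 4, code injection (placeholder CSS).
pub const TABLE_CODE_INJECTION: &str =
    "<style>table { border-collapse: collapse; }</style>";
```

Step 2 is skipped because plain substring search is enough here; step 5 would call has_tables during extraction and add TABLE_CODE_INJECTION when it returns true.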

Important Regex Safety Notes

Quarto Attribute Regex

The RE_QUARTO_ATTR regex must require a # or . prefix after the opening brace; otherwise it also strips LaTeX math arguments:

// CORRECT: Only matches {#id} or {.class}
r"\{[#.][a-zA-Z][^}]*\}"

// WRONG: Would strip LaTeX like {aligned}, {Moon}, {Earth}
r"\{[#.]?[a-zA-Z][^}]*\}"
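
The effect of the CORRECT pattern can be checked with a small hand-written scanner. This is a dependency-free stand-in that mimics the regex (an opening brace, then # or ., then an ASCII letter, then everything up to the next closing brace); strip_quarto_attrs is a hypothetical name and not the tool's implementation.

```rust
/// Remove `{#id}` / `{.class}` attributes, leaving other braces alone.
pub fn strip_quarto_attrs(content: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(open) = rest.find('{') {
        let after = &rest[open + 1..];
        let mut chars = after.chars();
        // Mirrors `[#.]` followed by `[a-zA-Z]`.
        let prefix_ok = matches!(chars.next(), Some('#') | Some('.'));
        let letter_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic());
        match after.find('}') {
            Some(close) if prefix_ok && letter_ok => {
                // Drop the whole `{...}` attribute.
                out.push_str(&rest[..open]);
                rest = &after[close + 1..];
            }
            _ => {
                // Not an attribute: keep the brace and scan on after it.
                out.push_str(&rest[..open + 1]);
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}
```

LaTeX such as \begin{aligned} passes through unchanged because the character after the brace is neither # nor a dot.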

Code Annotation Regex

The outer_cell_regex must require the data-code-annotations attribute; without it, a match can start at one cell and run across into later sections:

// CORRECT: Only matches cells with data-code-annotations
r#"<div[^>]*class="[^"]*cell[^"]*"[^>]*data-code-annotations="[^"]*"[^>]*>..."#

// WRONG: Matches any cell div, may capture content between cells
r#"<div[^>]*class="[^"]*cell[^"]*"[^>]*>..."#

Debugging Tips

  • Run with --verbose for detailed logging
  • Run with --dry-run to see what would happen
  • Check Ghost Admin → Post → Code injection to see injected CSS/JS
  • View page source on Ghost to verify HTML cards preserved content
  • If math content disappears, check that no cleaning regex (RE_QUARTO_ATTR in particular) is matching LaTeX brace syntax