Executable Notebooks (Part 2): Custom Pandoc Filters for Technical Documentation
All posts in this series
- Executable Notebooks (Part 0): Series Overview
- Executable Notebooks (Part 1): Reproducible Analysis Notebooks with Markdown + Python + LaTeX
- Executable Notebooks (Part 2): Custom Pandoc Filters for Technical Documentation
- Executable Notebooks (Part 3): Pre-commit Validation for Technical Documents
- Executable Notebooks (Part 4): LaTeX tcolorbox Environments for Technical Reports
Building custom Pandoc filters to handle Mermaid diagrams, syntax highlighting, TikZ rendering, and conditional code display in a Markdown-to-PDF pipeline.
Note: Code examples in this post are simplified for illustration. The actual filters include additional error handling and edge cases. A complete starter template is available on Gumroad.
Motivation
Pandoc is powerful out of the box, but technical documentation often requires:
- Mermaid diagrams rendered to PDF-compatible images
- Syntax highlighting via minted (superior to built-in highlighting)
- TikZ graphics compiled to images for non-LaTeX outputs
- Conditional display of code blocks (notebook vs document mode)
- File includes with YAML front matter stripping
Each of these requirements is handled by a filter. Pandoc supports both Lua filters (fast, run by Pandoc's embedded interpreter) and JSON filters (any language; the Python filters here use the pandocfilters package).
Filter Chain Architecture
Filters run sequentially, transforming the AST:
pandoc input.md \
  --lua-filter=include-files.lua \
  --lua-filter=notebook-toggle.lua \
  --filter=pandoc-mermaid.py \
  --filter=pandoc-minted.py \
  -o output.pdf
Order matters. The recommended chain:
graph LR
A["Markdown Input"] --> B["include-files.lua<br/>Expand !include directives"]
B --> C["notebook-toggle.lua<br/>Remove suppressed blocks"]
C --> D["pandoc-mermaid.py<br/>Render diagrams"]
D --> E["pandoc-minted.py<br/>Apply syntax highlighting"]
E --> F["PDF Output"]
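Because order matters, it can help to keep the chain in one ordered list and derive the command line from it. The following is a minimal sketch of that idea; the `build_pandoc_command` helper and its rule of treating `.lua` files as Lua filters are illustrative assumptions, not part of the pipeline described here.

```python
from pathlib import Path

# Filters in the order they should run (names taken from the chain above).
FILTER_CHAIN = [
    "include-files.lua",
    "notebook-toggle.lua",
    "pandoc-mermaid.py",
    "pandoc-minted.py",
]


def build_pandoc_command(input_file, output_file, filters=FILTER_CHAIN):
    """Return the pandoc argv list with filters in the given order.

    Assumption: files ending in .lua are Lua filters; everything else is
    treated as a JSON filter and passed via --filter.
    """
    cmd = ["pandoc", str(input_file)]
    for name in filters:
        flag = "--lua-filter" if Path(name).suffix == ".lua" else "--filter"
        cmd.append(f"{flag}={name}")
    cmd += ["-o", str(output_file)]
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` would run the same command as the shell example, with the order fixed in one place.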
Filter 1: Include Files (Lua)
Pandoc does not natively support !include directives. This filter adds that capability:
-- include-files.lua
-- read_include_file is a helper not shown here; it is expected to return
-- the file contents as a string, or nil if the file cannot be read.
function Para(el)
  local text = pandoc.utils.stringify(el)
  local path = text:match("^!include%s+(.+)$")
  if path then
    local content = read_include_file(path)
    if content then
      -- Disable the yaml_metadata_block extension so the included file
      -- cannot inject metadata into the main document
      local doc = pandoc.read(content, "markdown-yaml_metadata_block")
      return doc.blocks
    end
  end
  return el
end
Key Details
Path resolution: Paths are relative to the input document:
local input_dir = "."  -- fallback when no input file is known

function get_input_dir(meta)
  if PANDOC_STATE and PANDOC_STATE.input_files then
    local input_file = PANDOC_STATE.input_files[1]
    if input_file then
      input_dir = pandoc.path.directory(input_file)
    end
  end
  return meta
end
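The directive matching and path resolution are easy to check outside Pandoc. Below is a rough Python equivalent, offered as a sketch: `parse_include` and `resolve_include` are hypothetical names, and trimming trailing whitespace from the path is an added convenience that the Lua pattern does not perform.

```python
import re
from pathlib import Path

# Python counterpart of the Lua pattern "^!include%s+(.+)$".
INCLUDE_RE = re.compile(r"^!include\s+(.+)$")


def parse_include(paragraph_text):
    """Return the include path from a paragraph's text, or None."""
    match = INCLUDE_RE.match(paragraph_text)
    return match.group(1).strip() if match else None


def resolve_include(input_file, include_path):
    """Resolve include_path relative to the directory of input_file."""
    path = Path(include_path)
    if path.is_absolute():
        return path
    return Path(input_file).parent / path
```

For example, `resolve_include("docs/report.md", "chapters/intro.md")` yields `docs/chapters/intro.md` regardless of the directory Pandoc was started from.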
YAML stripping: Included files often have their own front matter. This must be stripped:
function strip_yaml_frontmatter(content)
  local lines = {}
  local in_frontmatter = false
  local frontmatter_ended = false
  local line_num = 0

  for line in content:gmatch("([^\n]*)\n?") do
    line_num = line_num + 1
    if line_num == 1 and line:match("^%s*%-%-%-+%s*$") then
      in_frontmatter = true
    elseif in_frontmatter and line:match("^%s*%-%-%-+%s*$") then
      in_frontmatter = false
      frontmatter_ended = true
    elseif not in_frontmatter then
      table.insert(lines, line)
    end
  end

  return table.concat(lines, "\n")
end
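The same stripping logic can be sketched in Python for unit testing. This version is an approximation rather than a line-for-line port: if the opening delimiter is never closed, it returns the text unchanged instead of dropping everything after the first line.

```python
import re

# A delimiter line: optional whitespace, three or more dashes, optional whitespace.
DELIMITER = re.compile(r"^\s*-{3,}\s*$")


def strip_yaml_frontmatter(content):
    """Drop a leading '---' ... '---' block; return the rest unchanged."""
    lines = content.split("\n")
    if not DELIMITER.match(lines[0]):
        return content  # no front matter on the first line
    for i, line in enumerate(lines[1:], start=1):
        if DELIMITER.match(line):
            return "\n".join(lines[i + 1:])
    return content  # opening delimiter never closed: leave the text alone
```

Only a delimiter on the very first line opens front matter, so a `---` horizontal rule later in the file is left in place.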
Filter registration: A Lua filter file can return a list of filter tables, which Pandoc applies as separate passes in order:
return {
  { Meta = get_input_dir },             -- First pass: capture input directory
  { Para = Para, RawBlock = RawBlock }  -- Second pass: process includes (RawBlock handler not shown)
}
Filter 2: Notebook Toggle (Lua)
Code visibility is controlled via YAML frontmatter. This filter reads the notebook field and removes code blocks accordingly.
Frontmatter Processing
The filter accesses frontmatter via Pandoc’s doc.meta table:
-- notebook-toggle.lua
function Pandoc(doc)
  -- Access the 'notebook' field from YAML frontmatter
  local is_notebook = doc.meta.notebook

  -- YAML booleans (MetaBool) arrive as plain Lua booleans,
  -- so direct comparison works: if is_notebook == false then
  -- ...
end
Pandoc parses the frontmatter and makes all fields available:
---
title: "Analysis Report"
notebook: true
toc: true
custom_field: "any value"
---
In Lua filters:
- doc.meta.title → "Analysis Report" (MetaInlines)
- doc.meta.notebook → true (MetaBool)
- doc.meta.toc → true (MetaBool)
- doc.meta.custom_field → "any value" (MetaInlines)
Document Modes
The notebook field enables dual-mode documents from a single source:
| Mode | Frontmatter | Result |
|---|---|---|
| Notebook | notebook: true | Code blocks visible (except .suppress) |
| Report | notebook: false | All code blocks hidden |
| Default | (field omitted) | Same as notebook: true |
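The mode table reduces to a small predicate. As a sketch (the function name `code_block_visible` is illustrative, and an omitted field is modelled as `None`):

```python
def code_block_visible(notebook, classes):
    """Decide whether a code block survives, following the mode table.

    notebook: value of the 'notebook' front matter field (True, False,
              or None when the field is omitted).
    classes:  list of classes on the code block, e.g. ["python", "suppress"].
    """
    if notebook is False:
        return False  # report mode hides every code block
    # Notebook mode and the default (field omitted) behave the same.
    return "suppress" not in classes
```

Comparing against `False` explicitly, rather than testing truthiness, is what keeps the omitted-field case in notebook mode.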
Frontmatter in LaTeX Templates
Pandoc templates access frontmatter fields via $field$ syntax:
% In template.latex
\title{$title$}
$if(author)$
\author{$author$}
$endif$
$if(toc)$
\tableofcontents
$endif$
$if(abstract)$
\begin{abstract}
$abstract$
\end{abstract}
$endif$
Custom fields work the same way. Define project_id: "XYZ-2024" in frontmatter, then use $project_id$ in the template.
Implementation
-- notebook-toggle.lua
function Pandoc(doc)
  local is_notebook = doc.meta.notebook

  if is_notebook == false then
    -- Document mode: suppress all code.
    -- Note: the Div handler removes every Div, not only code-related ones.
    return doc:walk({
      CodeBlock = function() return {} end,
      Div = function() return {} end
    })
  end

  -- Notebook mode: only suppress .suppress class
  return doc:walk({
    CodeBlock = function(el)
      if el.classes:includes('suppress') then
        return {}
      end
      return el
    end
  })
end
Per-Block Suppression
Even in notebook mode, specific blocks can be suppressed:
```python {.suppress}
# Hidden setup code
API_KEY = load_secret()
```
```python
# This code shows in PDF
print("Hello!")
```
Format-Aware Suppression
Suppression can be limited to LaTeX output (keeping code in HTML):
local function should_suppress()
  -- FORMAT is a global variable in Pandoc filters
  if FORMAT and FORMAT:match('latex') then
    return true
  end
  return false
end
Filter 3: Mermaid Diagrams (Python)
Mermaid diagrams in Markdown must be rendered to PDF images for LaTeX:
```mermaid
graph TD
A[Start] --> B[Process]
B --> C[End]
```
Filter Implementation
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, RawBlock
import subprocess
import hashlib
import os


def mermaid(key, value, format_, meta):
    if key != "CodeBlock":
        return None

    [[ident, classes, keyvals], code] = value
    if "mermaid" not in classes:
        return None

    # Generate deterministic filename from content hash
    code_hash = hashlib.sha256(code.encode()).hexdigest()[:12]
    pdf_file = f"build/mermaid/{code_hash}.pdf"

    # Cache: skip if already rendered
    if not os.path.exists(pdf_file):
        # Make sure the output directory exists before rendering
        os.makedirs(os.path.dirname(pdf_file), exist_ok=True)
        render_mermaid(code, pdf_file)  # defined in the next section

    # Return LaTeX includegraphics
    latex = f"\\includegraphics[width=\\textwidth]{{{pdf_file}}}"
    return RawBlock("latex", latex)


if __name__ == "__main__":
    toJSONFilter(mermaid)
Rendering with mmdc
The Mermaid CLI (mmdc) renders to PDF:
def render_mermaid(code, output_path):
    # Write temp .mmd file
    with open("temp.mmd", "w") as f:
        f.write(code)

    cmd = [
        "mmdc",
        "-i", "temp.mmd",
        "-o", output_path,
        "--outputFormat", "pdf",
        "--pdfFit",
    ]
    # Pass --cssFile only when MERMAID_CSS is set, rather than an empty path
    css_file = os.environ.get("MERMAID_CSS")
    if css_file:
        cmd += ["--cssFile", css_file]

    try:
        subprocess.run(cmd, check=True)
    finally:
        os.remove("temp.mmd")  # clean up even if mmdc fails
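A fixed temp.mmd name can collide when two builds run in the same directory. One possible variant, sketched below, writes the diagram source to a unique temporary file and separates command construction from execution so the former can be tested without mmdc installed. The names `build_mmdc_command` and `render_mermaid_safely` are hypothetical.

```python
import os
import subprocess
import tempfile


def build_mmdc_command(input_path, output_path, css_file=None, mmdc="mmdc"):
    """Build the mmdc argv; --cssFile is added only when a file is given."""
    cmd = [
        mmdc,
        "-i", input_path,
        "-o", output_path,
        "--outputFormat", "pdf",
        "--pdfFit",
    ]
    if css_file:
        cmd += ["--cssFile", css_file]
    return cmd


def render_mermaid_safely(code, output_path, css_file=None):
    """Render Mermaid source to output_path using a unique temp file."""
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", suffix=".mmd", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(code)
    try:
        subprocess.run(
            build_mmdc_command(tmp.name, output_path, css_file), check=True
        )
    finally:
        os.remove(tmp.name)  # clean up even if mmdc fails
```

The temporary file is closed before mmdc runs, which keeps the sketch usable on platforms that do not allow a second process to open a file that is still open for writing.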
Attribute Handling
Width, caption, and centering attributes are supported:
```{.mermaid width="50%" caption="System Architecture" center="true"}
graph TD
A --> B
```
Parsing and LaTeX conversion:
def convert_percentage_to_tex(value):
    """Convert '50%' to '0.5\\textwidth'"""
    if value.endswith("%"):
        percentage = float(value.strip("%")) / 100
        return f"{percentage}\\textwidth"
    return value

# Build LaTeX with figure environment
if center:
    latex = f"""\\begin{{figure}}[h]\\centering
\\includegraphics[width={width}]{{{pdf_file}}}
\\caption{{{caption}}}
\\end{{figure}}"""
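In the JSON AST, block attributes arrive as a list of [key, value] pairs. The sketch below shows one way to turn that list into the LaTeX above; `build_figure` is a hypothetical helper, the 100% default width is an assumption, and the caption is inserted without LaTeX escaping. The conversion helper is repeated so the block is self-contained.

```python
def convert_percentage_to_tex(value):
    """Convert '50%' to '0.5\\textwidth'; pass other values through."""
    if value.endswith("%"):
        return f"{float(value.rstrip('%')) / 100:g}\\textwidth"
    return value


def build_figure(pdf_file, keyvals):
    """Build LaTeX for a rendered diagram from pandoc [key, value] pairs."""
    attrs = dict(keyvals)
    width = convert_percentage_to_tex(attrs.get("width", "100%"))
    graphic = f"\\includegraphics[width={width}]{{{pdf_file}}}"
    caption = attrs.get("caption")
    centered = attrs.get("center") == "true"
    if not centered and not caption:
        return graphic  # no figure environment needed
    lines = ["\\begin{figure}[h]"]
    if centered:
        lines.append("\\centering")
    lines.append(graphic)
    if caption:
        lines.append(f"\\caption{{{caption}}}")
    lines.append("\\end{figure}")
    return "\n".join(lines)
```

Attribute values are always strings in the AST, which is why `center` is compared against the string "true" rather than a boolean.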
Caching
Unchanged diagrams should not be re-rendered:
def get_file_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()

# Inside the render function: check the cache
hash_file = f"{filename}.hash"
if os.path.exists(pdf_file) and os.path.exists(hash_file):
    with open(hash_file) as f:
        if f.read().strip() == get_file_hash(code):
            return  # Skip rendering

# After rendering, save hash
with open(hash_file, "w") as f:
    f.write(get_file_hash(code))
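When the content hash is already part of the output filename, as in the filter implementation earlier, the file's existence alone can serve as the cache check. A minimal sketch of that content-addressed variant follows; `cached_render` and its `render` callback are hypothetical names, with `render` standing in for a function such as render_mermaid.

```python
import hashlib
from pathlib import Path


def cached_render(code, cache_dir, render):
    """Return the PDF path for `code`, calling `render` only on a cache miss.

    render: callable(code, output_path) that writes the output file.
    """
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()[:12]
    output = Path(cache_dir) / f"{digest}.pdf"
    if not output.exists():
        output.parent.mkdir(parents=True, exist_ok=True)
        render(code, str(output))
    return output
```

Any edit to a diagram changes its hash and therefore its filename, so stale files are never reused; the trade-off is that old renders accumulate in the cache directory until it is cleaned.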
Filter 4: Minted Syntax Highlighting (Python)
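Pandoc’s built-in highlighting is limited. The minted package delegates highlighting to Pygments when LaTeX compiles the document, which is why the build needs -shell-escape, and it exposes frame, font size, and background options per environment: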
#!/usr/bin/env python3
from string import Template
from pandocfilters import toJSONFilter, RawBlock, RawInline


def minted(key, value, format, meta):
    if format != 'latex':
        return

    if key == 'CodeBlock':
        [[_, classes, attrs], contents] = value
        language = classes[0] if classes else 'text'

        # Skip empty blocks
        if not contents.strip():
            return None

        # Choose environment based on class
        if 'noexec' in classes:
            env = 'exampleminted'  # Different styling
        else:
            env = 'myminted'

        latex_code = f'\\begin{{{env}}}{{{language}}}\n{contents}\n\\end{{{env}}}'
        return [RawBlock(format, latex_code)]

    elif key == 'Code':
        # Inline code
        [[_, classes, _], contents] = value
        language = classes[0] if classes else 'text'
        return [RawInline(format, f'\\mintinline{{{language}}}{{{contents}}}')]


if __name__ == '__main__':
    toJSONFilter(minted)
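One edge case in the inline branch: code containing an unbalanced brace breaks the brace-delimited form of \mintinline. The minted package also accepts a \verb-style single-character delimiter, so a possible workaround is to pick a delimiter that does not occur in the code. The sketch below assumes the candidate characters have no special meaning elsewhere in the LaTeX document; `mintinline` and `DELIMITERS` are hypothetical names.

```python
# Candidate delimiters, tried in order. Assumption: none of these has
# been given a special meaning elsewhere in the LaTeX document.
DELIMITERS = "|!/+@;"


def mintinline(language, code):
    """Return a \\mintinline call whose delimiter does not occur in `code`."""
    for delim in DELIMITERS:
        if delim not in code:
            return f"\\mintinline{{{language}}}{delim}{code}{delim}"
    raise ValueError("no free delimiter for inline code: " + code)
```

As with \verb, this form can still fail inside arguments of other commands (section titles, footnotes), so it reduces the problem rather than eliminating it.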
Custom Environments
Custom minted environments are defined in the LaTeX preamble:
% In template.latex
\usepackage{minted}
% Standard code block
\newenvironment{myminted}[2][]{%
\VerbatimEnvironment
\begin{minted}[
frame=lines,
framesep=2mm,
fontsize=\small,
#1
]{#2}%
}{%
\end{minted}%
}
% Example code (not executed) - different styling
\newenvironment{exampleminted}[2][]{%
\VerbatimEnvironment
\begin{minted}[
frame=leftline,
framesep=2mm,
fontsize=\small,
bgcolor=gray!10,
#1
]{#2}%
}{%
\end{minted}%
}
Filter 5: TikZ to PNG (Lua)
For DOCX output, TikZ must be rendered to images:
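-- tikz_to_png.lua
local output_dir = "build/tikz"  -- assumed output directory; adjust as needed
local index = 0                   -- running counter for unique file names

function CodeBlock(elem)
  if elem.classes:includes("tikz") then
    index = index + 1
    local png_file = render_tikz_to_png(elem.text, output_dir, index)
    return pandoc.Para { pandoc.Image({}, png_file) }
  end
  return elem
end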
Rendering Pipeline
function render_tikz_to_png(tikz_code, output_dir, index)
  local tex_file = output_dir .. "/tikz_" .. index .. ".tex"
  local pdf_file = output_dir .. "/tikz_" .. index .. ".pdf"
  local png_file = output_dir .. "/tikz_" .. index .. ".png"

  -- Build standalone LaTeX document
  local tex_content = [[
\documentclass{article}
\usepackage{tikz}
\usepackage[active,tightpage]{preview}
\begin{document}
\begin{preview}
]] .. tikz_code .. [[
\end{preview}
\end{document}
]]

  -- Write it to disk so lualatex has something to compile
  local f = assert(io.open(tex_file, "w"))
  f:write(tex_content)
  f:close()

  -- Compile with lualatex
  os.execute("lualatex -output-directory=" .. output_dir .. " " .. tex_file)

  -- Convert PDF to PNG with ImageMagick
  os.execute("convert -density 300 " .. pdf_file .. " " .. png_file)

  return png_file
end
Path Resolution
TikZ code may reference images. Paths must be resolved before compiling:
function replace_image_paths(tikz_code)
  -- gsub returns the new string and a match count; keep only the string
  local result = tikz_code:gsub(
    "\\includegraphics(%b[])(%b{})",
    function(options, filename)
      local file = filename:match("^{(.-)}$")
      -- find_image_file is a helper not shown here
      local full_path = find_image_file(file)
      return "\\includegraphics" .. options .. "{" .. full_path .. "}"
    end
  )
  return result
end
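The rewriting step can be prototyped in Python before committing it to the Lua filter. The sketch below is not a port of the function above: it also matches \includegraphics without an options group, it leaves paths it cannot find unchanged, and `search_dirs` is a hypothetical stand-in for the find_image_file helper, which is not shown in this post.

```python
import re
from pathlib import Path

# Matches \includegraphics with an optional [options] group.
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(\[[^\]]*\])?\{([^}]*)\}")


def replace_image_paths(tikz_code, search_dirs):
    """Rewrite \\includegraphics paths to the first existing match.

    search_dirs: directories to look in, tried in order.
    Paths that cannot be found are left unchanged.
    """
    def resolve(match):
        options = match.group(1) or ""
        name = match.group(2)
        for directory in search_dirs:
            candidate = Path(directory) / name
            if candidate.exists():
                # LaTeX expects forward slashes on every platform.
                return f"\\includegraphics{options}{{{candidate.as_posix()}}}"
        return match.group(0)

    return INCLUDEGRAPHICS.sub(resolve, tikz_code)
```

Leaving unresolved paths untouched means a missing image surfaces as a LaTeX error naming the original file, which tends to be easier to diagnose than a filter crash.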
Integration
Shell Script
#!/bin/bash
# create_pdf.sh
set -euo pipefail

INPUT="$1"             # Markdown source (assumed: first argument)
OUTPUT="${INPUT%.md}"  # output basename (assumed: input name without .md)
FILTER_DIR=".pdf_pipeline/latex/filters"

pandoc "$INPUT" \
  --from markdown+raw_tex \
  --template=template.latex \
  --lua-filter="$FILTER_DIR/include-files.lua" \
  --lua-filter="$FILTER_DIR/notebook-toggle.lua" \
  --filter="$FILTER_DIR/pandoc-mermaid.py" \
  --filter="$FILTER_DIR/pandoc-minted.py" \
  --highlight-style=pygments \
  -o "$OUTPUT.tex"

latexmk -lualatex -shell-escape "$OUTPUT.tex"
Environment Variables
Filters are configured via environment variables:
export MERMAID_BIN="./node_modules/.bin/mmdc"
export MERMAID_FILTER_MERMAID_CSS=".pdf_pipeline/config/mermaid.css"
export MERMAID_OUTPUT_DIR="build/mermaid_images"
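On the Python side, these variables can be read once into a small configuration object. The following is a sketch only: the variable names come from the exports above, while the `MermaidConfig` class, the `load_config` function, and the fallback defaults are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MermaidConfig:
    mmdc_bin: str
    css_file: Optional[str]
    output_dir: str


def load_config(env=None):
    """Read filter settings from the environment, with fallback defaults.

    The defaults ("mmdc", no CSS file, "build/mermaid") are assumptions.
    """
    env = os.environ if env is None else env
    return MermaidConfig(
        mmdc_bin=env.get("MERMAID_BIN", "mmdc"),
        # Treat an unset or empty variable as "no stylesheet".
        css_file=env.get("MERMAID_FILTER_MERMAID_CSS") or None,
        output_dir=env.get("MERMAID_OUTPUT_DIR", "build/mermaid"),
    )
```

Accepting the environment as a parameter keeps the function testable with a plain dictionary, without modifying the real process environment.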
Directory Structure
graph TD
A[".pdf_pipeline"] --> B["latex"]
A --> C["config"]
A --> D["scripts"]
B --> B1["filters"]
B --> B2["template.latex"]
B1 --> B1a["include-files.lua"]
B1 --> B1b["notebook-toggle.lua"]
B1 --> B1c["pandoc-mermaid.py"]
B1 --> B1d["pandoc-minted.py"]
B1 --> B1e["tikz_to_png.lua"]
C --> C1["mermaid-config.json"]
C --> C2["mermaid.css"]
D --> D1["create_pdf.sh"]
Debugging Filters
Lua Filters
Write to stderr:
io.stderr:write("[filter-name] Processing block\n")
Run with --verbose:
pandoc input.md --lua-filter=filter.lua --verbose -o output.pdf
Python Filters
Use stderr (stdout is the AST):
import sys
sys.stderr.write(f"DEBUG: Processing {key}\n")
AST Inspection
Examining what Pandoc parses:
pandoc input.md -t json | python -m json.tool | head -100
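For larger documents, a summary of element types is often more useful than the first hundred lines of JSON. Every node in Pandoc's JSON AST carries its type in a "t" field, so a short recursive walk can count them. The sketch below uses a small hand-written sample document; `count_node_types` is a hypothetical helper.

```python
import json
from collections import Counter


def count_node_types(node, counts=None):
    """Recursively count pandoc AST nodes by their "t" field."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        if "t" in node:
            counts[node["t"]] += 1
        for value in node.values():
            count_node_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_node_types(item, counts)
    return counts


# Hand-written sample in the shape produced by `pandoc -t json`.
SAMPLE = json.loads("""
{"pandoc-api-version": [1, 23], "meta": {}, "blocks": [
  {"t": "Para", "c": [{"t": "Str", "c": "Hello"}, {"t": "Space"},
                      {"t": "Str", "c": "world"}]},
  {"t": "CodeBlock", "c": [["", ["mermaid"], []], "graph TD"]}
]}
""")

counts = count_node_types(SAMPLE)
```

For a real document, the same function can be applied to the result of `json.load` on a file written with `pandoc input.md -t json -o ast.json`.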
Performance Considerations
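- Lua vs Python: Lua filters are approximately 10x faster, because they run inside Pandoc and skip the process startup and JSON serialization of the AST that each external filter incurs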
- Caching: Essential for Mermaid/TikZ (seconds per diagram)
- Filter order: Place fast filters first, expensive ones last
- Conditional processing: Check format early to skip unnecessary work:
def my_filter(key, value, format, meta):
    if format != 'latex':
        return None  # Skip for HTML
Common Pitfalls
Wrong AST Element Type
Filters receive AST elements, not raw text:
# Wrong: treating value as string
if "mermaid" in value:
    ...

# Right: unpack the structure
[[ident, classes, keyvals], code] = value
if "mermaid" in classes:
    ...
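To make the structure concrete, this is the value a JSON filter receives for a fenced block with the attributes {#fig1 .mermaid width="50%"}. The `unpack_code_block` helper is a hypothetical convenience, shown as a sketch.

```python
# What a filter receives as (key, value) for the block described above.
key = "CodeBlock"
value = [["fig1", ["mermaid"], [["width", "50%"]]], "graph TD\n  A --> B"]


def unpack_code_block(value):
    """Split a CodeBlock value into identifier, classes, attributes, code."""
    (ident, classes, keyvals), code = value
    return ident, classes, dict(keyvals), code


ident, classes, attrs, code = unpack_code_block(value)

# The naive membership test looks only at the two top-level items
# (the attribute triple and the code string), so it never matches.
naive_match = "mermaid" in value
```

Here `naive_match` is False even though the block has the mermaid class, which is exactly how the wrong version of the check fails silently.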
Missing Return Statement
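A filter function must return the modified element. Returning nothing (nil in Lua, None in Python) tells Pandoc to keep the original element, so the result of the processing is silently lost:
-- Wrong: no return, so the result of process(el) is discarded
function Para(el)
  process(el)
end

-- Right: return the element
function Para(el)
  return process(el)
end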
Path Issues
Filters run from Pandoc’s working directory, not the input file’s:
-- Use PANDOC_STATE for input file location
local input_dir = pandoc.path.directory(PANDOC_STATE.input_files[1])
local full_path = pandoc.path.join({input_dir, relative_path})
Summary
Custom Pandoc filters bridge the gap between Markdown simplicity and LaTeX power. Key insights:
- Lua for speed: Simple transformations (includes, toggles)
- Python for complexity: External tool integration (Mermaid, minted)
- Aggressive caching: Diagram rendering is slow
- Order matters: Process includes first, expensive filters last
The full filter implementations are included in the starter template on Gumroad.