Executable Notebooks (Part 1): Reproducible Analysis Notebooks with Markdown + Python + LaTeX

Jupyter notebooks mix code, prose, and results but produce messy version control diffs, break when dependencies change, and require manual “Run All” before sharing. LaTeX gives publication-quality output but separates analysis from documentation. Plain Python scripts are reproducible but generate no formatted output.

This system combines the best of all three: write analysis in Markdown with embedded Python code blocks, execute them automatically, and generate PDFs via Pandoc and LaTeX. Code runs fresh on every build, ensuring results match current dependencies. Markdown remains readable in version control. Output is publication-ready without Jupyter’s cell-order and stale-output problems.

This post walks through the complete workflow: Markdown syntax for executable code blocks, a Python execution engine that captures output, Pandoc filters for LaTeX integration, and automation via make and Docker. You’ll build self-documenting analyses that compile to PDFs with plots, tables, and equations.

Note: Code examples in this post are simplified for illustration. The actual implementation includes additional error handling, edge cases, and features. A complete starter template is available on Gumroad.

Quick Start

Get a working PDF in under 5 minutes. Choose your environment, then follow the steps.

Step 0: Environment Setup

Option A: Docker

Use the docker-latex image, a LaTeX build environment tailored for Markdown-to-PDF conversion:

git clone https://github.com/Derrekito/docker-latex.git
cd docker-latex
docker build -t latex-env .

The image includes TeX Live, Pandoc, Python, Pygments, and Mermaid CLI.

Option B: Native Installation (Arch Linux)

Install core dependencies:

# TeX Live (full distribution)
sudo pacman -S texlive-meta

# Document conversion
sudo pacman -S pandoc

# Python + syntax highlighting
sudo pacman -S python python-pygments python-pip

# Mermaid diagrams (optional, for flowcharts)
sudo pacman -S nodejs npm
npm install -g @mermaid-js/mermaid-cli

For minted syntax highlighting, ensure pygmentize is in your PATH:

pygmentize -V  # Should print version

Create a Python virtual environment for project-specific dependencies:

python -m venv .venv
source .venv/bin/activate
pip install pandocfilters
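
Before moving on, it can help to confirm that the command-line tools are actually reachable. The following is a minimal sketch, not part of the original template: the tool list is an assumption based on the programs this post mentions (Pandoc, pygmentize, LuaLaTeX), and `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed tool list, based on the programs this post mentions; adjust as needed.
REQUIRED_TOOLS = ["pandoc", "pygmentize", "lualatex"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

Running it inside the Docker container or the native environment gives a quick answer before the first build fails halfway through.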

Step 1: Your First PDF

Create a file called hello.md:

---
title: "My First Document"
author: "Your Name"
date: "2024-03-15"
---

# Introduction

This is a simple document with **bold** and *italic* text.

## Math Support

Inline math: $E = mc^2$

Display math:

$$
\int_0^\infty e^{-x^2} dx = \frac{\sqrt{\pi}}{2}
$$

## Code Listing

Here's some Python code (not executed, just displayed):

```python
def hello():
    print("Hello, World!")
```

Convert to PDF:

# Docker
docker run --rm -v "$(pwd):/app" latex-env pandoc hello.md -o hello.pdf

# Native
pandoc hello.md -o hello.pdf
Open hello.pdf. You have a professionally typeset document.

Step 2: Add Executable Code

Now let’s make the code run. Create analysis.md:

---
title: "Analysis Report"
notebook: true
---

# Data Analysis

```python
import math

# Calculate something
result = math.sqrt(2) * math.pi
print(f"Result: {result:.4f}")
```

The code above executes and shows output below it.

To execute Python blocks, you need the preprocessor script. For now, here's a minimal version:

#!/usr/bin/env python3
# save as: execute_notebook.py
import contextlib
import re
import sys
from io import StringIO


def execute_notebook(input_file):
    with open(input_file) as f:
        content = f.read()

    namespace = {}  # shared by all blocks, so variables persist between them
    output_lines = []

    # Simple regex for ```python ... ``` blocks
    pattern = r'```python\n(.*?)```'
    last_end = 0

    for match in re.finditer(pattern, content, re.DOTALL):
        # Add content before this block
        output_lines.append(content[last_end:match.start()])

        code = match.group(1)
        output_lines.append(f"```python\n{code}```\n")

        # Capture print output
        stdout = StringIO()
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)

        printed = stdout.getvalue()
        if printed.strip():
            if not printed.endswith("\n"):
                printed += "\n"  # keep the closing fence on its own line
            output_lines.append(f"\n**Output:**\n```text\n{printed}```\n")

        last_end = match.end()

    output_lines.append(content[last_end:])

    # Replace only the trailing extension, not every ".md" in the path
    output_file = re.sub(r'\.md$', '', input_file) + '_executed.md'
    with open(output_file, 'w') as f:
        f.write(''.join(output_lines))

    print(f"Written: {output_file}")


if __name__ == "__main__":
    execute_notebook(sys.argv[1])

Run the pipeline:

# Execute Python blocks
python execute_notebook.py analysis.md

# Convert to PDF
pandoc analysis_executed.md -o analysis.pdf

The PDF now shows both code and its output.

Step 3: Toggle Code Visibility

The notebook frontmatter field controls whether code appears in the PDF.

Create two versions from the same source:

# Notebook mode (code visible) - for tutorials, auditing
pandoc analysis_executed.md -o analysis_notebook.pdf

# Report mode (code hidden) - for publication
# Change "notebook: true" to "notebook: false" in frontmatter
pandoc analysis_executed.md -o analysis_report.pdf

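Plain Pandoc ignores the notebook field, so on their own these two commands produce the same PDF. With the full system (described below), a Lua filter reads the frontmatter value and shows or hides code automatically.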
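
Until the Lua filter is in place, the toggle can be approximated in Python on the executed Markdown. This is a rough sketch, not the filter the full system uses: `notebook_enabled` and `apply_notebook_mode` are hypothetical names, the frontmatter is read with simple string splitting rather than a YAML parser, and the regex assumes code blocks that do not themselves contain triple backticks.

```python
import re

FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
PYTHON_BLOCK = re.compile(r"```python[^\n]*\n.*?```\n?", re.DOTALL)


def notebook_enabled(markdown: str) -> bool:
    """Read `notebook:` from the frontmatter; default to True if it is absent."""
    match = FRONTMATTER.match(markdown)
    if not match:
        return True
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "notebook":
            return value.strip().lower() == "true"
    return True


def apply_notebook_mode(markdown: str) -> str:
    """In report mode, drop Python code blocks but keep their output blocks."""
    if notebook_enabled(markdown):
        return markdown
    return PYTHON_BLOCK.sub("", markdown)
```

Applied to analysis_executed.md with `notebook: false`, this leaves the `**Output:**` text blocks in place and removes only the Python source.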

What’s Next

You now have the core workflow:

  1. Write Markdown with Python code blocks
  2. Execute with preprocessor → _executed.md
  3. Convert with Pandoc → PDF

The rest of this series adds:

  • Modular includes (!include sections/intro.md)
  • Code block modifiers ({.suppress}, {.latex}, {.noexec})
  • Pandoc filters for Mermaid diagrams, syntax highlighting
  • Pre-commit validation for formatting and citations
  • Custom LaTeX environments for styled output boxes

Continue reading for the full implementation, or grab the complete template on Gumroad to start immediately.


Problem Statement

Technical analysis reports require a workflow that:

  • Embeds executable Python code alongside prose
  • Generates professional PDFs with LaTeX formatting
  • Supports modular, reusable content via includes
  • Allows conditional content based on runtime data
  • Keeps source files clean and version-controllable

Jupyter notebooks excel at exploration but are awkward for document generation. LaTeX excels at document formatting but is cumbersome for code execution. This system combines both capabilities.

Solution: Execute and Expand

The core is a ~450-line Python script (execute_and_expand.py) that:

  1. Parses Markdown with fenced Python code blocks
  2. Executes code sequentially with persistent state
  3. Processes !include directives (including dynamic, runtime-conditional ones)
  4. Captures outputs and embeds them in the document
  5. Outputs “executed” Markdown ready for Pandoc/LaTeX

Architecture

graph TD
    A["Data File (Excel/CSV)"] -->|Read data| B["Master.md (with !include directives)"]
    B -->|Process includes & execute code| C["execute_and_expand.py"]
    C -->|Executes Python, evaluates conditions| D["Master_executed.md (fully expanded)"]
    D -->|Convert to LaTeX| E["Pandoc + LuaLaTeX"]
    E -->|Compile| F["Final PDF"]

Basic Usage

YAML Frontmatter

Every master document begins with YAML frontmatter that controls document behavior:

---
title: "Analysis Report"
notebook: true
toc: true
bibliography: references/bibliography.bib
csl: ieee.csl
---

Required Fields

| Field | Purpose | Example |
|-------|---------|---------|
| title | Document title (appears in PDF header) | "SEU Cross-Section Analysis" |

Display Control

| Field | Values | Effect |
|-------|--------|--------|
| notebook | true | Show code blocks in PDF (tutorial/notebook mode) |
| | false | Hide all code blocks (report/publication mode) |
| toc | true | Generate table of contents |
| | false | Omit table of contents |

The notebook field is the key to dual-mode documents. The same source file can produce:

  • Notebook mode (notebook: true): Full code + output for learning/auditing
  • Report mode (notebook: false): Clean prose + results for publication

Even in notebook mode, individual blocks can be suppressed with {.suppress}.

Bibliography Fields

| Field | Purpose | Example |
|-------|---------|---------|
| bibliography | Path to BibTeX file | references/bibliography.bib |
| csl | Citation style (optional) | ieee.csl, apa.csl |

Additional Fields

| Field | Purpose | Example |
|-------|---------|---------|
| author | Document author(s) | "Jane Doe" |
| date | Document date | "2024-03-15" |
| abstract | Document abstract | "This report analyzes..." |
| keywords | Search keywords | [radiation, SEU, cross-section] |
| documentclass | LaTeX document class | article, report |
| geometry | Page geometry | margin=1in |
| fontsize | Base font size | 11pt, 12pt |
| header-includes | Raw LaTeX for preamble | Custom package imports |

Complete Example

---
title: "SEU Cross-Section Analysis: Device XYZ"
author: "Test Engineering Team"
date: "2024-03-15"
abstract: |
  This report presents single-event upset cross-section measurements
  for Device XYZ under heavy-ion irradiation at TAMU cyclotron facility.
notebook: true
toc: true
bibliography: references/bibliography.bib
csl: ieee.csl
geometry: margin=1in
fontsize: 11pt
header-includes: |
  \usepackage{siunitx}
  \usepackage{booktabs}
---

The frontmatter is processed by Pandoc and passed to the LaTeX template. Custom templates can access any field via $field$ syntax.

Master Document Structure

A master document ties together the frontmatter and modular content:

---
title: "Analysis Report"
notebook: true
toc: true
---

!include sections/01_introduction.md
!include sections/02_data-loading.md
!include sections/03_analysis.md
!include sections/04_results.md

Executable Code Blocks

Standard Python fenced blocks are executed:

```python
import pandas as pd

df = pd.read_excel(__data_file__)
print(f"Loaded {len(df)} rows")
```

After execution, the output appears below the code:

```python
import pandas as pd

df = pd.read_excel(__data_file__)
print(f"Loaded {len(df)} rows")
```

**Output:**
```text
Loaded 42 rows
```

Code Block Modifiers

Execution and display are controlled with CSS-like classes in curly braces after the language identifier.

Available Modifiers

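| Modifier | Executes | Shows Code | Shows Output | Handled By |
|----------|----------|------------|--------------|------------|
| (none) | Yes | Yes | Yes | - |
| {.suppress} | Yes | No | Yes | Pandoc filter |
| {.noexec} | No | Yes | N/A | Python preprocessor |
| {.nooutput} | Yes | Yes | No | Python preprocessor |
| {.latex} | Yes | Yes | As raw LaTeX | Python preprocessor |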

Processing Stages

Modifiers are handled at two stages:

  1. Python preprocessor (execute_and_expand.py):
    • .noexec - skips execution entirely
    • .latex - treats print() output as raw LaTeX
    • .nooutput - executes but discards output
  2. Pandoc filter (notebook-toggle.lua):
    • .suppress - removes code block from PDF, keeps output

Examples

Suppress code, show output - hide boilerplate setup:

```python {.suppress}
# Reader doesn't need to see this
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
plt.rcParams['figure.dpi'] = 150
```

The PDF omits the code and shows only whatever output the block produces (none, in this case).

Show code, do not execute - documentation examples:

```python {.noexec}
# Example API usage (not actually run)
client = APIClient(key="your-key-here")
result = client.query("example")
```

Execute silently - setup without output clutter:

```python {.nooutput}
CONFIG = {"threshold": 0.05, "iterations": 1000}
```

Raw LaTeX output - tables, equations, custom formatting:

```python {.latex}
print(r"\begin{equation}")
print(r"E = mc^2")
print(r"\end{equation}")
```

Creating Custom Modifiers

To add a custom modifier, edit execute_and_expand.py:

# Modifiers are detected by checking the code fence line:
code_fence = lines[i]  # e.g., "```python {.mymodifier}"

# Add your check:
my_flag = '.mymodifier' in code_fence

# Use in processing logic:
if my_flag:
    # Custom behavior
    pass

For display-only modifiers (like .suppress), edit notebook-toggle.lua to filter blocks based on class.
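
A plain substring check also fires on fence lines that merely contain the modifier text (for example a class named `.latexish`). A slightly stricter approach is to parse the fence into a language and a set of classes first. The sketch below is one way to do that; `BlockOptions` and `parse_fence` are hypothetical names and are not taken from execute_and_expand.py. It covers only the three modifiers the Python preprocessor handles and leaves `.suppress` to the Pandoc filter.

```python
import re
from dataclasses import dataclass

# Matches an opening fence such as: ```python {.latex}
FENCE = re.compile(r"^```(?P<lang>\w+)\s*(?:\{(?P<attrs>[^}]*)\})?\s*$")


@dataclass
class BlockOptions:
    lang: str
    execute: bool = True      # False for .noexec
    keep_output: bool = True  # False for .nooutput
    raw_latex: bool = False   # True for .latex


def parse_fence(line: str):
    """Return BlockOptions for an opening fence line, or None if it is not one."""
    match = FENCE.match(line.strip())
    if match is None:
        return None
    classes = set((match.group("attrs") or "").split())
    return BlockOptions(
        lang=match.group("lang"),
        execute=".noexec" not in classes,
        keep_output=".nooutput" not in classes,
        raw_latex=".latex" in classes,
    )
```

A custom modifier then becomes one more field on the dataclass and one more membership test, instead of another string search scattered through the processing loop.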

LaTeX Output

The {.latex} modifier treats print() output as raw LaTeX, enabling programmatic generation of tables, equations, and styled content.

Tables from Data

```python {.latex}
print(r"\begin{center}")
print(r"\begin{tabular}{lrr}")
print(r"\toprule")
print(r"\textbf{Name} & \textbf{Value} & \textbf{Error} \\")
print(r"\midrule")
for name, val, err in results:
    print(rf"{name} & {val:.3f} & {err:.3f} \\")
print(r"\bottomrule")
print(r"\end{tabular}")
print(r"\end{center}")
```
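
The example above assumes a `results` variable defined in an earlier block. For a self-contained variant, the table can be built by a small helper that also escapes characters LaTeX treats specially in text cells. This is a sketch under assumptions: `escape_latex` and `results_table` are hypothetical helpers, the escape list is deliberately short, and the data is made up for illustration.

```python
# A few characters that LaTeX treats specially in text mode (not exhaustive).
_LATEX_ESCAPES = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text: str) -> str:
    """Escape common special characters in a text cell."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


def results_table(rows) -> str:
    """Build a booktabs table from (name, value, error) tuples."""
    lines = [
        r"\begin{center}",
        r"\begin{tabular}{lrr}",
        r"\toprule",
        r"\textbf{Name} & \textbf{Value} & \textbf{Error} \\",
        r"\midrule",
    ]
    for name, value, error in rows:
        lines.append(rf"{escape_latex(name)} & {value:.3f} & {error:.3f} \\")
    lines += [r"\bottomrule", r"\end{tabular}", r"\end{center}"]
    return "\n".join(lines)


# Hypothetical data, for illustration only.
results = [("fit_a", 1.2345, 0.0123), ("fit_b", 0.5, 0.25)]
print(results_table(results))
```

Without the escaping step, a name such as `fit_a` would put a bare underscore into text mode and the LaTeX build would fail.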

Equations

```python {.latex}
print(r"\begin{equation}")
print(rf"\sigma_{{sat}} = {sigma_sat:.3e} \text{{ cm}}^2")
print(r"\end{equation}")
```

Styled Output Boxes

Reusable tcolorbox environments can be defined and called from Python:

```python {.latex}
from src.print_helpers import begin_resultbox, end_resultbox

begin_resultbox("FITTED PARAMETERS")
print(r"\begin{center}")
print(r"\begin{tabular}{ll}")
print(rf"$\sigma$ & ${sigma:.3e}$ \\")
print(rf"$\alpha$ & ${alpha:.3f}$ \\")
print(r"\end{tabular}")
print(r"\end{center}")
end_resultbox()
```

The helper functions emit the LaTeX boilerplate:

def begin_resultbox(title="RESULT"):
    print(r"\begin{resultbox}[title={\textcolor{white}{\textbf{" + title + r"}}}]")

def end_resultbox():
    print(r"\end{resultbox}")
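
One possible refinement, not part of the original helpers, is to wrap the begin/end pair in a context manager so the environment cannot be left open by accident. The name `resultbox` for the Python function is an assumption chosen to mirror the LaTeX environment, and the printed value is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def resultbox(title="RESULT"):
    """Print the opening and closing lines of a resultbox around the body."""
    print(r"\begin{resultbox}[title={\textcolor{white}{\textbf{" + title + r"}}}]")
    try:
        yield
    finally:
        # Always close the environment, even if the body raises.
        print(r"\end{resultbox}")


# Usage inside a {.latex} block (the value is hypothetical):
with resultbox("FITTED PARAMETERS"):
    print(r"$\sigma = 1.23 \times 10^{-5}$")
```

The trade-off is one extra level of indentation in the calling block in exchange for a guaranteed matching `\end{resultbox}`.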

Dynamic Styling

Box colors can change based on validation results:

```python {.latex}
if test_passed:
    color = "green!70!black"
    status = "PASS"
else:
    color = "red!70!black"
    status = "FAIL"

print(rf"\begin{{statusbox}}{{{color}}}[title={{\textcolor{{white}}{{\textbf{{{status}}}}}}}]")
print(rf"Result: {value:.3f}")
print(rf"Threshold: {threshold:.3f}")
print(r"\end{statusbox}")
```

Inline Math in Markdown

Outside of code blocks, standard LaTeX math syntax applies:

The coefficient was $\alpha = 1.23 \times 10^{-5}$.

For display equations:

$$
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
$$

Persistent State

All code blocks share a namespace. Variables defined in one block are available in subsequent blocks:

```python
# Block 1: Load data
df = pd.read_excel(__data_file__)
USE_BOOTSTRAP = len(df) > 10
```

```python
# Block 2: Uses variables from Block 1
if USE_BOOTSTRAP:
    results = run_bootstrap(df, n_iterations=1000)
```
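
The mechanism behind this is small enough to show in isolation. The sketch below mirrors what the minimal script in the Quick Start does: every block is passed to `exec` with the same dictionary, so names defined by one block are visible to the next. The two block strings are invented for illustration.

```python
import contextlib
import io

namespace = {}  # one dict shared by every block

# Two hypothetical code blocks, in document order.
blocks = [
    "threshold = 0.05\nvalues = [0.01, 0.2, 0.03]",
    "hits = [v for v in values if v < threshold]\nprint(len(hits))",
]

outputs = []
for code in blocks:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # same dict each time, so names persist
    outputs.append(buffer.getvalue())
```

Passing a single dictionary (rather than separate globals and locals) matters: it makes block-level names behave like module globals, which is what comprehensions and functions defined inside a block expect.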

Dynamic Includes

A powerful capability: conditionally include content based on runtime state.

Printing an !include statement from code causes the preprocessor to process it:

```python
if event_count > 0:
    print("!include methods/standard_analysis.md")
else:
    print("!include methods/zero_event_handling.md")
```

The preprocessor:

  1. Executes the code
  2. Sees !include in the output
  3. Processes the include
  4. Recursively executes code in the included file

This enables data-driven document assembly.
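
The four steps can be sketched in a few lines. This is an illustration of the idea rather than the code in execute_and_expand.py: `expand` is a hypothetical function, include paths are assumed to be relative to a single base directory, the depth limit is an arbitrary guard against include cycles, and for brevity each code block is replaced by its output instead of being echoed alongside it.

```python
import contextlib
import io
import re
from pathlib import Path

PYTHON_BLOCK = re.compile(r"```python\n(.*?)```\n?", re.DOTALL)
INCLUDE_LINE = re.compile(r"^!include\s+(\S+)\s*$")


def expand(text: str, base: Path, namespace: dict, depth: int = 0) -> str:
    """Run Python blocks in `text` and expand any !include lines they print."""
    if depth > 10:  # arbitrary guard against include cycles
        raise RecursionError("include depth exceeded")

    def run_block(match):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), namespace)
        parts = []
        for line in buffer.getvalue().splitlines():
            include = INCLUDE_LINE.match(line)
            if include:
                included = (base / include.group(1)).read_text()
                # Included files may contain code blocks of their own.
                parts.append(expand(included, base, namespace, depth + 1))
            else:
                parts.append(line + "\n")
        return "".join(parts)

    return PYTHON_BLOCK.sub(run_block, text)
```

Because the included file is expanded with the same namespace, code inside it sees every variable defined before the include was printed.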

Auto-Figure Embedding

The script monkey-patches matplotlib.figure.Figure.savefig to track saved figures. When code saves a PDF figure:

fig, ax = plt.subplots()
ax.plot(x, y)
fig.savefig("output/my_plot.pdf")

The figure reference is automatically inserted:

![my plot](/absolute/path/to/output/my_plot.pdf)

No manual image embedding is required.
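
A monkey-patch of this kind might look roughly like the sketch below. It is not the script's actual code: `track_savefig` and `figure_markdown` are hypothetical names, the alt text is assumed to come from the file stem, and the wrapper is written against any class with a `savefig` method so it can be tried without Matplotlib installed.

```python
import functools
import os
from pathlib import Path


def track_savefig(figure_cls, saved_paths):
    """Wrap figure_cls.savefig so each saved file path is appended to saved_paths."""
    original = figure_cls.savefig

    @functools.wraps(original)
    def savefig(self, fname, *args, **kwargs):
        result = original(self, fname, *args, **kwargs)
        # Record only real paths, and only after the save succeeded.
        if isinstance(fname, (str, os.PathLike)):
            saved_paths.append(Path(fname).resolve())
        return result

    figure_cls.savefig = savefig
    return original  # keep a handle so the patch can be undone


def figure_markdown(path: Path) -> str:
    """Build an image reference, using the file stem as alt text."""
    alt_text = path.stem.replace("_", " ")
    return f"![{alt_text}]({path})"


# With Matplotlib installed, the hook would be installed roughly like this:
#   from matplotlib.figure import Figure
#   saved = []
#   track_savefig(Figure, saved)
```

After each code block runs, the preprocessor can drain the list of saved paths and append one `figure_markdown` line per figure to that block's output.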

Bibliography and Citations

Pandoc’s citation syntax works with a BibTeX bibliography.

Setup

Add bibliography to the YAML front matter:

---
title: "My Report"
bibliography: references/bibliography.bib
csl: ieee.csl  # Optional citation style
---

Citing Sources

Use \cite{key} directly in prose or in the output of .latex blocks:

According to \cite{smith2023}, the effect was significant.

Multiple citations \cite{smith2023,jones2024} support this.

Bibliography File

Standard BibTeX format in references/bibliography.bib:

@article{smith2023,
  title   = {The Effect of X on Y},
  author  = {Smith, John and Doe, Jane},
  journal = {Journal of Examples},
  year    = {2023},
  volume  = {42},
  pages   = {1--10}
}

@inproceedings{jones2024,
  title     = {A New Approach},
  author    = {Jones, Alice},
  booktitle = {Conference Proceedings},
  year      = {2024}
}

Validation

The pre-commit validator checks that all \cite{key} references exist in the bibliography:

make verify-citations
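
The check itself is simple to reason about. The validator in the template is a shell script (citation-validator.sh); the Python below is only a sketch of the same idea with hypothetical function names. The regexes handle the common `\cite{a,b}` and `@type{key,` forms and would also pick up non-entry items such as `@string{...}` definitions.

```python
import re

CITE = re.compile(r"\\cite\{([^}]*)\}")
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(markdown: str) -> set:
    """Collect keys from \\cite{a,b} commands."""
    keys = set()
    for group in CITE.findall(markdown):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys


def bibliography_keys(bibtex: str) -> set:
    """Collect entry keys such as smith2023 from @article{smith2023, ...}."""
    return set(BIB_ENTRY.findall(bibtex))


def missing_citations(markdown: str, bibtex: str) -> set:
    """Return cited keys that have no matching BibTeX entry."""
    return cited_keys(markdown) - bibliography_keys(bibtex)
```

A non-empty result from `missing_citations` is the condition under which a pre-commit hook would reject the commit.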

PDF Pipeline

After execution, Pandoc converts the Markdown to LaTeX, which LuaLaTeX compiles to PDF:

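#!/bin/bash
# Simplified version of create_pdf.sh
set -euo pipefail

INPUT="$1"             # executed Markdown file, passed as the first argument
OUTPUT="${INPUT%.md}"  # same path without the .md extension

pandoc "$INPUT" \
  --from markdown+raw_tex \
  --template=template.latex \
  --lua-filter=include-files.lua \
  --lua-filter=notebook-toggle.lua \
  --filter=pandoc-minted.py \
  -o "$OUTPUT.tex"

latexmk -lualatex -shell-escape "$OUTPUT.tex"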

Key Pandoc filters:

  • include-files.lua: Handles !include directives Pandoc does not process
  • notebook-toggle.lua: Respects notebook: true/false to show/hide code
  • pandoc-minted.py: Syntax highlighting via minted package

Makefile Integration

A Makefile orchestrates the pipeline:

NOTEBOOK ?= Project
SCRIPTS  := notebooks/scripts
PIPELINE := .pdf_pipeline/scripts

.PHONY: notebook notebook-execute notebook-pdf clean-notebooks

notebook: notebook-execute notebook-pdf

notebook-execute:
	python $(SCRIPTS)/execute_and_expand.py \
	  notebooks/$(NOTEBOOK)_Master.md \
	  --output notebooks/$(NOTEBOOK)_Master_executed.md

notebook-pdf:
	$(PIPELINE)/create_pdf.sh notebooks/$(NOTEBOOK)_Master_executed.md

clean-notebooks:
	rm -f notebooks/*_executed.md
	rm -rf notebooks/output/*

Usage:

make notebook NOTEBOOK=BrainChip   # Full pipeline
make notebook-execute              # Execute only (no PDF)
make notebook-pdf                  # PDF only (from existing executed file)

Validation

Pre-commit hooks catch errors before they enter version control:

.PHONY: verify verify-structure verify-citations

verify: verify-structure verify-citations
	@echo "All checks passed"

verify-structure:
	@.validation/scripts/structural-validator.sh

verify-citations:
	@.validation/scripts/citation-validator.sh

Usage:

make verify           # Run all checks
make verify-structure # Check includes only
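
As an illustration of what a structural check on includes might involve, the sketch below lists static `!include` targets that do not exist on disk. It is not the template's structural-validator.sh: `missing_includes` is a hypothetical helper, paths are assumed to be relative to the master document's directory, and dynamic includes printed from code are deliberately ignored because their targets are only known at run time.

```python
import re
from pathlib import Path

# Only lines that start with !include count; print("!include ...") is skipped.
INCLUDE = re.compile(r"^!include[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def missing_includes(master: Path) -> list:
    """List static !include targets in `master` that do not exist on disk."""
    targets = INCLUDE.findall(master.read_text())
    return [target for target in targets if not (master.parent / target).is_file()]
```

An empty list means every statically included section is present; anything else is a candidate for failing the hook.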

Directory Structure

project/
├── notebooks/
│   ├── *_Master.md           # Master documents
│   ├── ProjectA/             # Modular sections
│   │   ├── 10_intro.md
│   │   ├── 20_data.md
│   │   └── 30_analysis.md
│   ├── common/               # Shared content
│   │   ├── theory/           # Reusable appendices
│   │   └── methods/          # Shared implementations
│   ├── scripts/              # execute_and_expand.py
│   ├── src/                  # Python utilities
│   ├── data/                 # Input data files
│   ├── output/               # Generated figures
│   └── pdf/                  # Final PDFs
├── .pdf_pipeline/
│   ├── latex/
│   │   ├── template.latex
│   │   └── filters/
│   └── scripts/
│       └── create_pdf.sh
└── Makefile

Example: Conditional Appendices

Real-world use case: include statistical method appendices only when that method was used:

```python
# Determine which methods apply based on data
has_zero_events = (df['events'] == 0).any()
used_bootstrap = n_events >= 10

# Dynamically include relevant theory
if has_zero_events:
    print("!include common/theory/ZeroEventMethods.md")

if used_bootstrap:
    print("!include common/theory/BootstrapTheory.md")
```

The generated PDF only includes appendices for methods actually used in the analysis.

Comparison with Jupyter

| Feature | Jupyter | This System |
|---------|---------|-------------|
| Code execution | Yes | Yes |
| Version control | Awkward (JSON) | Clean (Markdown) |
| Modular includes | No | Yes |
| Conditional content | No | Yes |
| LaTeX output | Limited | Native |
| PDF generation | nbconvert (limited) | Full LaTeX |
| Reproducibility | Cell order issues | Sequential by design |

Limitations

  • Not interactive: No cell-by-cell execution during development
  • Python only: Extension to other languages is possible but has not been needed
  • Requires LaTeX: Full TeX Live install for PDF generation
  • Single namespace: All code shares state (feature and limitation)
  • Setup required: Needs Python venv (make venv) and dependencies installed

For interactive development, vim-medieval or similar tools can execute individual blocks, then the full preprocessor runs for document generation.

Summary

This system bridges the gap between executable notebooks and publication-quality documents. The key insight is that treating !include directives as executable output enables data-driven document assembly. Combined with Pandoc’s flexibility and LaTeX’s typesetting, this yields reproducible, professional technical documents.

A complete starter template with all features described in this series is available on Gumroad.

This post is licensed under CC BY 4.0 by the author.