Executable Notebooks (Part 3): Pre-commit Validation for Technical Documents

Building a pre-commit validation pipeline that catches formatting errors, broken includes, missing citations, and accidentally staged build artifacts before they pollute the repository.

Note: Code examples in this post are simplified for illustration. The actual validators include additional patterns and edge cases. A complete starter template is available on Gumroad.

Problem Statement

Technical documentation projects accumulate subtle errors:

  • LaTeX math syntax mixed with Unicode where it should not be
  • !include directives pointing to renamed or deleted files
  • Citations referencing keys that do not exist in the bibliography
  • Executed notebooks and build artifacts accidentally committed

These errors slip through because they do not break anything immediately. The document still compiles. The PDF still generates. But reproducibility suffers, collaborators become confused, and the repository fills with extraneous files.

Solution: Validation Pipeline

A set of bash scripts run as a pre-commit hook:

graph TD
    A[".validation/"] --> B["scripts/"]
    B --> C["run-all-validators.sh<br/>(Orchestrator)"]
    B --> D["formatting-validator.sh<br/>(LaTeX/Unicode rules)"]
    B --> E["structural-validator.sh<br/>(Include file checks)"]
    B --> F["citation-validator.sh<br/>(Bibliography verification)"]
    B --> G["clean-check.sh<br/>(No generated files)"]

Each validator:

  • Exits 0 on pass, 1 on failure
  • Generates a detailed report in build/
  • Only checks staged files (fast)
  • Provides actionable error messages

Orchestrator Implementation

run-all-validators.sh coordinates all validators:

#!/bin/bash
set -e

PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
VALIDATION_DIR="${PROJECT_ROOT}/.validation/scripts"

# Track overall status
OVERALL_STATUS=0

# Define validators
declare -A VALIDATORS=(
    ["Formatting"]="$VALIDATION_DIR/formatting-validator.sh"
    ["Structure"]="$VALIDATION_DIR/structural-validator.sh"
    ["Citations"]="$VALIDATION_DIR/citation-validator.sh"
    ["Clean check"]="$VALIDATION_DIR/clean-check.sh"
)

# Run each validator
for name in "Formatting" "Structure" "Citations" "Clean check"; do
    script="${VALIDATORS[$name]}"

    echo -n "Running $name validation... "

    if bash "$script" 2>&1; then
        echo "[PASS]"
    else
        echo "[FAIL]"
        OVERALL_STATUS=1
    fi
done

exit $OVERALL_STATUS
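
The orchestrator above relies on declare -A, which needs bash 4.0 or newer; the /bin/bash shipped with macOS is 3.2. As a minimal sketch of an alternative, names and scripts can be kept together as ordered "Name|script" pairs. The function name run_validators and the pair format are assumptions made for illustration, and validator output is discarded here on the assumption that the details go to the reports in build/.

```bash
# Hypothetical alternative to the associative array: ordered "Name|script" pairs.
# Returns 0 only if every validator script exits 0.
run_validators() {
    overall=0
    for entry in "$@"; do
        name=${entry%%|*}     # text before the first "|"
        script=${entry#*|}    # text after the first "|"
        printf 'Running %s validation... ' "$name"
        if bash "$script" > /dev/null 2>&1; then
            echo "[PASS]"
        else
            echo "[FAIL]"
            overall=1
        fi
    done
    return "$overall"
}
```

A call such as run_validators "Formatting|$VALIDATION_DIR/formatting-validator.sh" "Citations|$VALIDATION_DIR/citation-validator.sh" keeps the execution order explicit without a separate name list.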

Output example:

Running pre-commit validation pipeline...

Running Formatting validation... [PASS]
Running Structure validation... [PASS]
Running Citations validation... [FAIL]
Running Clean check... [PASS]

Validation failed!
Review report: build/validation-summary.txt
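
The output example includes closing lines that the simplified orchestrator listing does not print. One way to produce them is a small helper like the sketch below. The name print_summary is hypothetical, and the messages are copied from the example output in this post.

```bash
# Hypothetical helper: print the closing lines of an orchestrator run.
# $1 is the overall status (0 = every validator passed); it is also returned.
print_summary() {
    echo ""
    if [ "$1" -eq 0 ]; then
        echo "All validation checks passed!"
    else
        echo "Validation failed!"
        echo "Review report: build/validation-summary.txt"
        echo ""
        echo "To bypass (NOT recommended): git commit --no-verify"
    fi
    return "$1"
}
```

In the orchestrator, the final line could then become print_summary "$OVERALL_STATUS" || exit 1, so the hook still fails the commit when any validator fails.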

Validator 1: Formatting

Enforces consistent math notation:

Context          Use       Example
Python code      Unicode   σ_sat, →, ·
Plot labels      LaTeX     $\sigma_{sat}$
Markdown text    LaTeX     $\alpha$, $$...$$
.latex blocks    LaTeX     Raw LaTeX output

Implementation

#!/bin/bash
# formatting-validator.sh

STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$' || true)
VIOLATIONS=0

# Patterns are kept in variables so bash treats them as plain POSIX EREs
# ([[:space:]] is portable; \s is not)
fence_open_re='^```python'
fence_close_re='^```'
latex_attr_re='\{.*\.latex.*\}'
label_re='(xlabel|ylabel|title)[[:space:]]*[=(]'
comment_re='^[[:space:]]*#'
latex_re='\$[^f"]'

for file in $STAGED_FILES; do
    in_python_block=false
    has_latex_attribute=false
    line_num=0

    while IFS= read -r line; do
        line_num=$((line_num + 1))

        # Track code block state
        if [[ "$line" =~ $fence_open_re ]]; then
            in_python_block=true
            # Check for .latex attribute
            [[ "$line" =~ $latex_attr_re ]] && has_latex_attribute=true
            continue
        fi

        if [[ "$line" =~ $fence_close_re ]] && [ "$in_python_block" = true ]; then
            in_python_block=false
            has_latex_attribute=false
            continue
        fi

        # Check for violations in Python blocks
        if [ "$in_python_block" = true ] && [ "$has_latex_attribute" = false ]; then
            # Skip plot labels (allowed to have LaTeX): xlabel=... or plt.ylabel(...)
            if [[ "$line" =~ $label_re ]]; then
                continue
            fi

            # Flag LaTeX $ in Python code, ignoring comment-only lines
            if [[ "$line" =~ $latex_re ]] && [[ ! "$line" =~ $comment_re ]]; then
                echo "VIOLATION: LaTeX in Python code at $file:$line_num"
                VIOLATIONS=$((VIOLATIONS + 1))
            fi
        fi
    done < "$file"
done

# Exit 1 if anything was flagged
[ "$VIOLATIONS" -eq 0 ] || exit 1

Detection Examples

# VIOLATION - LaTeX in Python code
coefficient = $\alpha_{max}$  # Line flagged

# OK - Unicode in Python
coefficient = α_max

# OK - LaTeX in plot labels (whitelisted)
plt.ylabel(r'$\alpha_{max}$ (units)')

Code blocks with the .latex modifier are also whitelisted:

```python {.latex}
print(r"\alpha_{max}")
```
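
To try the heuristics on single lines without staging anything, the per-line decision can be pulled into a function. The sketch below is an illustration, not the author's validator: the name is_latex_violation is hypothetical, and it assumes the same three rules as above (skip comment-only lines, skip plot labels, flag a $ that is not followed by f or a double quote).

```bash
# Hypothetical helper: return 0 when a line of Python code should be flagged.
is_latex_violation() {
    line=$1
    # Comment-only lines are ignored.
    if printf '%s\n' "$line" | grep -Eq '^[[:space:]]*#'; then
        return 1
    fi
    # Plot labels may contain LaTeX: xlabel=..., plt.ylabel(...), title(...).
    if printf '%s\n' "$line" | grep -Eq '(xlabel|ylabel|title)[[:space:]]*[=(]'; then
        return 1
    fi
    # A "$" followed by anything other than f or a double quote is flagged.
    printf '%s\n' "$line" | grep -Eq '\$[^f"]'
}
```

For example, is_latex_violation 'coefficient = $\alpha_{max}$' succeeds, while the Unicode and plt.ylabel lines from the detection examples do not.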

Validator 2: Structure

Verifies !include directives resolve to existing files:

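#!/bin/bash
# structural-validator.sh

STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$' || true)
VIOLATIONS=0

for file in $STAGED_FILES; do
    line_num=0

    while IFS= read -r line; do
        line_num=$((line_num + 1))

        # Match !include directives
        if [[ "$line" =~ ^!include[[:space:]]+(.+)$ ]]; then
            include_path="${BASH_REMATCH[1]}"
            # Trim trailing whitespace so "file.md  " still resolves
            include_path="${include_path%"${include_path##*[![:space:]]}"}"

            # Resolve relative to file's directory
            file_dir=$(dirname "$file")
            full_path="$file_dir/$include_path"

            if [ ! -f "$full_path" ]; then
                echo "VIOLATION: Include not found"
                echo "  File: $file:$line_num"
                echo "  Include: $include_path"
                echo "  Expected: $full_path"
                VIOLATIONS=$((VIOLATIONS + 1))
            fi
        fi
    done < "$file"
done

# Exit 1 if any include is missing
[ "$VIOLATIONS" -eq 0 ] || exit 1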

Detection Examples

!include sections/introduction.md      # OK if file exists
!include sections/introdcution.md      # FAIL - typo
!include deleted_section.md            # FAIL - file removed
!include ../common/theory.md           # OK - relative path resolved
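
The same check can be run against one file by hand, outside the hook. The following is a minimal sketch under stated assumptions: check_includes is a hypothetical name, directives are assumed to start at the beginning of a line, and surrounding whitespace (including a trailing carriage return) is trimmed before the path is resolved.

```bash
# Hypothetical helper: print "file:line: include" for every unresolved !include.
# Returns 1 if at least one include is missing, 0 otherwise.
check_includes() {
    file=$1
    file_dir=$(dirname "$file")
    status=0
    line_num=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        line_num=$((line_num + 1))
        case $line in
            '!include'[[:space:]]*) ;;
            *) continue ;;
        esac
        # Drop the directive, then leading and trailing whitespace.
        include_path=$(printf '%s\n' "${line#'!include'}" \
            | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')
        if [ ! -f "$file_dir/$include_path" ]; then
            echo "$file:$line_num: $include_path"
            status=1
        fi
    done < "$file"
    return "$status"
}
```

Running check_includes notebooks/analysis.md prints one line per broken include and returns a non-zero status if any were found.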

Validator 3: Citations

Cross-references \cite{key} patterns against bibliography.bib:

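#!/bin/bash
# citation-validator.sh

PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
BIB_FILE="${PROJECT_ROOT}/references/bibliography.bib"
STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$' || true)
VIOLATIONS=0

# Extract citation keys from bibliography
BIB_KEYS=$(grep -E '^@[a-zA-Z]+\{' "$BIB_FILE" | \
           sed -E 's/@[a-zA-Z]+\{([^,]+),.*/\1/' | sort -u)

# Matches \cite{key} and \cite{key1,key2}
cite_re='\\cite\{([^}]+)\}'

for file in $STAGED_FILES; do
    line_num=0

    while IFS= read -r line; do
        line_num=$((line_num + 1))

        # Find \cite{...} patterns
        while [[ "$line" =~ $cite_re ]]; do
            match="${BASH_REMATCH[0]}"
            # Split comma-separated keys into an array
            IFS=', ' read -r -a cite_keys <<< "${BASH_REMATCH[1]}"

            for cite_key in "${cite_keys[@]}"; do
                # Check if key exists (exact, whole-line, fixed-string match)
                if ! grep -qxF -- "$cite_key" <<< "$BIB_KEYS"; then
                    echo "VIOLATION: Citation key not found: $cite_key"
                    echo "  File: $file:$line_num"
                    VIOLATIONS=$((VIOLATIONS + 1))
                fi
            done

            # Remove everything up to and including this match, find next
            line="${line#*"$match"}"
        done
    done < "$file"
done

# Exit 1 if any key is missing
[ "$VIOLATIONS" -eq 0 ] || exit 1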

Detection Examples

According to \cite{smith2023}...    # FAIL if not in bibliography.bib
See \cite{vaswani2017} for...       # OK if key exists in bibliography.bib
Multiple \cite{foo,bar}...          # Checks both keys
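
For a quick check of one document outside the hook, the same comparison can be written as a pipeline. This is a sketch of an alternative technique (grep -o plus tr), not the validator itself. The function names extract_cite_keys and missing_cite_keys are hypothetical, and unlike the validator this version reports keys only, without line numbers.

```bash
# Hypothetical helper: print one citation key per line for every \cite{...}
# in file $1, splitting comma-separated lists such as \cite{foo,bar}.
extract_cite_keys() {
    grep -oE '\\cite\{[^}]+\}' "$1" \
        | sed -E 's/^\\cite\{//; s/\}$//' \
        | tr ',' '\n' \
        | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//' \
        | sort -u
}

# Hypothetical helper: print the keys cited in $1 that are absent from the .bib file $2.
missing_cite_keys() {
    bib_keys=$(grep -E '^@[a-zA-Z]+\{' "$2" \
        | sed -E 's/@[a-zA-Z]+\{([^,]+),.*/\1/' | sort -u)
    extract_cite_keys "$1" | while IFS= read -r key; do
        # -x: whole line, -F: fixed string, so "foo" does not match "foobar"
        printf '%s\n' "$bib_keys" | grep -qxF -- "$key" || echo "$key"
    done
}
```

Calling missing_cite_keys notebooks/analysis.md references/bibliography.bib prints nothing when every cited key is defined.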

Validator 4: Clean Check

Prevents generated files from being committed:

#!/bin/bash
# clean-check.sh

STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM || true)
VIOLATIONS=0

PROHIBITED_PATTERNS=(
    "build/*"
    "*.aux"
    "*.fdb_latexmk"
    "*.fls"
    "*.log"
    "*/__pycache__/*"
    "notebooks/*/output/*"
)

for file in $STAGED_FILES; do
    # Specific check first, so an executed notebook is counted only once
    if [[ "$file" =~ _Master_executed\.md$ ]]; then
        echo "VIOLATION: Executed notebook staged: $file"
        VIOLATIONS=$((VIOLATIONS + 1))
        continue
    fi

    for pattern in "${PROHIBITED_PATTERNS[@]}"; do
        # Right-hand side is unquoted so it is matched as a glob
        if [[ "$file" == $pattern ]]; then
            echo "VIOLATION: Generated file staged: $file"
            VIOLATIONS=$((VIOLATIONS + 1))
            break
        fi
    done
done

# Exit 1 if any generated file is staged
[ "$VIOLATIONS" -eq 0 ] || exit 1

Detection Examples

notebooks/BrainChip_Master_executed.md  # FAIL - should be gitignored
build/output.pdf                        # FAIL - build artifact
src/__pycache__/module.pyc              # FAIL - Python cache
notebooks/output/figure1.pdf            # WARNING - may be intentional
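
The pattern list can also be expressed as a single case statement, which makes it easy to test individual paths. The sketch below is illustrative; is_generated_path is a hypothetical name, and the globs are the ones listed in the script above plus a top-level __pycache__/ entry added as an assumption.

```bash
# Hypothetical helper: return 0 when a path looks like a generated file.
# In case patterns, "*" also matches "/" characters.
is_generated_path() {
    case $1 in
        *_Master_executed.md)            return 0 ;;
        build/*)                         return 0 ;;
        *.aux|*.fdb_latexmk|*.fls|*.log) return 0 ;;
        __pycache__/*|*/__pycache__/*)   return 0 ;;
        notebooks/*/output/*)            return 0 ;;
    esac
    return 1
}
```

With these globs, notebooks/output/figure1.pdf is not matched, because notebooks/*/output/* expects a directory between notebooks/ and output/.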

Git Hook Integration

Install as pre-commit hook:

#!/bin/bash
# .git/hooks/pre-commit

exec ./.validation/scripts/run-all-validators.sh

Make executable:

chmod +x .git/hooks/pre-commit
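
Because .git/hooks is not tracked by git, every clone needs this step repeated. A small installer can make that repeatable; the sketch below is an assumption about how one might do it, and install_pre_commit_hook is a hypothetical name.

```bash
# Hypothetical installer: write the pre-commit hook into a clone and make it executable.
# $1 is the repository root (defaults to the current directory).
install_pre_commit_hook() {
    repo_root=${1:-.}
    hook="$repo_root/.git/hooks/pre-commit"
    mkdir -p "$(dirname "$hook")"
    # Quoted 'EOF' keeps the hook body literal (no expansion at install time).
    cat > "$hook" << 'EOF'
#!/bin/bash
exec ./.validation/scripts/run-all-validators.sh
EOF
    chmod +x "$hook"
}
```

An alternative is to keep hooks in a tracked directory and point git at it with git config core.hooksPath, which avoids copying files into .git/hooks.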

Every git commit now runs validation:

$ git add notebooks/analysis.md
$ git commit -m "Update analysis"

Running pre-commit validation pipeline...

Running Formatting validation... [PASS]
Running Structure validation... [PASS]
Running Citations validation... [PASS]
Running Clean check... [PASS]

All validation checks passed!
[main abc1234] Update analysis

Makefile Integration

Validators are exposed via Make targets:

.PHONY: verify verify-formatting verify-structure verify-citations verify-clean

verify:
    @./.validation/scripts/run-all-validators.sh

verify-formatting:
    @./.validation/scripts/formatting-validator.sh

verify-structure:
    @./.validation/scripts/structural-validator.sh

verify-citations:
    @./.validation/scripts/citation-validator.sh

verify-clean:
    @./.validation/scripts/clean-check.sh

Manual execution:

make verify              # All checks
make verify-formatting   # Just formatting
make verify-citations    # Just citations

Detailed Reports

Each validator writes a report to build/:

graph TD
    A["build/"] --> B["formatting-validation.txt"]
    A --> C["structural-validation.txt"]
    A --> D["citation-validation.txt"]
    A --> E["clean-check.txt"]
    A --> F["validation-summary.txt"]

Example report:

Formatting Validation Report
============================
Date: Sat Feb 22 16:00:00 MST 2026

Checking staged markdown files...

VIOLATION: LaTeX math in Python code (use Unicode)
  File: notebooks/analysis.md:42
  Line: sigma = $\sigma_{sat}$

============================
Total violations: 1
RESULT: FAIL

Fix violations according to guidelines:
  - Python code blocks: Use Unicode (σ_sat, ·, →)
  - Plot titles/labels: Use LaTeX ($\sigma_{sat}$)
  - Markdown text: Use LaTeX math mode
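
The post does not show how validation-summary.txt is produced. One plausible approach, sketched here under assumptions, is to collect the RESULT line from each individual report. The function name write_validation_summary and the one-line-per-report summary format are hypothetical.

```bash
# Hypothetical helper: build validation-summary.txt from the individual reports.
# $1 is the report directory (defaults to build). Returns 1 unless every report passed.
write_validation_summary() {
    build_dir=${1:-build}
    summary="$build_dir/validation-summary.txt"
    status=0
    mkdir -p "$build_dir"
    : > "$summary"
    for report in "$build_dir"/*-validation.txt "$build_dir"/clean-check.txt; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$report" ] || continue
        result=$(grep -E '^RESULT: ' "$report" | tail -n 1)
        printf '%s: %s\n' "$(basename "$report")" "${result:-RESULT: UNKNOWN}" >> "$summary"
        [ "$result" = "RESULT: PASS" ] || status=1
    done
    return "$status"
}
```

The summary then has one line per validator, which is enough to see at a glance which report to open.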

Bypassing Validation

Commits can proceed despite failures when necessary:

git commit --no-verify -m "WIP: experimental changes"

The orchestrator reminds users of this escape hatch:

Validation failed!
Review report: build/validation-summary.txt

To bypass (NOT recommended): git commit --no-verify

This option should be used sparingly. Violations should be fixed before merging to main.

Performance

Validators are fast because they:

  1. Only check staged files: Not the entire repository. The trailing || true matters under set -e, because grep exits 1 when nothing matches.

STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$' || true)

  2. Exit early: Skip checks when no relevant files exist

if [ -z "$STAGED_FILES" ]; then
    echo "No markdown files staged."
    exit 0
fi

  3. Run lightweight checks first: Formatting before structure before citations

  4. Stream processing: Line-by-line, without loading entire files
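
The validators consume the staged-file list with for file in $STAGED_FILES, which splits on whitespace and therefore breaks file names that contain spaces. A minimal sketch of a line-based alternative is shown below. The name for_each_file is hypothetical, and the approach still assumes that file names contain no newlines.

```bash
# Hypothetical helper: run the given command once per file name read from stdin,
# one name per line, so names containing spaces stay intact.
# Returns 1 if the command failed for any file.
for_each_file() {
    status=0
    while IFS= read -r file; do
        [ -n "$file" ] || continue
        # Redirect stdin so the command cannot swallow the remaining file names.
        "$@" "$file" < /dev/null || status=1
    done
    return "$status"
}
```

A validator could then be driven with git diff --cached --name-only --diff-filter=ACM | grep '\.md$' | for_each_file check_one_file, where check_one_file stands for whatever per-file check the validator implements.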

Adding New Validators

Follow the established pattern:

#!/bin/bash
# my-validator.sh

set -e

PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
REPORT_FILE="${PROJECT_ROOT}/build/my-validation.txt"

# Initialize report
mkdir -p "$(dirname "$REPORT_FILE")"
echo "My Validation Report" > "$REPORT_FILE"
echo "Date: $(date)" >> "$REPORT_FILE"

VIOLATIONS=0

# Get staged files
STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$' || true)

if [ -z "$STAGED_FILES" ]; then
    echo "No files to check." >> "$REPORT_FILE"
    exit 0
fi

# Validation logic here
for file in $STAGED_FILES; do
    # Check something...
    if [[ condition ]]; then
        echo "VIOLATION: description" >> "$REPORT_FILE"
        VIOLATIONS=$((VIOLATIONS + 1))
    fi
done

# Summary
echo "Total violations: $VIOLATIONS" >> "$REPORT_FILE"

if [ $VIOLATIONS -eq 0 ]; then
    echo "RESULT: PASS" >> "$REPORT_FILE"
    exit 0
else
    echo "RESULT: FAIL" >> "$REPORT_FILE"
    exit 1
fi

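Then register it in run-all-validators.sh. Add it to the VALIDATORS map, and also append its name to the list in the for loop; the loop iterates over an explicit list of names, so an entry that exists only in the map is never run: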

declare -A VALIDATORS=(
    ...
    ["My check"]="$VALIDATION_DIR/my-validator.sh"
)

CI/CD Integration

Run in GitHub Actions:

name: Validate

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run validation
        run: |
          # Stage all files for validation
          git add -A
          make verify

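Note: In a fresh CI checkout the index matches HEAD, so git add -A stages nothing and git diff --cached --name-only returns an empty list. Validators that only look at staged files will then pass without checking anything. For CI, modify the validators to check all tracked files (or the files changed relative to the base branch) instead of the staged set.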
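
One way to support both modes is to route file discovery through a single helper. The sketch below is an assumption, not part of the original scripts: list_markdown_targets and the VALIDATE_ALL environment variable are hypothetical names.

```bash
# Hypothetical helper: list the markdown files a validator should check.
# VALIDATE_ALL=1 selects CI mode; anything else selects pre-commit mode.
list_markdown_targets() {
    if [ "${VALIDATE_ALL:-0}" = "1" ]; then
        # CI mode: every tracked markdown file.
        git ls-files -- '*.md'
    else
        # Hook mode: only staged (added, copied, modified) markdown files.
        git diff --cached --name-only --diff-filter=ACM -- '*.md'
    fi
}
```

Each validator would then set STAGED_FILES=$(list_markdown_targets), and the workflow step would run VALIDATE_ALL=1 make verify instead of staging files.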

Common Fixes

Formatting Violations

# Before (violation)
result = $\alpha$ * coefficient

# After (fixed)
result = α * coefficient

Missing Includes

# Find the file
find notebooks -name "*introduction*"

# Fix the path
sed -i 's/!include intro.md/!include sections\/introduction.md/' file.md

Missing Citations

# Add to bibliography.bib
cat >> references/bibliography.bib << 'EOF'
@article{smith2023,
  title = {The Title},
  author = {Smith, John},
  journal = {Journal Name},
  year = {2023}
}
EOF

Staged Generated Files

# Unstage
git reset HEAD notebooks/*_executed.md

# Clean
make clean-notebooks

# Add the pattern to .gitignore so it is not staged again
echo "*_executed.md" >> .gitignore

Summary

Pre-commit validation catches errors at the cheapest possible moment: before they enter the repository. Key principles:

  • Fast: Only check staged files
  • Specific: Clear violation messages with file:line
  • Escapable: --no-verify for emergencies
  • Extensible: Simple pattern for new validators

The full implementation is included in the starter template on Gumroad.

This post is licensed under CC BY 4.0 by the author.