Post

A Pandoc Filter for Minted Syntax Highlighting in LaTeX

A Pandoc Filter for Minted Syntax Highlighting in LaTeX

Pandoc’s default LaTeX code output relies on verbatim environments or the listings package, both of which produce suboptimal syntax highlighting. This post presents a pandocfilters-based Python filter that intercepts code blocks during JSON AST processing and converts them to minted LaTeX output, leveraging Pygments for superior highlighting quality.

Problem Statement

When converting Markdown to LaTeX, Pandoc generates code blocks using one of several approaches:

Verbatim Environment: Plain monospace text without highlighting:

1
2
3
4
5
6
\begin{verbatim}
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
\end{verbatim}

Listings Package: Basic keyword highlighting with limited language support:

1
2
3
4
5
6
\begin{lstlisting}[language=Python]
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
\end{lstlisting}

Pandoc Highlighting: Internal highlighting converted to styled environments:

1
2
3
4
5
6
7
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{def} \FunctionTok{fibonacci}\NormalTok{(n):}
    \ControlFlowTok{if}\NormalTok{ n }\OperatorTok{<=} \DecValTok{1}\NormalTok{:}
        \ControlFlowTok{return}\NormalTok{ n}
\end{Highlighting}
\end{Shaded}

Each approach has limitations:

MethodHighlighting QualityLanguage SupportCustomization
VerbatimNoneN/AMinimal
ListingsBasic~80 languagesModerate
PandocModerate~140 languagesLimited
MintedExcellent500+ languagesExtensive

The minted package delegates highlighting to Pygments, a Python library that supports hundreds of programming languages with accurate tokenization and extensive styling options.

Technical Background

Pandoc Filter Architecture

Pandoc processes documents through an Abstract Syntax Tree (AST). The pipeline follows this structure:

graph LR
    A["Input Document"] --> B["Reader"]
    B --> C["AST"]
    C --> D["Filters"]
    D --> E["Writer"]
    E --> F["Output"]

Filters intercept the AST between parsing and writing, enabling transformations that are format-aware. Pandoc supports two filter types:

  1. Lua Filters: Native integration, fast execution, no external dependencies
  2. JSON Filters: Language-agnostic via stdin/stdout JSON AST exchange

JSON filters receive the entire document AST as JSON on stdin and emit the transformed AST on stdout. The pandocfilters Python library simplifies this exchange.

pandocfilters Library

The pandocfilters library provides utilities for walking and transforming the Pandoc AST:

1
2
3
4
5
6
7
8
9
10
11
from pandocfilters import toJSONFilter, RawBlock, RawInline

def my_filter(key, value, format, meta):
    # key: element type ("CodeBlock", "Code", "Para", etc.)
    # value: element contents (structure varies by type)
    # format: output format ("latex", "html", etc.)
    # meta: document metadata from YAML front matter
    pass

if __name__ == '__main__':
    toJSONFilter(my_filter)

The toJSONFilter function handles JSON parsing, AST walking, and output generation. Filter functions return transformed elements or None to leave elements unchanged.

Code Element Structure

Pandoc represents code elements with this structure:

CodeBlock (fenced code blocks):

1
2
3
4
5
6
7
{
  "t": "CodeBlock",
  "c": [
    ["identifier", ["class1", "class2"], [["attr1", "value1"]]],
    "code contents"
  ]
}

Code (inline code):

1
2
3
4
5
6
7
{
  "t": "Code",
  "c": [
    ["identifier", ["class1"], []],
    "inline code"
  ]
}

The first element is a triple: identifier, classes list, and key-value attributes. The second element contains the code text.

Implementation Walkthrough

The filter comprises three functions: unpacking code elements, extracting metadata settings, and the main filter function.

Code Element Unpacking

The unpack_code function extracts language, attributes, and contents from Pandoc’s code element structure:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def unpack_code(value, language):
    ''' Unpack the body and language of a pandoc code element.

    Args:
        value       contents of pandoc object
        language    default language
    '''
    [[_, classes, attributes], contents] = value

    if len(classes) > 0:
        language = classes[0]

    attributes = ', '.join('='.join(x) for x in attributes)

    return {'contents': contents, 'language': language,
            'attributes': attributes}

Key behaviors:

  • Language extraction: The first class becomes the language identifier
  • Attribute formatting: Key-value pairs convert to minted option syntax (key=value, key=value)
  • Default fallback: If no class is specified, the passed default language is used

Metadata Extraction

Document-level settings are configured via YAML front matter:

1
2
3
4
5
---
title: "Document Title"
pandoc-minted:
  language: python
---

The unpack_metadata function parses these settings:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
def unpack_metadata(meta):
    ''' Unpack the metadata to get pandoc-minted settings.

    Args:
        meta    document metadata
    '''
    settings = meta.get('pandoc-minted', {})
    if settings.get('t', '') == 'MetaMap':
        settings = settings['c']

        # Get language.
        language = settings.get('language', {})
        if language.get('t', '') == 'MetaInlines':
            language = language['c'][0]['c']
        else:
            language = None

        return {'language': language}

    else:
        # Return default settings.
        return {'language': 'text'}

Pandoc metadata arrives in a typed format where t indicates the type (MetaMap, MetaInlines, etc.) and c contains the content. The function navigates this structure to extract the default language setting.

Main Filter Function

The minted function performs the actual transformation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
def minted(key, value, format, meta):
    ''' Use minted for code in LaTeX.

    Args:
        key     type of pandoc object
        value   contents of pandoc object
        format  target output format
        meta    document metadata
    '''
    if format != 'latex':
        return

    # Determine what kind of code object this is.
    if key == 'CodeBlock':
        template = Template(
            '\\begin{minted}[$attributes]{$language}\n$contents\n\end{minted}'
        )
        Element = RawBlock
    elif key == 'Code':
        template = Template('\\mintinline[$attributes]{$language}{$contents}')
        Element = RawInline
    else:
        return

    settings = unpack_metadata(meta)

    code = unpack_code(value, settings['language'])

    return [Element(format, template.substitute(code))]

Control flow:

  1. Format check: Non-LaTeX output passes through unmodified
  2. Element type dispatch: Different templates for blocks vs. inline code
  3. Template selection: RawBlock for environments, RawInline for inline commands
  4. Settings merge: Document defaults combine with element-specific settings
  5. Template substitution: Python’s string.Template generates final LaTeX

Complete Filter Script

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#!/usr/bin/env python
''' A pandoc filter that has the LaTeX writer use minted for typesetting code.

Usage:
    pandoc --filter ./pandoc-minted.py -o myfile.tex myfile.md
'''

from string import Template
from pandocfilters import toJSONFilter, RawBlock, RawInline


def unpack_code(value, language):
    [[_, classes, attributes], contents] = value

    if len(classes) > 0:
        language = classes[0]

    attributes = ', '.join('='.join(x) for x in attributes)

    return {'contents': contents, 'language': language,
            'attributes': attributes}


def unpack_metadata(meta):
    settings = meta.get('pandoc-minted', {})
    if settings.get('t', '') == 'MetaMap':
        settings = settings['c']

        language = settings.get('language', {})
        if language.get('t', '') == 'MetaInlines':
            language = language['c'][0]['c']
        else:
            language = None

        return {'language': language}

    else:
        return {'language': 'text'}


def minted(key, value, format, meta):
    if format != 'latex':
        return

    if key == 'CodeBlock':
        template = Template(
            '\\begin{minted}[$attributes]{$language}\n$contents\n\end{minted}'
        )
        Element = RawBlock
    elif key == 'Code':
        template = Template('\\mintinline[$attributes]{$language}{$contents}')
        Element = RawInline
    else:
        return

    settings = unpack_metadata(meta)
    code = unpack_code(value, settings['language'])

    return [Element(format, template.substitute(code))]


if __name__ == '__main__':
    toJSONFilter(minted)

Minted Integration Requirements

Shell Escape

The minted package executes Pygments as an external process, requiring shell-escape mode:

1
2
3
pdflatex -shell-escape document.tex
# or
latexmk -pdf -shell-escape document.tex

Without shell-escape, minted fails with:

1
! Package minted Error: You must invoke LaTeX with the -shell-escape flag.

Pygments Installation

Pygments must be available in the system PATH:

1
2
3
4
pip install pygments

# Verify installation
pygmentize -V

LaTeX Package Setup

The document preamble must load minted:

1
2
3
4
5
6
\documentclass{article}
\usepackage{minted}

\begin{document}
% minted environments will be inserted here
\end{document}

Optional configuration in the preamble:

1
2
3
4
5
6
7
8
9
10
11
12
\usepackage{minted}

% Set default style
\usemintedstyle{monokai}

% Global options
\setminted{
    frame=lines,
    framesep=2mm,
    fontsize=\small,
    linenos=true
}

Metadata-Driven Configuration

Default Language

Documents with predominantly one language can set a default:

1
2
3
4
5
---
title: "Python Tutorial"
pandoc-minted:
  language: python
---

Code blocks without explicit language inherit this default:

1
2
3
4
5
```
# This block will use python highlighting
def greet(name):
    print(f"Hello, {name}!")
```

Override Per Block

Explicit language classes override the document default:

1
2
3
4
5
6
7
8
```bash
#!/bin/bash
echo "This uses bash highlighting"
```

```sql
SELECT * FROM users WHERE active = true;
```

Usage Examples

Basic Conversion

1
pandoc input.md --filter ./pandoc-minted.py -o output.tex

Input Markdown

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
---
title: "Code Examples"
pandoc-minted:
  language: python
---

## Fibonacci Implementation

```python
def fibonacci(n: int) -> int:
    """Calculate nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```

Inline code like `fibonacci(10)` also converts.

## Shell Commands

```bash
pip install pandocfilters pygments
```

Output LaTeX

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
\section{Fibonacci Implementation}

\begin{minted}[]{python}
def fibonacci(n: int) -> int:
    """Calculate nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
\end{minted}

Inline code like \mintinline[]{python}{fibonacci(10)} also converts.

\section{Shell Commands}

\begin{minted}[]{bash}
pip install pandocfilters pygments
\end{minted}

With Attributes

Fenced code blocks support attributes that pass through to minted:

1
2
3
4
```{.python linenos=true frame=lines}
def example():
    return "highlighted with options"
```

Output:

1
2
3
4
\begin{minted}[linenos=true, frame=lines]{python}
def example():
    return "highlighted with options"
\end{minted}

Comparison: Before and After

Without Filter (Pandoc default):

1
2
3
4
5
6
7
8
9
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{def}\NormalTok{ fibonacci(n: }\BuiltInTok{int}\NormalTok{) -> }\BuiltInTok{int}\NormalTok{:}
    \CommentTok{"""Calculate nth Fibonacci number."""}
    \ControlFlowTok{if}\NormalTok{ n <= }\DecValTok{1}\NormalTok{:}
        \ControlFlowTok{return}\NormalTok{ n}
    \ControlFlowTok{return}\NormalTok{ fibonacci(n }\OperatorTok{-} \DecValTok{1}\NormalTok{) + fibonacci(n }\OperatorTok{-} \DecValTok{2}\NormalTok{)}
\end{Highlighting}
\end{Shaded}

With Filter (minted output):

1
2
3
4
5
6
7
\begin{minted}[]{python}
def fibonacci(n: int) -> int:
    """Calculate nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
\end{minted}

The minted output is cleaner, more maintainable, and produces superior PDF rendering through Pygments.

Integration with Docker LaTeX Builds

The filter integrates with containerized LaTeX environments as described in Docker-Based LaTeX Compilation. The container must include:

  1. Python virtual environment with pandocfilters and pygments
  2. Filter script accessible in the container
  3. Shell-escape enabled in latexmk configuration

Makefile Integration

1
2
3
4
5
6
7
8
9
10
11
DOCKER_IMAGE := latex-env
DOCKER_RUN := docker run --rm -v $(PWD):/app -u $(shell id -u):$(shell id -g)
FILTER := /app/filters/pandoc-minted.py

%.tex: %.md
	$(DOCKER_RUN) $(DOCKER_IMAGE) \
	    pandoc --filter $(FILTER) -o $@ $<

%.pdf: %.tex
	$(DOCKER_RUN) $(DOCKER_IMAGE) \
	    latexmk -pdf -shell-escape $<

Pipeline Script

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#!/bin/bash
# build-document.sh

INPUT="$1"
OUTPUT="${INPUT%.md}.pdf"

# Generate LaTeX with minted filter
pandoc "$INPUT" \
    --filter ./filters/pandoc-minted.py \
    --template=template.latex \
    -o "${INPUT%.md}.tex"

# Compile with shell-escape
latexmk -pdf -shell-escape "${INPUT%.md}.tex"

# Cleanup
latexmk -c

Debugging

Inspect AST Structure

Examine what Pandoc parses before filtering:

1
pandoc input.md -t json | python -m json.tool | head -50

Filter Logging

Add debug output to stderr (stdout is reserved for AST output):

1
2
3
4
5
import sys

def minted(key, value, format, meta):
    sys.stderr.write(f"Processing: {key}\n")
    # ... rest of function

Common Issues

Pygments not found: Ensure pygmentize is in PATH. In Docker, verify the virtual environment is activated or PATH is configured.

Shell-escape disabled: The error message explicitly states the requirement. Add -shell-escape to the LaTeX compilation command.

Empty attributes bracket: The filter generates [] even when no attributes exist. This is valid minted syntax but can be removed with a conditional:

1
2
3
4
if code['attributes']:
    attr_str = f"[{code['attributes']}]"
else:
    attr_str = ""

Conclusion

The pandoc-minted filter bridges Pandoc’s Markdown processing with minted’s superior syntax highlighting. The implementation demonstrates pandocfilters patterns applicable to other filter development:

  • Format-conditional processing via the format parameter
  • Metadata extraction for document-level configuration
  • Template-based output generation for maintainable code
  • Proper element type handling with RawBlock and RawInline

Combined with containerized LaTeX builds, this filter enables reproducible, high-quality technical document generation from Markdown sources.

References

This post is licensed under CC BY 4.0 by the author.