SEU-TUI: A Curses-Based Terminal Interface for GPU Single Event Upset Monitoring

Single Event Upsets (SEUs) pose significant reliability challenges for GPU-based computing systems operating in high-radiation environments. This post describes the design and implementation of SEU-TUI, a terminal-based monitoring application that provides real-time visualization of GPU error correction data during cache march testing experiments.

Problem Statement

Graphics Processing Units (GPUs) contain billions of transistors in their memory hierarchies and computational units. When deployed in environments with elevated radiation levels—such as high-altitude aircraft, spacecraft, or particle accelerator facilities—these transistors become susceptible to Single Event Effects (SEEs). A charged particle passing through a transistor junction can deposit sufficient charge to flip a stored bit, resulting in a Single Event Upset.

Modern NVIDIA GPUs incorporate Error Correction Code (ECC) memory and SRAM protection mechanisms that detect and, in many cases, correct these bit flips. However, monitoring these corrections during testing requires:

  1. Real-time access to ECC counter data from nvidia-smi
  2. Correlation with application-level error detection from cache march tests
  3. Per-Streaming Multiprocessor (SM) breakdown of detected upsets
  4. Historical tracking of volatile versus aggregate error counts

Existing tools provide either raw command-line output or web-based dashboards unsuitable for headless systems in shielded test facilities. A lightweight terminal interface addresses this gap by providing comprehensive monitoring without external dependencies.

Technical Background

Single Event Upsets

A Single Event Upset occurs when ionizing radiation deposits enough charge in a semiconductor device to flip a stored bit. The same mechanism can also produce a short-lived glitch in combinational logic. The test application tracks both effects:

  • Single Event Upset (SEU): A persistent bit flip in a memory element (latch, SRAM cell, register)
  • Single Event Transient (SET): A temporary voltage glitch in combinational logic that may propagate to storage elements

GPU architectures are particularly susceptible due to their high transistor density and large cache hierarchies. Modern GPUs contain:

  • L1 caches per SM (typically 128KB)
  • Shared L2 cache (several MB)
  • Register files per SM
  • Texture and constant caches

Cache March Testing

Cache march tests are algorithmic sequences designed to detect memory faults. A march element consists of a sequence of read and write operations applied to each address in memory. The basic march test pattern involves:

{⇑(w0); ⇑(r0,w1); ⇑(r1,w0); ⇓(r0,w1); ⇓(r1,w0); ⇑(r0)}

Where ⇑ indicates ascending address order, ⇓ indicates descending order, and each operation reads (r) or writes (w) the value 0 or 1.
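To make the notation concrete, here is a minimal sketch of the six-element sequence above run against a plain Python list standing in for memory. The function name `run_march` and the `upsets` parameter (which flips a bit just before a given element starts, imitating a radiation-induced upset) are illustrative assumptions and are not part of SEU-TUI or the test application.

```python
def run_march(memory, upsets=None):
    """Run the six-element march sequence over a list of bits.

    `upsets` is a hypothetical fault-injection hook: a dict mapping an
    element index to an address whose bit is flipped before that element runs.
    Returns a list of (element_index, address, expected, observed) mismatches.
    """
    upsets = upsets or {}
    n = len(memory)
    up = range(n)
    down = range(n - 1, -1, -1)
    # Each element: (address order, expected read value, value to write)
    elements = [
        (up, None, 0),    # ⇑(w0)
        (up, 0, 1),       # ⇑(r0,w1)
        (up, 1, 0),       # ⇑(r1,w0)
        (down, 0, 1),     # ⇓(r0,w1)
        (down, 1, 0),     # ⇓(r1,w0)
        (up, 0, None),    # ⇑(r0)
    ]
    errors = []
    for index, (order, expected, write) in enumerate(elements):
        if index in upsets:
            memory[upsets[index]] ^= 1  # Simulated bit flip
        for address in order:
            if expected is not None:
                observed = memory[address]
                if observed != expected:
                    errors.append((index, address, expected, observed))
            if write is not None:
                memory[address] = write
    return errors


if __name__ == "__main__":
    print(run_march([0] * 8))                 # Healthy memory: no mismatches
    print(run_march([0] * 8, upsets={3: 5}))  # One flip before the fourth element
```

A flip is only caught if it happens between a write and the next read of the same address, which is why the sequence alternates read and write passes in both address orders.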

When applied to GPU memory during radiation exposure, march tests detect upsets by comparing read values against expected patterns. The test application tracks:

  • Total test iterations (loop counter)
  • Tests containing errors
  • Total SEU count across all SMs
  • Total SET count across all SMs
  • Per-SM upset breakdown

ECC Error Counters

NVIDIA GPUs expose ECC statistics through the nvidia-smi utility. The relevant counters include:

| Counter Type | Description |
|---|---|
| SRAM Correctable | Single-bit errors corrected by ECC |
| SRAM Uncorrectable Parity | Errors detected by parity protection, which can detect but not correct them |
| SRAM Uncorrectable SEC-DED | Double-bit errors detected by SEC-DED codes |
| DRAM Correctable | HBM/GDDR single-bit corrections |
| DRAM Uncorrectable | HBM/GDDR multi-bit errors |

Counters are reported in two categories:

  • Volatile: Errors since last driver reload
  • Aggregate: Lifetime error count (persistent across reboots on some GPUs)
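Because volatile counters restart from zero when the driver is reloaded, a monitoring tool that wants "new errors since the last poll" has to handle the counter going backwards. The following is a rough sketch of that bookkeeping, not SEU-TUI's actual logic; `as_count` and `counter_delta` are hypothetical helper names, and the counter names in the usage lines are only illustrative.

```python
def as_count(text):
    """Convert a counter string to an int, treating 'N/A', None, and blanks as zero."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return 0


def counter_delta(previous, current):
    """Return the per-counter increase between two {name: value} snapshots."""
    delta = {}
    for name, raw in current.items():
        now = as_count(raw)
        before = as_count(previous.get(name))
        # If the counter went down, assume it was reset and count from zero
        delta[name] = now - before if now >= before else now
    return delta


if __name__ == "__main__":
    earlier = {"SRAM Correctable": "45", "DRAM Correctable": "12"}
    later = {"SRAM Correctable": "48", "DRAM Correctable": "12"}
    print(counter_delta(earlier, later))
```

Treating a decrease as a reset slightly undercounts if errors occurred right before the reload, which is usually acceptable for a live display.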

Architecture Overview

SEU-TUI employs a modular architecture separating data acquisition, parsing, and presentation concerns.

graph LR
    A["nvidia-smi subprocess"] -->|Text output| B["parseable.py<br/>nvidia-smi Output Parser"]
    C["YAML Tailer<br/>SEU data"] -->|SEU data| D["data.py<br/>Queue-based Data Flow"]
    E["input_handler.py<br/>Keyboard Input"] --> D
    B -->|Parsed GPU data| D
    D -->|Display updates| F["ui.py<br/>Curses Renderer"]
    F -->|Terminal output| G["Terminal Display"]

    style A fill:#f0f0f0
    style B fill:#e3f2fd
    style C fill:#f0f0f0
    style D fill:#e3f2fd
    style E fill:#f0f0f0
    style F fill:#e3f2fd
    style G fill:#f5f5f5

Component Responsibilities

parseable.py: Executes nvidia-smi -q and parses the hierarchical output into a structured dictionary. The parser handles nvidia-smi’s indentation-based format by maintaining a stack of parent dictionaries:

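def parse_output(lines):
    """Parse indentation-based nvidia-smi -q text into nested dictionaries."""
    # Stack of (indent, dict) pairs; the sentinel at indent -1 is the root
    stack = [(-1, {})]
    for line in lines:
        text = line.strip()
        if not text:
            continue  # Skip blank lines so they do not become empty sections
        indent = len(line) - len(line.lstrip(" "))

        # Pop back to the nearest enclosing section for this indent level
        while len(stack) > 1 and indent <= stack[-1][0]:
            stack.pop()
        parent = stack[-1][1]

        if ": " in text:
            # Key-value pair; keys are padded with spaces before the colon
            key, value = text.split(": ", 1)
            parent[key.strip()] = value.strip()
        else:
            # Section heading: open a nested dictionary
            new_dict = {}
            parent[text] = new_dict
            stack.append((indent, new_dict))

    return stack[0][1]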
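Once the output is a nested dictionary, the display code needs a way to reach deeply nested values without raising on a missing section. One possible helper is sketched below; `lookup` is a hypothetical name, and the section names in `sample` are illustrative rather than a guaranteed match for any particular driver version.

```python
def lookup(tree, path, default=None):
    """Follow a sequence of section names through nested dictionaries.

    Returns `default` as soon as any step along the path is missing.
    """
    node = tree
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Illustrative structure only; real section names depend on the GPU and driver
sample = {
    "GPU 00000000:01:00.0": {
        "ECC Errors": {
            "Volatile": {"SRAM Correctable": "45"},
        },
    },
}

if __name__ == "__main__":
    path = ["GPU 00000000:01:00.0", "ECC Errors", "Volatile", "SRAM Correctable"]
    print(lookup(sample, path, default="N/A"))
```

Returning a default instead of raising keeps the renderer alive when a GPU does not report a given section.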

data.py: Provides thread-safe data management through a Queue and Event mechanism. Data sources include:

  • Standard input (for piped nvidia-smi output)
  • YAML file tailing (for cache march test results)

The YAML tailer parses per-SM upset data from the test application’s output format:

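def parse_values_line(line, columns):
    """Map one comma-separated values line onto per-SM upset counts."""
    # Extract values from comma-separated line
    values = [v.strip() for v in line.split(",")]

    def column(name):
        # Look up a value by column name and convert it to an integer
        return int(values[columns.index(name)])

    # Infer the SM count from the sm<N>_seu column names
    sm_count = sum(1 for c in columns if c.startswith("sm") and c.endswith("_seu"))

    # Map to column definitions
    sm_data = {}
    for i in range(sm_count):
        sm_data[f'sm{i}'] = {
            'seu': column(f'sm{i}_seu'),
            'set': column(f'sm{i}_set')
        }

    # Total derived from the per-SM counts; other summary fields are omitted here
    total = sum(sm['seu'] for sm in sm_data.values())
    return {'sm_data': sm_data, 'tot_seu': total}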
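The tailing half of the YAML tailer can be approximated with a small polling generator, similar in spirit to `tail -f`. This is a sketch under assumptions: `follow` and `poll_interval` are hypothetical names, the file is read from its beginning rather than its end, and log rotation is not handled.

```python
import threading
import time


def follow(path, exit_event, poll_interval=0.1):
    """Yield complete lines appended to `path` until `exit_event` is set."""
    with open(path, "r") as handle:
        partial = ""
        while not exit_event.is_set():
            chunk = handle.readline()
            if not chunk:
                # Nothing new yet; wait briefly before polling again
                time.sleep(poll_interval)
                continue
            partial += chunk
            if partial.endswith("\n"):
                # Only hand back lines the writer has finished
                yield partial.rstrip("\n")
                partial = ""


if __name__ == "__main__":
    stop = threading.Event()
    for line in follow("seu_results.yml", stop):
        print(line)
```

Buffering until a newline arrives avoids parsing a row that the test application has only partially written.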

ui.py: Implements the curses-based rendering engine with:

  • Dynamic terminal size handling
  • Color-coded error highlighting
  • Box-drawing characters for visual organization
  • Screenshot functionality for logging

input_handler.py: Manages keyboard input in a separate thread, handling:

  • q: Application exit
  • h: Help overlay toggle
  • s: Screenshot capture
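The key bindings above amount to a small dispatch function. The sketch below keeps it free of curses calls so it can be exercised without a terminal; `handle_key`, the `state` dictionary, and its field names are assumptions for illustration, not the project's actual structures.

```python
import threading


def handle_key(key, state, exit_event):
    """Apply one keypress (an integer key code) to the UI state.

    Returns True if the key was recognized, False otherwise.
    """
    if key == ord("q"):
        exit_event.set()                      # Ask every thread to stop
    elif key == ord("h"):
        state["show_help"] = not state.get("show_help", False)
    elif key == ord("s"):
        state["screenshot_requested"] = True  # The renderer picks this up
    else:
        return False
    return True


if __name__ == "__main__":
    stop = threading.Event()
    ui_state = {}
    handle_key(ord("h"), ui_state, stop)
    print(ui_state, stop.is_set())
```

In a curses program the `key` argument would typically come from `stdscr.getch()`, which returns -1 when no key is pending in non-blocking mode.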

TUI Display Layout

The interface presents GPU and SEU data in a dual-panel layout optimized for 80x24 terminal dimensions.

┌──────────────────────────────────────────────────────────────────────────────┐
│                          CUDA Cache March Test                               │
├────────────────────────────────────┬─────────────────────────────────────────┤
│┌──────────────SEU Data─────────────┐│┌───────────GPU Info────────────────────┐│
││ Loop Counter: 1547                ││ │ Timestamp: 2026-09-27 14:32:01       ││
││ Tests With Errors: 3              ││ │ Memory: 24576 MiB / 40960 MiB        ││
││ Total SEU: 12     Total SET: 0    ││ │ Reset Required: No                   ││
│└───────────────────────────────────┘│└───────────────────────────────────────┘│
│┌───────────────SM Data─────────────┐│┌───────────ECC Errors──────────────────┐│
││ SM00: S:0 T:0   SM08: S:2 T:0     ││ │ Volatile          Aggregate          ││
││ SM01: S:0 T:0   SM09: S:0 T:0     ││ │ SRAM Corr: 45     SRAM Corr: 1203    ││
││ SM02: S:1 T:0   SM10: S:0 T:0     ││ │ SRAM Uncorr: 0    SRAM Uncorr: 0     ││
││ SM03: S:0 T:0   SM11: S:3 T:0     ││ │ DRAM Corr: 12     DRAM Corr: 892     ││
││ SM04: S:0 T:0   SM12: S:0 T:0     ││ │ DRAM Uncorr: 0    DRAM Uncorr: 0     ││
││ SM05: S:4 T:0   SM13: S:0 T:0     ││ │                                      ││
││ SM06: S:0 T:0   SM14: S:2 T:0     ││ │ SRAM Sources                         ││
││ SM07: S:0 T:0   SM15: S:0 T:0     ││ │ L2: 3  SM: 42  PCIE: 0  Other: 0     ││
│└───────────────────────────────────┘│└───────────────────────────────────────┘│
│┌──────────GPU Performance──────────┐│┌──────────Retired Pages────────────────┐│
││ PCI Replays: 0                    ││ │ Single Bit: 0                        ││
││ GPU Util: 98% | Mem Util: 45%     ││ │ Double Bit: 0                        ││
││ GPU Clock: 1410 MHz               ││ │ Pending: No                          ││
││ Power Draw: 250W / 300W           ││ │                                      ││
││ Temp: 72 C (Max: 83 C)            ││ │                                      ││
│└───────────────────────────────────┘│└───────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────────┘
                           Press 'h' for help | 'q' to quit | 's' to screenshot

Panel Descriptions

SEU Data Panel: Displays aggregate statistics from the cache march test:

  • Loop counter: Total test iterations completed
  • Tests with errors: Number of iterations detecting at least one upset
  • Total SEU/SET: Cumulative single-event upset and transient counts

SM Data Panel: Per-Streaming Multiprocessor breakdown showing SEU (S:) and SET (T:) counts. Non-zero values receive highlight coloring for rapid identification.
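As a rough illustration of how the SM grid could be laid out for a given column count, the sketch below fills the grid column by column, matching the mockup where SM00 to SM07 occupy the first column. `format_sm_grid` is a hypothetical helper, and it assumes SM keys of the form `sm<N>` with `seu` and `set` fields; highlighting is left to the caller.

```python
import math


def format_sm_grid(sm_data, cols):
    """Lay out per-SM counts column-major, returning one string per display row."""
    # Sort numerically so that 'sm10' comes after 'sm9'
    names = sorted(sm_data, key=lambda name: int(name[2:]))
    rows = math.ceil(len(names) / cols) if names else 0
    lines = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            i = c * rows + r  # Column-major index
            if i < len(names):
                entry = sm_data[names[i]]
                number = int(names[i][2:])
                cells.append(f"SM{number:02d}: S:{entry['seu']} T:{entry['set']}")
        lines.append("   ".join(cells))
    return lines


if __name__ == "__main__":
    data = {f"sm{i}": {"seu": 0, "set": 0} for i in range(16)}
    data["sm5"]["seu"] = 4
    for line in format_sm_grid(data, cols=2):
        print(line)
```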

GPU Info Panel: System-level GPU status including memory utilization and reset status flags.

ECC Errors Panel: Volatile and aggregate error counts from nvidia-smi, with SRAM source breakdown (L2 cache, SM caches, PCIe, other).

GPU Performance Panel: Real-time performance metrics including utilization, clocks, power, and temperature.

Retired Pages Panel: Page retirement status indicating permanent hardware failures.

Implementation Details

Curses Initialization and Color Handling

The curses library requires explicit initialization and cleanup. The application uses curses.wrapper() to ensure proper terminal restoration:

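import curses
import time

def curses_main(smi_file, seu_file, cols, exit_event):
    def main(stdscr):
        # use_default_colors() is meant to follow start_color()
        curses.start_color()
        curses.use_default_colors()
        curses.curs_set(0)  # Hide cursor

        # Initialize color pairs (-1 selects the terminal's default background)
        curses.init_pair(1, curses.COLOR_BLUE, -1)    # Headers
        curses.init_pair(2, curses.COLOR_GREEN, -1)   # Highlights
        curses.init_pair(3, curses.COLOR_YELLOW, -1)  # Warnings
        curses.init_pair(4, curses.COLOR_RED, -1)     # Errors
        curses.init_pair(5, curses.COLOR_CYAN, -1)    # Titles

        # Latest snapshots; the full render loop refreshes these from the data queue
        smi_data, seu_data = {}, {}

        # Main render loop
        while not exit_event.is_set():
            _, max_x = stdscr.getmaxyx()  # Re-read the width so resizes are picked up
            draw_dashboard(stdscr, smi_data, seu_data, max_x, cols)
            stdscr.refresh()
            time.sleep(0.05)

    curses.wrapper(main)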

The -1 background value, made available by use_default_colors(), selects the terminal's default background color, which preserves transparency on terminals that support it.

Safe String Rendering

Terminal boundaries require careful handling. The safe_addstr() function wraps curses operations with boundary checks and maintains a screen buffer for screenshots:

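import curses

screen_buffer = []  # Plain-text mirror of the screen, one string per row

def safe_addstr(stdscr, y, x, text, attr=0):
    """Write text clipped to the window; returns True if it was drawn."""
    try:
        max_y, max_x = stdscr.getmaxyx()
        if 0 <= y < max_y and 0 <= x < max_x:
            display_text = text[:max_x - x]  # Clip to the right edge
            stdscr.addstr(y, x, display_text, attr)

            # Update screen buffer for screenshots
            if len(screen_buffer) <= y:
                screen_buffer.extend([''] * (y - len(screen_buffer) + 1))
            row = screen_buffer[y].ljust(x)  # Pad so the text lands at column x
            # Overwrite in place, keeping anything already drawn to the right
            screen_buffer[y] = row[:x] + display_text + row[x + len(display_text):]

            return True
    except curses.error:
        # Raised, for example, when writing into the bottom-right cell
        pass
    return False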
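With a plain-text mirror of the screen available, taking a screenshot reduces to writing that buffer to a file. The sketch below shows one way this could look; `save_screenshot`, the file naming scheme, and the `now` parameter (included to make the function testable) are assumptions rather than the project's actual implementation.

```python
import os
from datetime import datetime


def save_screenshot(screen_buffer, directory=".", now=None):
    """Write the buffered screen rows to a timestamped text file.

    Returns the path of the file that was written.
    """
    now = now or datetime.now()
    name = now.strftime("screenshot_%Y%m%d_%H%M%S.txt")
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        for row in screen_buffer:
            # Trailing padding is not useful in a log file
            handle.write(row.rstrip() + "\n")
    return path


if __name__ == "__main__":
    print(save_screenshot(["┌──────┐", "│ demo │", "└──────┘"]))
```

Writing UTF-8 explicitly matters here because the buffer contains box-drawing characters.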

Box Drawing with Unicode Characters

Visual containers use Unicode box-drawing characters for improved aesthetics:

def draw_box_around(stdscr, start_y, start_x, height, width, title=None, attr=0):
    # Top border with corners
    safe_addstr(stdscr, start_y, start_x,
                "┌" + "─" * (width - 2) + "┐", attr)

    # Title insertion
    if title:
        title_pos = start_x + (width - len(title) - 4) // 2
        safe_addstr(stdscr, start_y, title_pos,
                    "┤ " + title + " ├", attr | curses.A_BOLD)

    # Vertical borders
    for i in range(1, height - 1):
        safe_addstr(stdscr, start_y + i, start_x, "│", attr)
        safe_addstr(stdscr, start_y + i, start_x + width - 1, "│", attr)

    # Bottom border
    safe_addstr(stdscr, start_y + height - 1, start_x,
                "└" + "─" * (width - 2) + "┘", attr)

Data Update Mechanism

The application uses a producer-consumer pattern with thread-safe queues. Data threads (stdin reader, YAML tailer) push updates to a shared queue:

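import json
import sys

# Producer (data.py)
def stdin_reader(queue, exit_event):
    while not exit_event.is_set():
        line = sys.stdin.readline()
        if not line:
            break  # EOF: the upstream pipe was closed
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip blank or partially written lines
        queue.put({'smi_data': data})
        data_event.set()  # Module-level Event that wakes the render loop

# Consumer (ui.py - render loop)
while not data_queue.empty():
    data = data_queue.get_nowait()
    if 'smi_data' in data:
        current_smi_data = data['smi_data']
    elif 'seu_data' in data:
        current_seu_data = data['seu_data']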

The data_event signals the render loop to refresh, reducing CPU usage during idle periods.
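One way to let the event drive redraws, instead of redrawing on a fixed timer, is to block on `Event.wait()` with a timeout and drain the queue whenever it fires. The sketch below is an illustration under assumptions: `drain_updates`, `render_loop`, `redraw`, and `idle_timeout` are hypothetical names, and the real render loop may be organized differently.

```python
import queue
import threading


def drain_updates(data_queue, state):
    """Move every pending update from the queue into `state`.

    Returns True if at least one update was applied.
    """
    changed = False
    while True:
        try:
            update = data_queue.get_nowait()
        except queue.Empty:
            return changed
        state.update(update)
        changed = True


def render_loop(data_queue, data_event, exit_event, redraw, idle_timeout=0.5):
    """Redraw only when a producer has delivered new data."""
    state = {}
    while not exit_event.is_set():
        # Sleep until a producer signals, or wake periodically to check for exit
        data_event.wait(timeout=idle_timeout)
        # Clear before draining so a signal that arrives mid-drain is not lost
        data_event.clear()
        if drain_updates(data_queue, state):
            redraw(state)
```

Catching `queue.Empty` around `get_nowait()` is slightly more robust than checking `empty()` first, since the queue can change between the two calls when several threads are involved.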

Signal Handling and Cleanup

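Handling SIGINT and SIGTERM explicitly ensures the terminal is restored when the process is interrupted or asked to terminate. SIGPIPE is ignored so that a closed pipe does not kill the process: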

def signal_handler(signum, _frame):
    logger.info(f"Received signal {signum}, shutting down")
    exit_event.set()
    cleanup()
    time.sleep(0.5)
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

The cleanup function restores terminal settings:

def cleanup():
    sys.stdout.flush()
    sys.stderr.flush()
    if os.isatty(sys.stdin.fileno()):
        os.system('stty sane')

Usage

The application supports two data input modes:

Piped nvidia-smi output (continuous monitoring):

./parseable.py | python3 -m src.main \
    --smi-file /dev/stdin \
    --seu-file /path/to/test_results.yml

Tailed log files (post-processing recorded data):

tail -f ecc_log.json | python3 -m src.main \
    --smi-file /dev/stdin \
    --seu-file seu_results.yml \
    --cols 8

Command-line options:

| Option | Description | Default |
|---|---|---|
| --smi-file | Path to SMI JSON or - for stdin | nvidia_data_latest.json |
| --seu-file | Path to SEU YAML file | Required |
| --cols | Columns for SM data display | 10 |
| --log-level | Logging verbosity | INFO |
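For readers who want to reproduce this interface, the options in the table map directly onto `argparse`. The sketch below mirrors the table's names and defaults; the `build_parser` function name, the `prog` string, and the set of accepted log levels are assumptions, not taken from the project source.

```python
import argparse


def build_parser():
    """Build a command-line parser matching the options table."""
    parser = argparse.ArgumentParser(
        prog="seu-tui",
        description="Terminal monitor for GPU ECC and SEU data",
    )
    parser.add_argument("--smi-file", default="nvidia_data_latest.json",
                        help="Path to SMI JSON, or - for stdin")
    parser.add_argument("--seu-file", required=True,
                        help="Path to SEU YAML file")
    parser.add_argument("--cols", type=int, default=10,
                        help="Columns for SM data display")
    # The accepted level names are an assumption based on Python's logging module
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Logging verbosity")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```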

Conclusion

SEU-TUI demonstrates the continued relevance of curses-based interfaces for specialized monitoring applications. The modular architecture separates concerns effectively, allowing independent development of data parsers, display components, and input handling.

Key design decisions include:

  • Thread-safe queue-based data flow prevents race conditions
  • Unicode box drawing provides visual organization without external dependencies
  • Color highlighting enables rapid identification of anomalous values
  • Screenshot functionality supports documentation and post-analysis

The application addresses a specific need in radiation effects testing: lightweight, real-time monitoring of GPU reliability metrics in environments where graphical interfaces are impractical. The patterns demonstrated—curses safe rendering, signal-safe cleanup, producer-consumer data flow—apply broadly to terminal monitoring applications.

Source code: seu-tui

This post is licensed under CC BY 4.0 by the author.