PDF to Knowledge Graph (Part 1): PDF Extraction with MinerU
All posts in this series
- PDF to Knowledge Graph (Part 0): From PDFs to Knowledge Graphs
- PDF to Knowledge Graph (Part 1): PDF Extraction with MinerU
- PDF to Knowledge Graph (Part 2): Structured LLM Extraction with Instructor
- PDF to Knowledge Graph (Part 3): Building Knowledge Graphs with Kuzu
- PDF to Knowledge Graph (Part 4): Automated PDF Pipeline with Watchdog
- PDF to Knowledge Graph (Part 5): Knowledge Graph Visualization with vis.js
- PDF to Knowledge Graph (Part 6): RAG with Knowledge Graphs
Part 1 of the PDF to Knowledge Graph series.
PDF extraction is deceptively difficult. Standard libraries produce reasonable results on simple documents but fail catastrophically on technical papers with multi-column layouts, embedded equations, and complex tables. This post presents MinerU, a deep learning-based solution that preserves document structure.
Problem Statement
Consider a typical research paper:
- Two-column layout with figures spanning columns
- LaTeX equations in inline and display mode
- Tables with merged cells and nested headers
- Footnotes, citations, and cross-references
- Headers and footers to be ignored
Standard extraction libraries (PyPDF2, pdfplumber, pypdf) treat the page as a linear stream, producing:
The transformer architecture [1] revolu- We propose a modification to the
tionized NLP through self-attention. attention mechanism that reduces...
The two columns come out interleaved. An equation such as $x = y^2 + z$ flattens to x = y 2 + z. Tables collapse into word salad. The text is technically “extracted” but unusable for downstream processing.
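The interleaving is easy to reproduce without any PDF library. The sketch below is a toy model, not how any of the libraries above is implemented: it assumes the page has already been split into two lists of column lines, and the names `left`, `right`, `read_row_wise`, and `read_column_wise` are illustrative only.

```python
from itertools import zip_longest

# Toy page: each list holds the lines of one column, top to bottom.
left = [
    "The transformer architecture [1] revolu-",
    "tionized NLP through self-attention.",
]
right = [
    "We propose a modification to the",
    "attention mechanism that reduces...",
]


def read_row_wise(left: list[str], right: list[str]) -> list[str]:
    """Read straight across the page, as a layout-unaware extractor does."""
    rows = zip_longest(left, right, fillvalue="")
    return [f"{l} {r}".strip() for l, r in rows]


def read_column_wise(left: list[str], right: list[str]) -> list[str]:
    """Read the whole left column, then the whole right column."""
    return left + right


if __name__ == "__main__":
    print("\n".join(read_row_wise(left, right)))
    print("---")
    print("\n".join(read_column_wise(left, right)))
```

Row-wise reading reproduces the garbled lines shown above, while column-wise reading restores the sentence order. Deciding where the columns are in the first place is the part that needs layout detection.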
MinerU: Layout-Aware Extraction
MinerU (distributed as the magic-pdf package in its 1.x releases) uses deep learning models to:
- Detect layout regions (text blocks, figures, tables, equations)
- Determine reading order across columns and pages
- Extract equations as LaTeX notation
- Convert tables to Markdown format
- Preserve hierarchy (headings, lists, paragraphs)
The result is clean Markdown suitable for LLM processing.
Installation
Via pip
pip install mineru
Via conda (recommended for complex dependencies)
conda create -n mineru python=3.10
conda activate mineru
conda install -c conda-forge mineru
Verification
mineru --help
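MinerU downloads model weights on first run (about 2 GB), so make sure that much disk space is free. The --help output is also the place to confirm which options the installed release accepts, since these vary between releases.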
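Both checks can be scripted before a long run. This is a minimal sketch using only the standard library; `find_mineru_command` and `has_free_space` are hypothetical helper names, and the `path` argument is a placeholder because the directory where the weights are cached depends on the local setup.

```python
import shutil
from typing import Optional


def find_mineru_command(candidates=("mineru", "magic-pdf")) -> Optional[str]:
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def has_free_space(path: str = ".", required_gb: float = 2.0) -> bool:
    """Check that the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb


if __name__ == "__main__":
    print("CLI:", find_mineru_command() or "not found")
    # "." is an assumption; point this at the actual model cache location
    print("Enough disk space:", has_free_space("."))
```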
Basic Usage
Command Line
# Convert a single PDF
mineru -p paper.pdf -o ./output -m auto
# The output structure:
# output/
#   paper/
#     auto/
#       paper.md    # Markdown output
#       images/     # Extracted figures
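The output layout can be turned into a small path helper, which is handy when a script needs to locate the Markdown file afterwards. This is a sketch: `expected_markdown_path` is a hypothetical name, and it assumes the method folder always matches the `-m` value, which the listing above only confirms for `auto`.

```python
from pathlib import Path


def expected_markdown_path(pdf_path: str, output_dir: str, method: str = "auto") -> Path:
    """Build {output_dir}/{pdf_stem}/{method}/{pdf_stem}.md for a given PDF."""
    stem = Path(pdf_path).stem  # file name without directory or extension
    return Path(output_dir) / stem / method / f"{stem}.md"


if __name__ == "__main__":
    print(expected_markdown_path("papers/paper.pdf", "./output"))
```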
Programmatic Conversion
import os
import shutil
import subprocess
import time

CONVERTER_OUTPUT_DIR = "./mineru_outputs"


def run_mineru(pdf_path: str) -> str | None:
    """
    Convert PDF to Markdown using MinerU.
    Returns path to the generated markdown file, or None on failure.
    """
    # splitext drops only the final extension, regardless of its case
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
    pdf_size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    print(f"[INFO] PDF: {pdf_name} ({pdf_size_mb:.1f} MB)")

    # MinerU creates: {output_dir}/{pdf_name}/auto/{pdf_name}.md
    base_folder = os.path.join(CONVERTER_OUTPUT_DIR, pdf_name)
    method_folder = os.path.join(base_folder, "auto")
    expected_md_path = os.path.join(method_folder, f"{pdf_name}.md")

    # Try 'mineru' command (v2.x), fall back to 'magic-pdf' (v1.x)
    found_command = False
    for cmd_name in ["mineru", "magic-pdf"]:
        if not shutil.which(cmd_name):
            continue
        found_command = True
        cmd = [cmd_name, "-p", pdf_path, "-o", CONVERTER_OUTPUT_DIR, "-m", "auto"]
        print(f"[INFO] Running: {cmd_name} -m auto")
        start = time.time()

        # Stream output for progress visibility
        process = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )

        # Print progress lines
        for line in process.stdout:
            line = line.strip()
            if line and any(x in line for x in ["Predict", "Processing", "Batch", "INFO"]):
                print(f"  {line}")
        process.wait()
        elapsed = time.time() - start

        if process.returncode != 0:
            print(f"[ERROR] {cmd_name} failed (exit {process.returncode})")
            continue

        # Check for markdown file
        if os.path.exists(expected_md_path):
            md_size_kb = os.path.getsize(expected_md_path) / 1024
            print(f"[INFO] Converted in {elapsed:.1f}s -> {md_size_kb:.0f} KB markdown")
            return expected_md_path

        # Fallback: search for any .md file
        for root, _, files in os.walk(base_folder):
            for f in files:
                if f.endswith(".md"):
                    return os.path.join(root, f)

        print(f"[WARN] {cmd_name} completed but no .md file found")
        return None

    # Only report a missing command if neither one was ever found
    if not found_command:
        print("[ERROR] Neither 'mineru' nor 'magic-pdf' found in PATH")
    return None
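The streaming part of `run_mineru` can be tried in isolation, without MinerU installed. The sketch below pulls the same `Popen` pattern into a hypothetical helper, `stream_command`, and uses a short Python one-liner as a stand-in child process; the keyword filter mirrors the one above but is otherwise an assumption about what counts as a progress line.

```python
import subprocess
import sys


def stream_command(cmd: list[str], keywords: tuple[str, ...] = ()) -> tuple[int, list[str]]:
    """Run cmd, echo matching output lines as they arrive, return (exit code, lines)."""
    shown = []
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so ordering is preserved
        text=True,
        bufsize=1,                 # line-buffered in text mode
    )
    with process.stdout:
        for line in process.stdout:
            line = line.strip()
            # An empty keyword tuple means "show every non-empty line"
            if line and (not keywords or any(k in line for k in keywords)):
                print(f"  {line}")
                shown.append(line)
    return process.wait(), shown


if __name__ == "__main__":
    # Stand-in child process so the sketch runs without MinerU installed
    child = [sys.executable, "-c", "print('INFO start'); print('noise'); print('Batch 1/1')"]
    code, lines = stream_command(child, keywords=("INFO", "Batch"))
    print(code, lines)
```

Merging stderr into stdout keeps a single stream to read, which avoids the deadlock that can occur when two pipes are read one after the other.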
Output Quality
For a technical paper, MinerU produces clean Markdown:
# Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar...
Abstract — The dominant sequence transduction models are based on complex
recurrent or convolutional neural networks...
## I. INTRODUCTION
Recurrent neural networks, long short-term memory [1] and gated recurrent [2]
neural networks in particular, have been firmly established as state of the
art approaches in sequence modeling...
| Model | BLEU | Training Cost |
|-------|------|---------------|
| Transformer (base) | 27.3 | $3.3 \times 10^{18}$ |
| Transformer (big) | 28.4 | $2.3 \times 10^{19}$ |
Key observations:
- LaTeX preserved: Equations remain as $...$ and $$...$$
- Tables intact: Converted to Markdown tables
- Structure maintained: Headings, paragraphs, lists preserved
- Citations kept: [1-3] reference markers remain
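These observations can be turned into a quick sanity check on any converted file. The sketch below counts headings, table rows, inline equations, and citation markers with regular expressions; `markdown_stats` is a hypothetical helper, the patterns are rough heuristics rather than a Markdown parser, and `sample` is a shortened copy of the output above.

```python
import re

sample = r"""# Attention Is All You Need
## I. INTRODUCTION
Recurrent neural networks, long short-term memory [1] and gated recurrent [2]
| Model | BLEU | Training Cost |
|-------|------|---------------|
| Transformer (base) | 27.3 | $3.3 \times 10^{18}$ |
| Transformer (big) | 28.4 | $2.3 \times 10^{19}$ |
"""


def markdown_stats(md: str) -> dict[str, int]:
    """Count headings, table rows, inline equations, and citation markers."""
    lines = md.splitlines()
    return {
        "headings": sum(1 for ln in lines if re.match(r"#{1,6} ", ln)),
        # Header and data rows start with '|'; separator rows like |---| are skipped
        "table_rows": sum(
            1 for ln in lines
            if ln.startswith("|") and not re.fullmatch(r"[|\s:-]+", ln)
        ),
        # $...$ with no '$' inside; display math ($$...$$) is not counted here
        "inline_equations": len(re.findall(r"(?<!\$)\$[^$\n]+\$(?!\$)", md)),
        # Markers such as [1], [1-3], or [4, 5]
        "citations": len(re.findall(r"\[\d+(?:[-,]\s*\d+)*\]", md)),
    }


if __name__ == "__main__":
    print(markdown_stats(sample))
```

A converted paper that reports zero equations or zero table rows is a strong hint that something went wrong during extraction.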
Handling Different Document Types
Research Papers
Default settings work well:
mineru -p paper.pdf -o ./output -m auto
Textbooks with Complex Layout
For documents with marginal notes, sidebars, or unusual layouts:
mineru -p textbook.pdf -o ./output -m auto --layout-model doclayout_yolo
Scanned Documents
MinerU includes OCR support:
mineru -p scanned.pdf -o ./output -m ocr
Batch Processing
Process an entire directory:
from pathlib import Path


def batch_convert(input_dir: str) -> dict[str, list[str]]:
    """Convert all PDFs in a directory.

    Output goes to CONVERTER_OUTPUT_DIR, the folder run_mineru writes to.
    """
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    print(f"Found {len(pdfs)} PDFs")
    results = {"success": [], "failed": []}

    for i, pdf in enumerate(pdfs, start=1):
        print(f"\n[{i}/{len(pdfs)}] {pdf.name}")
        md_path = run_mineru(str(pdf))
        if md_path:
            results["success"].append(pdf.name)
        else:
            results["failed"].append(pdf.name)

    print(f"\n{'='*50}")
    print(f"Success: {len(results['success'])}")
    print(f"Failed: {len(results['failed'])}")
    if results["failed"]:
        print("\nFailed PDFs:")
        for name in results["failed"]:
            print(f"  - {name}")
    return results


# Usage
batch_convert("./papers")
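For large collections it helps to skip PDFs that were already converted, so an interrupted batch can resume. The sketch below is one way to do that, assuming the output layout described earlier; `pending_pdfs` is a hypothetical helper and is not part of MinerU.

```python
from pathlib import Path


def pending_pdfs(input_dir: str, output_dir: str, method: str = "auto") -> list[Path]:
    """Return PDFs in input_dir that have no Markdown output yet."""
    pending = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Assumed layout: {output_dir}/{stem}/{method}/{stem}.md
        expected = Path(output_dir) / pdf.stem / method / f"{pdf.stem}.md"
        if not expected.exists():
            pending.append(pdf)
    return pending


if __name__ == "__main__":
    for pdf in pending_pdfs("./papers", "./mineru_outputs"):
        print("still to convert:", pdf.name)
```

Feeding only the pending files into the conversion loop keeps reruns cheap, at the cost of not noticing a PDF that changed after it was first converted.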
Common Issues and Solutions
Out of Memory
MinerU’s models require significant RAM. For large documents:
# Reduce batch size (slower but less memory)
mineru -p large.pdf -o ./output -m auto --batch-size 1
Or process page ranges:
# First 50 pages only
mineru -p large.pdf -o ./output -m auto --start-page 0 --end-page 50
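To cover a whole document in pieces, the page ranges have to be computed first. The sketch below does only that arithmetic; `page_ranges` is a hypothetical helper, and it assumes 0-based page numbers with an inclusive end. Whether the CLI treats its end page as inclusive should be checked against the installed release before these values are passed on.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 0..total_pages-1 into (start, end) pairs, both 0-based and inclusive."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size > 0")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


if __name__ == "__main__":
    for start, end in page_ranges(120, 50):
        print(f"pages {start} to {end}")
```

For a 120-page document and chunks of 50, this yields (0, 49), (50, 99), and (100, 119).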
Equation Extraction Failures
Some equation styles confuse the detector. Consider:
- Pre-processing with image enhancement
- Using OCR mode for heavily formatted equations
- Post-processing with regex to fix common patterns
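As an illustration of the regex post-processing option, the sketch below normalizes whitespace inside inline equations. The pattern and the helper name `tidy_inline_math` are assumptions chosen for this example, not something MinerU provides, and the pattern would misfire on prose that contains literal dollar amounts.

```python
import re


def tidy_inline_math(md: str) -> str:
    """Trim padding inside $...$ and collapse runs of whitespace within the equation."""
    def fix(match: re.Match) -> str:
        body = re.sub(r"\s+", " ", match.group(1)).strip()
        return f"${body}$"

    # Match single-dollar spans only; $$...$$ display blocks are left untouched
    return re.sub(r"(?<!\$)\$(?!\$)([^$\n]+?)(?<!\$)\$(?!\$)", fix, md)


if __name__ == "__main__":
    print(tidy_inline_math("energy $ E  =  mc^2 $ here"))
```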
Table Detection Issues
Borderless tables are challenging. For better results:
mineru -p doc.pdf -o ./output -m auto --table-model tablemaster
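Whichever table model is used, a cheap structural check on the resulting Markdown can flag tables that came out damaged. The sketch below reports rows whose cell count differs from their table's header row; `malformed_table_rows` is a hypothetical helper, and it assumes cells contain no literal `|` characters.

```python
def malformed_table_rows(md: str) -> list[int]:
    """Return 1-based line numbers of table rows whose cell count differs from the header."""
    bad = []
    expected = None  # cell count of the current table's header row
    for number, line in enumerate(md.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # any non-table line ends the current table
            continue
        cells = stripped.strip("|").split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            bad.append(number)
    return bad


if __name__ == "__main__":
    table = "| a | b |\n|---|---|\n| 1 | 2 | 3 |"
    print(malformed_table_rows(table))
```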
Performance Benchmarks
On an NVIDIA RTX 3090:
| Document Type | Pages | Time | Output Size |
|---|---|---|---|
| Research paper | 12 | 8s | 45 KB |
| Technical spec | 85 | 52s | 320 KB |
| Textbook chapter | 40 | 28s | 180 KB |
| Scanned document | 20 | 35s | 95 KB |
CPU-only processing is 5-10x slower but functional.
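Dividing pages by time gives a throughput figure that is easier to compare across rows: the three born-digital documents land at roughly 1.4 to 1.6 pages per second, while the scanned document runs at about 0.6. The sketch below reproduces that arithmetic and applies the 5-10x slowdown quoted above to get a rough CPU estimate; the helper names are illustrative, and the CPU figures are an extrapolation, not a measurement.

```python
# (document type, pages, seconds) copied from the table above
benchmarks = [
    ("Research paper", 12, 8),
    ("Technical spec", 85, 52),
    ("Textbook chapter", 40, 28),
    ("Scanned document", 20, 35),
]


def pages_per_second(pages: int, seconds: float) -> float:
    """Throughput for one benchmark row."""
    return pages / seconds


def cpu_estimate(seconds: float, slowdown: tuple[float, float] = (5.0, 10.0)) -> tuple[float, float]:
    """Scale a GPU time by the 5-10x CPU slowdown range quoted in the text."""
    low, high = slowdown
    return seconds * low, seconds * high


if __name__ == "__main__":
    for name, pages, seconds in benchmarks:
        rate = pages_per_second(pages, seconds)
        low, high = cpu_estimate(seconds)
        print(f"{name}: {rate:.2f} pages/s on GPU, roughly {low:.0f}-{high:.0f}s on CPU")
```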
Integration with the Pipeline
The extracted Markdown feeds directly into LLM extraction:
def process_pdf(pdf_path: str) -> str | None:
    """Complete PDF processing: convert then extract."""
    # Stage 1: PDF to Markdown
    md_path = run_mineru(pdf_path)
    if not md_path:
        return None

    # Stage 2: Read markdown for LLM processing
    with open(md_path, "r", encoding="utf-8") as f:
        markdown_text = f.read()
    return markdown_text
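Because MinerU preserves headings, the returned Markdown can be split into sections before it is handed to an LLM. The sketch below shows one simple way to do this; `split_by_headings` is a hypothetical helper and not necessarily the approach used later in this series, and it would treat a `#` line inside a fenced code block as a heading.

```python
import re


def split_by_headings(markdown_text: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs; text before the first heading gets ''."""
    sections = []
    heading, body = "", []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6} ", line):
            # Close the previous section before starting a new one
            if heading or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections


if __name__ == "__main__":
    text = "# Title\nintro\n## I. INTRODUCTION\ntext a\ntext b"
    for heading, body in split_by_headings(text):
        print(repr(heading), "->", repr(body))
```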
The next post covers structured extraction with Instructor.
Summary
PDF extraction quality determines everything downstream. MinerU’s deep learning approach handles complex technical documents that defeat traditional libraries. The clean Markdown output—with preserved equations, tables, and structure—provides the foundation for reliable knowledge extraction.
Key points:
- Use MinerU over PyPDF2/pdfplumber for technical documents
- LaTeX equations survive as extractable notation
- Tables convert to Markdown suitable for LLM processing
- Batch processing scales to large document collections