Claude OCR with Structured Outputs
Introduction
This vignette explains how Claude’s structured output feature works
in the ohseer package. Claude Opus 4.6 and Sonnet 4.5
support guaranteed JSON schema compliance through constrained decoding
during inference.
Basic Usage
library(ohseer)
# Process a document with Claude OCR using structured outputs
result <- claude_ocr_process_file(
  file_path = "document.pdf",
  model = "claude-opus-4-6"
)
How Structured Outputs Work
Claude’s structured output uses the output_config.format
parameter with a JSON schema to guarantee the response matches your
specified structure.
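As a rough illustration of where the schema travels in a request, the sketch below builds a request body as a plain R list. The nesting shown (output_config, then format, with type = "json_schema" and the schema itself) is an assumption inferred from the parameter name above. build_request_body() is a hypothetical helper, and ohseer may assemble its requests differently.

```r
# Hypothetical helper: assemble a request body that carries a JSON schema.
# ASSUMPTION: the schema sits under output_config$format with
# type = "json_schema"; check the Claude API documentation before relying on it.
build_request_body <- function(prompt, schema, model = "claude-opus-4-6",
                               max_tokens = 4096) {
  list(
    model = model,
    max_tokens = max_tokens,
    messages = list(
      list(role = "user", content = prompt)
    ),
    output_config = list(
      format = list(
        type = "json_schema",
        schema = schema
      )
    )
  )
}

# A tiny schema: an object with one required string field.
# `required` is a list() so that a single name still serialises as a JSON
# array when converted with jsonlite::toJSON(auto_unbox = TRUE).
title_schema <- list(
  type = "object",
  properties = list(title = list(type = "string")),
  required = list("title")
)

body <- build_request_body("Extract the title.", title_schema)
str(body$output_config, max.level = 2)
```

Keeping the body as a nested list until the last moment makes it easy to inspect with str() before it is serialised and sent.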
Key Features
- Guaranteed Compliance: Response will always match the provided JSON schema
- Constrained Decoding: Uses constrained decoding during inference (not post-processing)
- Available Models: Claude Opus 4.6 and Claude Sonnet 4.5
- No Parsing Errors: Eliminates JSON parsing failures from malformed responses
Default Schema: Tensorlake-Compatible Format
By default, claude_ocr_process_file() uses a
Tensorlake-compatible schema for consistency across providers.
Schema Structure
The default schema defines pages with these fragment types:
{
  "pages": [
    {
      "page_number": 1,
      "page_fragments": [
        {
          "fragment_type": "page_header",
          "content": { "content": "..." },
          "reading_order": 1
        },
        {
          "fragment_type": "section_header",
          "content": { "content": "..." },
          "reading_order": 2
        },
        {
          "fragment_type": "text",
          "content": { "content": "..." },
          "reading_order": 3
        },
        {
          "fragment_type": "table",
          "content": {
            "content": "plain text",
            "html": "<table>...</table>",
            "markdown": "| col1 | col2 |..."
          },
          "reading_order": 4
        }
      ]
    }
  ]
}
Fragment Types
The schema supports these fragment types:
- page_header: Headers at the top of pages (journal name, volume, etc.)
- page_number: Page numbers
- section_header: Section/chapter headings
- text: Regular paragraph text
- table: Tables with content, html, and markdown representations
- table_caption: Table captions/titles
- figure: Images or figures
- figure_caption: Figure captions
- list: Lists
- footnote: Footnotes
- equation: Mathematical equations
- code: Code blocks
- other: Any other content type
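Because every fragment carries a fragment_type and a reading_order, a small helper can pull out one type at a time in reading order. The sketch below runs on a hand-built mock page that follows the schema above. fragments_of_type() is a hypothetical helper, not part of ohseer.

```r
# Mock page shaped like the default schema (hand-written, not real OCR output).
mock_page <- list(
  page_number = 1,
  page_fragments = list(
    list(fragment_type = "text",
         content = list(content = "Second paragraph."),
         reading_order = 3),
    list(fragment_type = "section_header",
         content = list(content = "Methods"),
         reading_order = 1),
    list(fragment_type = "text",
         content = list(content = "First paragraph."),
         reading_order = 2)
  )
)

# Hypothetical helper: fragments of one type, sorted by reading_order.
fragments_of_type <- function(page, type) {
  frags <- Filter(function(f) identical(f$fragment_type, type),
                  page$page_fragments)
  if (length(frags) == 0) {
    return(list())
  }
  ord <- order(vapply(frags, function(f) as.numeric(f$reading_order),
                      numeric(1)))
  frags[ord]
}

# Collect the text of all "text" fragments in reading order.
text_frags <- fragments_of_type(mock_page, "text")
page_text <- vapply(text_frags, function(f) f$content$content, character(1))
print(page_text)
```

Sorting explicitly by reading_order avoids relying on the order in which fragments happen to appear in the response.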
Using the Default Schema
The default behavior provides structured, Tensorlake-compatible output:
library(ohseer)
# Process with default Tensorlake schema
result <- claude_ocr_process_file("paper.pdf")
# Access structured pages
pages <- result$pages
# First page
page1 <- pages[[1]]
page1$page_number # 1
# Get all text fragments
text_fragments <- Filter(function(f) f$fragment_type == "text", page1$page_fragments)
# Get all tables
table_fragments <- Filter(function(f) f$fragment_type == "table", page1$page_fragments)
# Access table in markdown format
if (length(table_fragments) > 0) {
  table1_markdown <- table_fragments[[1]]$content$markdown
  cat(table1_markdown)
}
Custom Schemas
You can provide your own JSON schema for custom output structures.
Example: Simple Schema
# Define a simple schema for extracting just titles and authors
simple_schema <- list(
  type = "object",
  properties = list(
    title = list(type = "string"),
    authors = list(
      type = "array",
      items = list(type = "string")
    ),
    abstract = list(type = "string")
  ),
  required = c("title", "authors", "abstract")
)
# Use with custom prompt
# (base64_image is assumed to hold a base64-encoded page image created earlier)
result <- claude_api_call(
  prompt = "Extract the title, authors, and abstract from this paper.",
  image_data = list(base64_image),
  schema = simple_schema,
  model = "claude-opus-4-6"
)
# Access results
cat("Title:", result$content[[1]]$text$title, "\n")
cat("Authors:", paste(result$content[[1]]$text$authors, collapse = ", "), "\n")
Example: Metadata Extraction
# Schema for scientific paper metadata
metadata_schema <- list(
  type = "object",
  properties = list(
    title = list(type = "string"),
    authors = list(
      type = "array",
      items = list(
        type = "object",
        properties = list(
          name = list(type = "string"),
          affiliation = list(type = "string")
        )
      )
    ),
    journal = list(type = "string"),
    year = list(type = "integer"),
    doi = list(type = "string"),
    keywords = list(
      type = "array",
      items = list(type = "string")
    )
  ),
  required = c("title", "authors", "journal")
)
# Process first 2 pages for metadata
result <- claude_ocr_process_file(
  "paper.pdf",
  pages = c(1, 2),
  schema = metadata_schema,
  prompt = "Extract bibliographic metadata from this scientific paper."
)
Processing Options
Page Selection
Process specific pages only:
# Process first 3 pages
result <- claude_ocr_process_file(
  "paper.pdf",
  pages = c(1, 2, 3),
  model = "claude-opus-4-6"
)
Model Selection
Choose between available models:
# Use Opus 4.6 (most capable, slower, more expensive)
result_opus <- claude_ocr_process_file("paper.pdf", model = "claude-opus-4-6")
# Use Sonnet 4.5 (fast, cost-effective)
result_sonnet <- claude_ocr_process_file("paper.pdf", model = "claude-sonnet-4-5")
Image Quality
Adjust image processing:
result <- claude_ocr_process_file(
  "paper.pdf",
  dpi = 200, # Higher DPI for better quality
  model = "claude-opus-4-6"
)
Complete Example
Here’s a complete workflow for extracting structured data from a scientific paper:
library(ohseer)
library(jsonlite)
# 1. Process first 2 pages with Claude for metadata extraction
metadata_result <- claude_ocr_process_file(
  "paper.pdf",
  pages = c(1, 2),
  model = "claude-sonnet-4-5" # Fast model for metadata
)
# 2. Access structured data
page1 <- metadata_result$pages[[1]]
# 3. Extract citation information
page_headers <- Filter(function(f) f$fragment_type == "page_header", page1$page_fragments)
section_headers <- Filter(function(f) f$fragment_type == "section_header", page1$page_fragments)
if (length(page_headers) > 0) {
  cat("Journal:", page_headers[[1]]$content$content, "\n")
}
if (length(section_headers) > 0) {
  cat("Title:", section_headers[[1]]$content$content, "\n")
}
# 4. Extract all tables
all_tables <- list()
for (page in metadata_result$pages) {
  table_frags <- Filter(function(f) f$fragment_type == "table", page$page_fragments)
  # Loop variable is named tbl to avoid masking base::table()
  for (tbl in table_frags) {
    all_tables[[length(all_tables) + 1]] <- list(
      page = page$page_number,
      markdown = tbl$content$markdown,
      html = tbl$content$html
    )
  }
}
cat("Found", length(all_tables), "tables\n")
# 5. Convert to JSON for further processing
json_output <- toJSON(metadata_result$pages, auto_unbox = TRUE, pretty = TRUE)
Benefits of Structured Outputs
1. Reliability
- Guaranteed Format: No more parsing errors from malformed JSON
- Type Safety: Schema ensures correct data types
- Required Fields: Can specify which fields must be present
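Even with schema-constrained output, a cheap defensive check on the parsed result costs little. A minimal sketch, assuming the result has already been parsed into an R list; missing_required() is a hypothetical helper and the mock values are invented for illustration.

```r
# Hypothetical helper: names listed in schema$required that are absent
# (or NULL) in a parsed result.
missing_required <- function(parsed, schema) {
  required <- unlist(schema$required)
  present <- vapply(required, function(nm) !is.null(parsed[[nm]]), logical(1))
  required[!present]
}

schema <- list(
  type = "object",
  properties = list(
    title = list(type = "string"),
    journal = list(type = "string"),
    year = list(type = "integer")
  ),
  required = c("title", "journal")
)

# Mock parsed result (invented values, not real output).
parsed <- list(title = "An Example Paper", year = 2020L)

# Report which required fields are missing from the mock result.
missing_fields <- missing_required(parsed, schema)
print(missing_fields)
```

An empty result from missing_required() means every field named in required is present, so downstream code can index into the result without extra NULL checks.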
Comparison with Other Providers
Tips and Best Practices
- Model Selection: Use Sonnet 4.5 for most tasks (fast and cost-effective):
  result <- claude_ocr_process_file("doc.pdf", model = "claude-sonnet-4-5")
- Page Selection: Process only the pages you need to save costs:
  # Metadata is usually on the first 1-2 pages
  result <- claude_ocr_process_file("doc.pdf", pages = c(1, 2))
- Schema Design: Keep schemas simple and focused.
- Error Handling: Always validate the response.
- Cost Optimization: Balance quality and cost:
  - Use Sonnet 4.5 for routine processing
  - Use Opus 4.6 for complex documents or when the highest accuracy is needed
  - Process the minimum number of pages necessary
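The error-handling tip can be sketched as a wrapper that catches failures and checks the top-level shape of the result. The sketch takes the processing function as an argument so that it runs without network access. safe_process() and the mock functions are hypothetical; in practice claude_ocr_process_file would be passed as process_fn.

```r
# Hypothetical wrapper: call a processing function, catch errors, and check
# that the result has a non-empty `pages` list. Returns NULL on any problem.
safe_process <- function(process_fn, ...) {
  result <- tryCatch(
    process_fn(...),
    error = function(e) {
      message("Processing failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(result)) {
    return(NULL)
  }
  if (!is.list(result) || !is.list(result$pages) || length(result$pages) == 0) {
    message("Unexpected response: no pages found")
    return(NULL)
  }
  result
}

# Mock processing functions standing in for a real OCR call.
mock_ok <- function(path) {
  list(pages = list(list(page_number = 1, page_fragments = list())))
}
mock_fail <- function(path) stop("simulated network error")

ok <- safe_process(mock_ok, "doc.pdf")
failed <- safe_process(mock_fail, "doc.pdf")
```

Returning NULL on failure keeps batch loops simple: callers can skip a document with is.null() instead of wrapping every call in their own tryCatch().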
Further Reading
- Claude Setup Guide
- Tensorlake Output Structure
- Mistral Output Structure
- Package README
- Claude API documentation: https://docs.anthropic.com