Understanding Mistral OCR Output Structure
Introduction
This vignette explains the native structure of objects returned by
Mistral’s OCR functions. The ohseer package returns
Mistral’s native format without post-processing, allowing applications
to handle transformations as needed.
Basic Usage
library(ohseer)
# Parse a document with Mistral OCR
result <- mistral_ocr("document.pdf",
                      extract_header = TRUE,
                      extract_footer = TRUE)
# Extract pages (returns native Mistral format)
pages <- mistral_extract_pages(result)
Output Structure
The mistral_ocr() function returns a list with Mistral’s
complete response structure.
Top-Level Fields
str(result, max.level = 1)
Key top-level fields include:
- id: Unique job identifier
- object: Object type (typically “document_ocr”)
- model: Model used (e.g., “mistral-ocr-2512”)
- usage: Token usage statistics
- pages: List of page objects (detailed below)
- created: Timestamp of creation
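As a minimal sketch of reading these fields, the block below builds a hypothetical mock of the response so it runs without an API call. The values for id, usage, and created are placeholders chosen for illustration, not real output; only the field names come from the list above.

```r
# Hypothetical mock of a mistral_ocr() result; all values are illustrative.
result <- list(
  id = "job-123",                 # placeholder identifier
  object = "document_ocr",
  model = "mistral-ocr-2512",
  usage = list(),                 # contents omitted in this sketch
  pages = list(
    list(index = 0L, markdown = "# Title"),
    list(index = 1L, markdown = "Body text")
  ),
  created = 1700000000            # placeholder timestamp
)

# Scalar metadata is read directly from the list.
model_used <- result$model
n_pages <- length(result$pages)

cat("Model:", model_used, "\n")
cat("Pages:", n_pages, "\n")
```

With a real result, the same two accessors apply unchanged.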
Page Structure
Each element in result$pages represents one page and
contains:
Page Fields
page1 <- result$pages[[1]]
str(page1, max.level = 1)
Each page object has these 8 fields:
- index: 0-based page number (first page = 0)
- markdown: Full page content in markdown format
- images: Array of extracted images (base64 or URLs)
- tables: Array of table objects (detailed below)
- hyperlinks: Array of hyperlinks detected on the page
- header: Page header text (when extract_header = TRUE)
- footer: Page footer text (when extract_footer = TRUE)
- dimensions: Page dimensions object
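One way to get an overview of these fields is a per-page summary table. The sketch below uses a hypothetical two-page list that mimics the documented fields (dimensions is left out for brevity); the summary columns are names chosen for this example.

```r
# Hypothetical pages mimicking the documented page fields.
pages <- list(
  list(index = 0L, markdown = "# Intro", images = list(), tables = list(),
       hyperlinks = list(), header = "CHAPTER ONE", footer = "1"),
  list(index = 1L, markdown = "Text", images = list(),
       tables = list(list(id = "tbl-0.md", content = "| a |",
                          format = "markdown")),
       hyperlinks = list(), header = NULL, footer = NULL)
)

# One row per page; index is converted to a 1-based page number for display.
page_summary <- do.call(rbind, lapply(pages, function(p) {
  data.frame(
    page_number = p$index + 1L,
    n_chars = nchar(p$markdown),
    n_tables = length(p$tables),
    n_images = length(p$images)
  )
}))

print(page_summary)
```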
Table Structure
Tables are extracted and stored in the tables array of
each page.
Table Fields
table1 <- page1$tables[[1]]
str(table1)
Each table has 3 fields:
- id: Unique table identifier (e.g., “tbl-0.md”)
- content: Markdown-formatted table content
- format: Format type (typically “markdown”)
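Because the example id looks like a file name, one possible use of these fields is saving each table to disk. This is only a sketch: the table object is a hypothetical mock, and treating id as a safe file name is an assumption, which is why basename() is applied first.

```r
# Hypothetical table object with the three documented fields.
table1 <- list(
  id = "tbl-0.md",
  content = "| a | b |\n| --- | --- |\n| 1 | 2 |",
  format = "markdown"
)

# Write into a fresh temporary directory so nothing is overwritten.
out_dir <- tempfile("tables_")
dir.create(out_dir)

# basename() guards against an id that contains path separators.
out_path <- file.path(out_dir, basename(table1$id))

# Only write content whose format is the one we know how to handle.
if (identical(table1$format, "markdown")) {
  writeLines(table1$content, out_path)
}

saved_lines <- readLines(out_path)
```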
Headers and Footers
When you enable header/footer extraction, Mistral separates them from the main content.
Extraction Options
result <- mistral_ocr("document.pdf",
                      extract_header = TRUE, # Extract running headers
                      extract_footer = TRUE) # Extract page numbers/footers
Header/Footer Format
Headers and footers can be:
- String: "CHAPTER FIVE\n64 RETENTION OF TREES WITH HOLLOWS"
- Empty object: {} (when no header/footer detected)
- NULL: When extraction is disabled
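Since the same field can take three shapes, downstream code may want to normalize it first. The helper below is a hypothetical sketch, not part of ohseer, and it assumes that the empty JSON object {} arrives in R as a zero-length list.

```r
# Hypothetical helper: collapse the three documented shapes into a single
# character value, using NA when no header/footer text is available.
normalize_hf <- function(x) {
  # NULL (extraction disabled) and zero-length list (nothing detected)
  if (is.null(x) || length(x) == 0) {
    return(NA_character_)
  }
  as.character(x)[1]
}

normalize_hf("CHAPTER FIVE\n64 RETENTION OF TREES WITH HOLLOWS")
normalize_hf(list())
normalize_hf(NULL)
```

Returning NA rather than an empty string keeps "no header" distinguishable from a header that is genuinely blank.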
Dimensions Object
Each page includes dimension information for rendering calculations.
page1$dimensions
Extracting Information
Get Specific Pages
Use mistral_extract_pages() to filter to specific
pages:
# Extract first 3 pages only
first_three <- mistral_extract_pages(result, pages = c(1, 2, 3))
# Note: page numbers in the pages argument are 1-based
# But the 'index' field in each page is 0-based
Extract All Tables
Collect all tables from the document:
all_tables <- list()
for (i in seq_along(result$pages)) {
  page <- result$pages[[i]]
  if (length(page$tables) > 0) {
    for (j in seq_along(page$tables)) {
      all_tables[[length(all_tables) + 1]] <- list(
        page_number = i,          # 1-based for human readability
        page_index = page$index,  # 0-based as in original
        table_id = page$tables[[j]]$id,
        content = page$tables[[j]]$content
      )
    }
  }
}
# View table summary
do.call(rbind, lapply(all_tables, function(t) {
  data.frame(
    page = t$page_number,
    table_id = t$table_id,
    chars = nchar(t$content)
  )
}))
Parse Table Markdown
Convert markdown tables to data frames:
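# Get first table
table1 <- result$pages[[1]]$tables[[1]]
# Parse markdown to a data frame using base R only
# Note: This is a simple approach; more robust parsing may be needed
lines <- strsplit(table1$content, "\n")[[1]]
# Remove the separator line (usually the second line, made of ---).
# POSIX classes are used because \\s is not reliable inside [...] without perl = TRUE.
data_lines <- lines[!grepl("^\\|?[-:[:space:]]+(\\|[-:[:space:]]+)*\\|?$", lines)]
# Split each remaining row into trimmed cells (assumes no escaped pipes inside cells)
cells <- lapply(strsplit(gsub("^\\s*\\||\\|\\s*$", "", data_lines), "\\|"), trimws)
# First row is the header; the rest are data rows (assumes equal cell counts per row)
df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(df) <- cells[[1]]
# You can also send the markdown to an LLM for structured extraction
Complete Example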
Here’s a complete workflow for processing a scientific paper:
library(ohseer)
library(jsonlite)
# 1. Parse the document with header/footer extraction
result <- mistral_ocr("paper.pdf",
                      extract_header = TRUE,
                      extract_footer = TRUE,
                      table_format = "markdown")
# 2. Extract all pages
pages <- mistral_extract_pages(result)
# 3. Examine first page structure
page1 <- pages[[1]]
cat("Page", page1$index + 1, "\n") # +1 for 1-based display
cat("Header:", page1$header, "\n")
cat("Footer:", if(is.null(page1$footer) || length(page1$footer) == 0) "None" else page1$footer, "\n")
cat("Tables:", length(page1$tables), "\n")
cat("Images:", length(page1$images), "\n")
# 4. Extract all tables with page information
all_tables <- list()
for (page in pages) {
  for (table in page$tables) {
    all_tables[[length(all_tables) + 1]] <- list(
      page = page$index + 1,  # Convert to 1-based
      id = table$id,
      markdown = table$content
    )
  }
}
cat("Total tables found:", length(all_tables), "\n")
# 5. Convert to JSON for downstream processing
json_output <- toJSON(pages, auto_unbox = TRUE, pretty = TRUE)
# 6. Process with your application
# Applications can implement their own transformations based on needs
Key Differences from Other Providers
Index Numbering
- Mistral: Uses 0-based indexing in the index field (first page = 0)
- Tensorlake: Uses 1-based page_number (first page = 1)
Always remember this when filtering or displaying page numbers.
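The offset matters most once pages have been filtered, because list position and index no longer line up. The helper below is a hypothetical sketch, not an ohseer function, that looks a page up by its human page number through the index field.

```r
# Hypothetical helper: find a page by 1-based page number via the 0-based
# 'index' field, independent of its position in the list.
get_page_by_number <- function(pages, page_number) {
  idx <- vapply(pages, function(p) as.integer(p$index), integer(1))
  hit <- which(idx == page_number - 1L)
  if (length(hit) == 0) {
    return(NULL)  # requested page is not in this list
  }
  pages[[hit[1]]]
}

# Mock of a filtered result holding only document pages 3 and 6.
pages <- list(
  list(index = 2L, markdown = "third page"),
  list(index = 5L, markdown = "sixth page")
)

page3 <- get_page_by_number(pages, 3)
missing_page <- get_page_by_number(pages, 1)
```

Here pages[[1]] is document page 3, so indexing by list position alone would return the wrong page.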
Tips and Best Practices
- Page Numbering: Always be aware of the 0-based index vs 1-based page references:
  # To get "page 1" (human numbering):
  page1 <- pages[[1]] # R uses 1-based indexing
  # But page1$index will be 0
- Header/Footer Extraction: Enable these for cleaner body text:
  result <- mistral_ocr("doc.pdf", extract_header = TRUE, extract_footer = TRUE)
- Table Format: Request markdown format for easier parsing:
  result <- mistral_ocr("doc.pdf", table_format = "markdown")
- Image Handling: Use include_image_base64 = TRUE to get images:
  result <- mistral_ocr("doc.pdf", include_image_base64 = TRUE)
  # Access with: pages[[1]]$images
- JSON Export: For LLM processing, convert to JSON with jsonlite::toJSON(), as in step 5 of the Complete Example.
- Validate Results: Check that OCR completed successfully.
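The last two tips can be sketched together. The validity check below is a hypothetical helper based only on the fields documented in this vignette; the package may offer its own checks, and what counts as a successful result is an assumption here. The JSON step reuses the jsonlite::toJSON() call from the complete example and is skipped if jsonlite is not installed.

```r
# Hypothetical check: a result counts as usable when it is a list with at
# least one page and every page carries character markdown content.
validate_ocr_result <- function(result) {
  isTRUE(
    is.list(result) &&
      length(result$pages) > 0 &&
      all(vapply(result$pages,
                 function(p) is.character(p$markdown),
                 logical(1)))
  )
}

# Mock results for illustration only.
good_result <- list(pages = list(list(index = 0L, markdown = "# Title")))
empty_result <- list(pages = list())

ok_good <- validate_ocr_result(good_result)
ok_empty <- validate_ocr_result(empty_result)

# Export to JSON only after the check passes.
if (ok_good && requireNamespace("jsonlite", quietly = TRUE)) {
  json_output <- jsonlite::toJSON(good_result$pages,
                                  auto_unbox = TRUE, pretty = TRUE)
}
```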
Further Reading
- Mistral Setup Guide
- Tensorlake Output Structure
- Package README
- Mistral AI API documentation: https://docs.mistral.ai