Understanding Tensorlake Output Structure
Introduction
This vignette explains the structure of objects returned by Tensorlake’s OCR functions and shows how to extract different types of information from the results.
Basic Usage
```r
library(ohseer)

# Parse a document with Tensorlake
result <- tensorlake_ocr("document.pdf")
```

Output Structure
The tensorlake_ocr() function returns a list with the following top-level fields:
Top-Level Fields
```r
str(result, max.level = 1)
```

- parse_id: Unique identifier for the parse job
- status: Parse job status (e.g., “successful”, “failed”)
- total_pages: Total number of pages in the document
- parsed_pages_count: Number of pages successfully parsed
- pages: List of parsed page data (detailed below)
- chunks: List of text chunks extracted from the document
- page_classes: Classification of page types (if applicable)
- created_at: Timestamp when the parse job was created (ISO 8601 format)
- finished_at: Timestamp when the parse job completed (ISO 8601 format)
- usage: API usage statistics (tokens, pages, etc.)
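As a quick sanity check before doing anything else with a result, the top-level fields can be inspected directly. The sketch below uses a hand-built list that mirrors the documented structure (hypothetical values, not real API output), so it runs without an API call:

```r
# Mock result mirroring the documented top-level structure (hypothetical values)
result <- list(
  parse_id           = "parse_123",
  status             = "successful",
  total_pages        = 2L,
  parsed_pages_count = 2L,
  pages              = list(),
  created_at         = "2024-01-01T00:00:00Z",
  finished_at        = "2024-01-01T00:00:10Z"
)

# Confirm the parse completed and covered every page
stopifnot(result$status == "successful")
stopifnot(result$parsed_pages_count == result$total_pages)
```

The same two checks apply unchanged to a real tensorlake_ocr() result.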
Page Structure
Each element in result$pages represents one page and contains:
Page Fields
```r
page1 <- result$pages[[1]]
str(page1, max.level = 1)
```

- page_number: Page number (integer)
- page_fragments: List of content fragments found on the page
- dimensions: Page dimensions (width, height)
- page_dimensions: Alternative dimension measurements
- classification_reason: Reason for page classification (if applicable)
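To illustrate, the page fields can be read off directly. This sketch uses a hand-built page list following the field names above (the dimension values are hypothetical):

```r
# Mock page following the documented field names (hypothetical values)
page1 <- list(
  page_number    = 1L,
  page_fragments = list(),
  dimensions     = list(width = 612, height = 792)  # US Letter at 72 dpi
)

# A page taller than it is wide is portrait-oriented
is_portrait <- page1$dimensions$height > page1$dimensions$width
```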
Page Fragments
Each page contains multiple fragments representing different content types:
```r
fragment <- page1$page_fragments[[1]]
str(fragment, max.level = 2)
```

Fragment Fields
- fragment_type: Type of content (see types below)
- content: The actual content (text or structured data)
  - content$content: The text content
  - content$html: HTML representation (for tables)
- reading_order: Position in reading sequence (integer)
- bbox: Bounding box coordinates
  - x1, y1: Top-left corner
  - x2, y2: Bottom-right corner
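Given the corner convention above, a fragment’s width and height fall out of simple coordinate differences. A minimal sketch with a hand-made bbox list (the field names follow the list above; the coordinate values are made up):

```r
# Hypothetical bounding box using the documented corner fields
bbox <- list(x1 = 72, y1 = 100, x2 = 540, y2 = 160)

# Top-left (x1, y1) and bottom-right (x2, y2) give size by subtraction
frag_width  <- bbox$x2 - bbox$x1  # 468
frag_height <- bbox$y2 - bbox$y1  # 60
```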
Fragment Types
Tensorlake identifies several content types:
| Fragment Type | Description |
|---|---|
| page_header | Headers at top of pages |
| page_number | Page numbers |
| section_header | Section/chapter headings |
| text | Regular paragraph text |
| table | Tables (with optional HTML) |
| table_caption | Table captions/titles |
| figure | Images or figures |
| figure_caption | Figure captions |
Extracting Information
Extract Structured Page Data
Use tensorlake_extract_pages() to get organized content by fragment type:

```r
# Extract first 2 pages with structured data
pages <- tensorlake_extract_pages(result, pages = c(1, 2))

# Access first page
page1 <- pages[[1]]

# Page structure:
page1$page_number    # Page number
page1$page_header    # Character vector of page headers (e.g., journal citation)
page1$section_header # Character vector of section headers (e.g., article title)
page1$text           # String with all text content (markdown format)
page1$tables         # List of tables with content, markdown, and html formats
page1$other          # Other fragment types

# Example: Get citation info from first page
citation <- page1$page_header
title <- page1$section_header

# Example: Access table data
if (length(page1$tables) > 0) {
  table1 <- page1$tables[[1]]
  cat(table1$markdown) # Markdown format
  cat(table1$html)     # HTML format
  cat(table1$content)  # Plain text
}

# Convert to JSON for LLM processing
library(jsonlite)
json_for_llm <- toJSON(pages, auto_unbox = TRUE, pretty = TRUE)
```

Working with Fragments
Filter by Fragment Type
Find all section headers in a document:

```r
# Get all section headers
headers <- list()
for (page in result$pages) {
  for (frag in page$page_fragments) {
    if (frag$fragment_type == "section_header") {
      headers[[length(headers) + 1]] <- list(
        page = page$page_number,
        text = frag$content$content,
        order = frag$reading_order
      )
    }
  }
}

# View headers as a data frame
do.call(rbind, lapply(headers, as.data.frame))
```

Extract Text in Reading Order
Get text fragments in the order they should be read:

```r
# For a specific page
page_num <- 1
page <- result$pages[[page_num]]

# Sort fragments by reading order
sorted_fragments <- page$page_fragments[order(sapply(page$page_fragments, function(f) f$reading_order))]

# Extract text in order
# (the null-default operator `%||%` requires base R >= 4.4, or the rlang package)
ordered_text <- sapply(sorted_fragments, function(frag) {
  frag$content$content %||% ""
})

cat(paste(ordered_text, collapse = "\n"))
```

Complete Example
Here’s a complete workflow for processing an academic paper:

```r
library(ohseer)
library(jsonlite)

# 1. Parse the document
result <- tensorlake_ocr("paper.pdf")

# 2. Check status
cat("Status:", result$status, "\n")
cat("Processed", result$parsed_pages_count, "pages\n")

# 3. Extract structured data from the first 2 pages (for citation metadata)
pages <- tensorlake_extract_pages(result, pages = c(1, 2))

# 4. Access citation information
page1 <- pages[[1]]
cat("Journal:", page1$page_header, "\n")
cat("Title:", page1$section_header, "\n")

# 5. Convert to JSON for LLM processing
json_data <- toJSON(pages, auto_unbox = TRUE, pretty = TRUE)
# Send json_data to Claude or another LLM for metadata extraction

# 6. Access tables if needed
for (page in pages) {
  if (length(page$tables) > 0) {
    cat("Found", length(page$tables), "tables on page", page$page_number, "\n")
  }
}
```

Tips and Best Practices
- Fragment Types: Different documents may contain different fragment types, so always check what’s available.
- Reading Order: Use reading_order to maintain document flow when extracting text.
- Bounding Boxes: Use bbox coordinates if you need to know where content appears on the page.
- Page Selection: For citation extraction, processing just the first 1-2 pages is usually sufficient and faster.
- Error Handling: Always check result$status to ensure parsing succeeded:

```r
if (result$status != "successful") {
  stop("Parsing failed or is incomplete")
}
```
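As a sketch of the fragment-type tip, the types present in a parse can be listed with one pass over the pages. The small mock list here (hypothetical data) stands in for a real tensorlake_ocr() result, which has the same shape:

```r
# Mock result standing in for a real parse (hypothetical data)
result <- list(pages = list(
  list(page_fragments = list(
    list(fragment_type = "section_header"),
    list(fragment_type = "text"),
    list(fragment_type = "table")
  ))
))

# List every fragment type that actually occurs in this document
available_types <- unique(unlist(lapply(result$pages, function(p)
  sapply(p$page_fragments, function(f) f$fragment_type)
)))
available_types  # c("section_header", "text", "table")
```

Knowing the actual types up front avoids silently missing content when filtering, e.g. a document whose headings were classified as text rather than section_header.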
Further Reading
- Tensorlake Setup Guide
- Package README
- Tensorlake API documentation: https://docs.tensorlake.ai