Extract Page Content by Fragment Type
Extracts content from Tensorlake OCR results, organized by fragment type. Returns a simple list structure with fragments grouped by their Tensorlake-assigned types (page_header, section_header, text, table, etc.).
Usage
tensorlake_extract_pages(result, pages = NULL, exclude_types = character(0))
Value
List with one element per page, each containing:
- page_number
Integer page number
- page_header
Character vector of page_header fragment contents
- section_header
Character vector of section_header fragment contents
- text
Character string with all text fragments in markdown format
- tables
List of tables, each with markdown, html, and content fields
- figures
List of figures, each with content and summary fields
- charts
List of charts, each with content, chart_type, and data fields
- other
List of other fragment types with type and content
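To make the shape of the return value concrete, the sketch below builds a single page element by hand and inspects it. Every value is an invented placeholder, not real Tensorlake output. The type name "example_type" is hypothetical, and how empty fields are represented (shown here as `list()`) is an assumption.

```r
# Hand-built mock of one page element, following the fields listed above.
# All values are placeholders; nothing here comes from a real OCR run.
mock_page <- list(
  page_number = 1L,
  page_header = c("Example Journal 12(3)"),
  section_header = c("Introduction"),
  text = "First paragraph.\n\nSecond paragraph.",
  tables = list(
    list(
      markdown = "| a | b |\n|---|---|\n| 1 | 2 |",
      html = "<table><tr><td>1</td><td>2</td></tr></table>",
      content = "a b 1 2"
    )
  ),
  figures = list(),  # assumed representation of "no figures on this page"
  charts = list(),   # assumed representation of "no charts on this page"
  other = list(list(type = "example_type", content = "Placeholder"))  # hypothetical type
)

# Show the top-level fields and their types
str(mock_page, max.level = 1)

# Nested fields are reached with $ and [[ ]] as in any R list
first_table_md <- mock_page$tables[[1]]$markdown
n_figures <- length(mock_page$figures)
```

Checking `length()` before indexing into `tables`, `figures`, or `charts` avoids subscript errors on pages that have none.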
Examples
if (FALSE) { # \dontrun{
result <- tensorlake_ocr("article.pdf")
# Extract all pages
all_pages <- tensorlake_extract_pages(result)
# Extract specific pages
first_two <- tensorlake_extract_pages(result, pages = c(1, 2))
# Access first page data
page1 <- all_pages[[1]]
page1$page_header # Journal citation
page1$section_header # Article title
page1$text # Body text in markdown
page1$tables # List of tables
} # }
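Because the result is a plain list with one element per page, base R apply functions are enough for post-processing. The following is a minimal sketch on a mock object with placeholder values. It assumes only the documented `page_number`, `text`, and `tables` fields, and that `text` is a single string per page.

```r
# Mock of a two-page result; values are placeholders, not real output.
mock_pages <- list(
  list(page_number = 1L, text = "Intro text.", tables = list()),
  list(
    page_number = 2L,
    text = "Methods text.",
    tables = list(
      list(markdown = "| x |\n|---|\n| 1 |", html = "", content = "x 1")
    )
  )
)

# Join the markdown text of all pages into one string,
# separating pages with a blank line
full_text <- paste(
  vapply(mock_pages, function(p) p$text, character(1)),
  collapse = "\n\n"
)

# Count tables per page and label each count with its page number
table_counts <- vapply(mock_pages, function(p) length(p$tables), integer(1))
names(table_counts) <- vapply(mock_pages, function(p) p$page_number, integer(1))
```

`vapply()` is used rather than `sapply()` so that a page with an unexpected field type raises an error instead of silently changing the shape of the result.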