Process Document with Tensorlake OCR
tensorlake_ocr.RdThis function processes a document with Tensorlake's high-accuracy parsing service (91.7% accuracy) and returns the OCR results. The function uploads the file, submits a parse job, polls for completion, and returns the final result.
Usage
tensorlake_ocr(
file_path,
pages = NULL,
figure_summarization = TRUE,
chart_extraction = TRUE,
figure_summarization_prompt = NULL,
tensorlake_api_key = Sys.getenv("TENSORLAKE_API_KEY"),
max_wait_seconds = 60,
poll_interval = 2,
output_file = NULL
)Arguments
- file_path
Character string. Path to a local PDF, DOCX, PPTX, image, or text file.
- pages
Integer vector or character string. Optional page range to parse. Can be a vector like c(1, 2) or 1:5, or a string like "1-5" or "1,3,5". If NULL (default), parses entire document.
- figure_summarization
Logical. Enable figure summarization enrichment, returning text descriptions of figures. Default TRUE.
- chart_extraction
Logical. Enable chart extraction enrichment, returning structured data from charts/graphs. Default TRUE.
- figure_summarization_prompt
Character string. Optional custom prompt to guide figure summarization. Default NULL uses Tensorlake's default prompt.
- tensorlake_api_key
Character string. Tensorlake API key. Default retrieves from environment variable "TENSORLAKE_API_KEY".
- max_wait_seconds
Numeric. Maximum seconds to wait for parsing to complete. Default is 60.
- poll_interval
Numeric. Seconds between status checks. Default is 2.
- output_file
Character string. Optional path to save the JSON response to a file. Default is NULL (no file output).
Value
List. The parsed response from Tensorlake containing:
- status
Parse job status
- result
Parsed document content with text, tables, and structured data
- metadata
Document metadata
Note
Tensorlake offers superior accuracy (91.7%) compared to AWS Textract (88.4%) and does not have the 5 MB file size limit of Textract's synchronous API. Pricing is competitive at $0.01 per page.
Examples
if (FALSE) { # \dontrun{
# Process entire PDF with Tensorlake
result <- tensorlake_ocr("document.pdf")
# Process only first 2 pages (faster, cheaper)
result <- tensorlake_ocr("document.pdf", pages = c(1, 2))
# Save output to JSON file
result <- tensorlake_ocr("document.pdf", output_file = "result.json")
# Extract structured data
pages <- tensorlake_extract_pages(result, pages = c(1, 2))
} # }