Tensorlake Setup Guide
This guide will help you get started with Tensorlake, the recommended OCR provider for ohseer.
Why Tensorlake?
Tensorlake offers the best accuracy for document parsing: - 91.7% accuracy for structured data extraction - 86.8% accuracy for table structure preservation - No 5 MB file size limit (unlike AWS Textract synchronous API) - Simple API key authentication (no IAM setup required) - Competitive pricing at $0.01 per page
Getting Started
1. Sign Up
Visit https://cloud.tensorlake.ai and create an account.
2. Get Your API Key
- Log into your Tensorlake dashboard
- Navigate to API Keys section
- Click “Create New API Key”
- Copy your API key
3. Set Environment Variable
Option 1: In R Session
Sys.setenv(TENSORLAKE_API_KEY = "your-api-key-here")Option 2: In .env File (Recommended)
Create a .env file in your project directory:
⚠️ Security: Never commit .env files to version control! Add .env to your .gitignore.
4. Test Your Setup
library(ohseer)
# Process a document
result <- tensorlake_ocr("path/to/document.pdf")
# Check the result
str(result, max.level = 2)Usage Examples
Basic Document Parsing
# Parse entire document
result <- tensorlake_ocr("document.pdf")
# Parse specific pages
result <- tensorlake_ocr("document.pdf", pages = "1-5")
# Save output to file
result <- tensorlake_ocr("document.pdf", output_file = "result.json")Advanced Options
Custom Wait Time
By default, tensorlake_ocr() waits up to 60 seconds for parsing to complete:
# Wait up to 120 seconds
result <- tensorlake_ocr("large-document.pdf", max_wait_seconds = 120)Poll Interval
Adjust how frequently to check parse status:
# Check every 5 seconds instead of default 2
result <- tensorlake_ocr("document.pdf", poll_interval = 5)Low-Level API Access
For more control, use the low-level functions:
# Submit parse job
parse_response <- tensorlake_parse_document(
file_path = "document.pdf",
tensorlake_api_key = Sys.getenv("TENSORLAKE_API_KEY"),
pages = "1-10"
)
parse_id <- parse_response$parse_id
# Wait a bit...
Sys.sleep(5)
# Get results
result <- tensorlake_get_parse_result(
parse_id = parse_id,
tensorlake_api_key = Sys.getenv("TENSORLAKE_API_KEY")
)Supported File Formats
Tensorlake supports: - PDF - Microsoft Word (DOCX, DOC) - Microsoft PowerPoint (PPTX, PPT) - Microsoft Excel (XLSX, XLS) - Images (PNG, JPEG, TIFF) - HTML - Plain text (TXT) - CSV
Pricing
- Document Parsing: $0.01 per page
- No additional fees for storage or bandwidth
- Pay only for what you process
For enterprise or on-premises deployments, contact Tensorlake support.
Troubleshooting
API Key Not Found
Error: Tensorlake API key not found
Solution: Set the TENSORLAKE_API_KEY environment variable:
Sys.setenv(TENSORLAKE_API_KEY = "your-key")Timeout Errors
Error: Parse job timed out after 60 seconds
Solution: Increase max_wait_seconds:
result <- tensorlake_ocr("document.pdf", max_wait_seconds = 120)Support
- Documentation: https://docs.tensorlake.ai
- GitHub: https://github.com/tensorlakeai/tensorlake
- Community Slack: Join via website
- Email: support@tensorlake.ai