A unified R interface to multiple OCR (Optical Character Recognition) APIs. Process documents with Claude (Opus 4.6/Sonnet 4.5), Mistral OCR 3, Tensorlake, or AWS Textract using a single, consistent function.
Documentation
Full documentation: https://n8layman.github.io/ohseer/
Part of the EcoExtract Suite
OhSeeR is the foundational first step in the EcoExtract Suite, a collection of R packages designed for extracting and structuring ecological data from academic literature.
Workflow: Source PDF Documents → OhSeeR (OCR) → sanitizeR (text cleaning) → whispeR (prompts) → LLM API → structuR (structured data) → auditR (validation) → Structured Dataset
Features
- Unified interface: Use ohseer_ocr() with any provider
- Provider fallback: Automatic failover if one provider fails
- Multiple OCR providers:
  - Claude Opus 4.6: #1 on OCR Arena leaderboards, structured outputs with JSON schemas
  - Tensorlake: Highest accuracy (91.7%), best for tables and forms
  - Mistral OCR 3: Native markdown output, cost-effective
  - AWS Textract: Reliable option for structured data extraction
- Consistent output: Same interface across all providers
- Lightweight: No heavy dependencies, uses httr2 for all API calls
Installation
# Using pak (recommended)
pak::pak("n8layman/ohseer")
# Using devtools
devtools::install_github("n8layman/ohseer")
# Using remotes
remotes::install_github("n8layman/ohseer")
Authentication
Set up API keys as environment variables:
# Set for the current session
Sys.setenv(
ANTHROPIC_API_KEY = "your-claude-key", # For Claude
TENSORLAKE_API_KEY = "your-tensorlake-key", # For Tensorlake
MISTRAL_API_KEY = "your-mistral-key", # For Mistral
AWS_ACCESS_KEY_ID = "your-aws-key", # For AWS Textract
AWS_SECRET_ACCESS_KEY = "your-aws-secret" # For AWS Textract
)
Or create a .env file in your project directory:
# .env
ANTHROPIC_API_KEY=your-claude-key
TENSORLAKE_API_KEY=your-tensorlake-key
MISTRAL_API_KEY=your-mistral-key
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret
⚠️ Security: Never commit .env files to version control. Add .env to your .gitignore.
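One dependency-free way to load a file of `KEY=value` lines is base R's `readRenviron()`. The sketch below is only an illustration: it writes a throwaway file with a made-up variable name (`OHSEER_DEMO_KEY`, not something ohseer reads) so that running it does not overwrite a real key. In a project you would call `readRenviron(".env")` instead.

```r
# Write a temporary .env-style file containing a placeholder variable.
# OHSEER_DEMO_KEY is a hypothetical name used only for this demonstration.
env_file <- tempfile(fileext = ".env")
writeLines(c("# .env", "OHSEER_DEMO_KEY=placeholder-value"), env_file)

# readRenviron() is base R: it parses KEY=value lines (ignoring # comments)
# and sets them as environment variables for the current session.
loaded <- readRenviron(env_file)

# Confirm the variable is now visible through Sys.getenv().
demo_value <- Sys.getenv("OHSEER_DEMO_KEY")
print(nzchar(demo_value))

# Clean up the temporary file.
unlink(env_file)
```

Variables loaded this way last for the current R session only, the same as with `Sys.setenv()`.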
Getting API Keys
- Claude: console.anthropic.com → API Keys
- Tensorlake: cloud.tensorlake.ai → Dashboard → API Key
- Mistral: mistral.ai → Try the API → API keys
- AWS Textract: aws.amazon.com → IAM → Create access key with AmazonTextractFullAccess
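Before calling any provider, it can help to check which keys are actually set in the session. The helper below is a hypothetical convenience, not part of ohseer. The environment variable names come from the authentication section above. The `textract` label is an assumption made for this sketch, since the provider string for AWS Textract is not shown here.

```r
# Environment variables each provider needs (names from the setup above).
# The "textract" label is an assumption for illustration.
required_keys <- list(
  claude     = "ANTHROPIC_API_KEY",
  tensorlake = "TENSORLAKE_API_KEY",
  mistral    = "MISTRAL_API_KEY",
  textract   = c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
)

# Hypothetical helper: returns the providers whose variables are all
# set to non-empty values in the current session.
available_providers <- function(keys = required_keys) {
  is_ready <- vapply(
    keys,
    function(vars) all(nzchar(Sys.getenv(vars))),
    logical(1)
  )
  names(keys)[is_ready]
}

print(available_providers())
```

An empty result means no keys are visible to R yet, which usually points to a missing `Sys.setenv()` call or an unloaded `.env` file.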
Quick Start
Basic Usage
library(ohseer)
# Process with default provider (Tensorlake)
result <- ohseer_ocr("document.pdf")
# Access extracted pages
pages <- result$pages
provider_used <- result$provider
Choose a Specific Provider
# Use Claude for highest accuracy
result <- ohseer_ocr("document.pdf", provider = "claude")
# Use Mistral for cost-effectiveness
result <- ohseer_ocr("document.pdf", provider = "mistral")
# Use Tensorlake for best table extraction
result <- ohseer_ocr("document.pdf", provider = "tensorlake")
Provider Fallback
Automatically try multiple providers in order until one succeeds:
# Try Tensorlake first, then fall back to Mistral, then Claude
result <- ohseer_ocr(
"document.pdf",
provider = c("tensorlake", "mistral", "claude")
)
# Check which provider succeeded
message("Used provider: ", result$provider)
# Check if any providers failed
if (!is.na(result$error_log)) {
errors <- jsonlite::fromJSON(result$error_log)
print(errors)
}
Select Specific Pages
# Process only first 2 pages
result <- ohseer_ocr("document.pdf", pages = c(1, 2))
# Process specific pages
result <- ohseer_ocr("document.pdf", pages = c(1, 5, 10))
Provider-Specific Options
Each provider accepts its own custom parameters via ...:
# Mistral: extract headers and footers separately
result <- ohseer_ocr(
"document.pdf",
provider = "mistral",
extract_header = TRUE,
extract_footer = TRUE
)
# Claude: use Sonnet instead of Opus, custom schema
result <- ohseer_ocr(
"document.pdf",
provider = "claude",
model = "claude-sonnet-4-5",
schema = my_custom_schema
)
# Tensorlake: use different model
result <- ohseer_ocr(
"document.pdf",
provider = "tensorlake",
model = "high-quality-v1"
)
Output Format
All providers return a consistent structure when using ohseer_ocr():
result <- ohseer_ocr("document.pdf")
# Result structure:
# $provider - Character: which provider was used
# $pages - List: extracted page data (format varies by provider)
# $raw - List: raw API response
# $error_log - Character (JSON): errors from failed providers, or NA
Note: Each provider returns pages in its own native format. See the provider-specific vignettes for details.
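To make the `provider` and `error_log` fields concrete, here is a sketch of how a fallback loop could populate them. It is not ohseer's actual implementation: `try_providers()` and the stub functions are hypothetical, no API is called, and the error log is kept as a named character vector rather than a JSON string to avoid extra dependencies.

```r
# Hypothetical fallback loop (illustration only, not ohseer's internals).
# `handlers` is a named list of functions, one per provider,
# each taking a file path and returning page data.
try_providers <- function(file, handlers) {
  errors <- character(0)
  for (name in names(handlers)) {
    outcome <- tryCatch(
      list(ok = TRUE, pages = handlers[[name]](file)),
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
    if (outcome$ok) {
      return(list(
        provider  = name,
        pages     = outcome$pages,
        error_log = if (length(errors) > 0) errors else NA
      ))
    }
    # Record the failure and move on to the next provider.
    errors[name] <- outcome$message
  }
  stop("All providers failed: ", paste(names(errors), collapse = ", "))
}

# Stub providers standing in for real API calls.
stubs <- list(
  tensorlake = function(file) stop("simulated timeout"),
  mistral    = function(file) list("page 1 text")
)

demo <- try_providers("document.pdf", stubs)
print(demo$provider)
print(demo$error_log)
```

In this sketch the first stub fails, so the result reports the second provider as the one used and keeps the first provider's error message alongside it.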
Provider Comparison
| Provider | Accuracy | Speed | Cost | Best For |
|---|---|---|---|---|
| Claude Opus 4.6 | ⭐⭐⭐⭐⭐ (#1 OCR Arena) | Medium | High | Structured outputs, custom schemas |
| Tensorlake | ⭐⭐⭐⭐⭐ (91.7%) | Fast | $0.01/page | Tables, forms, batch processing |
| Mistral OCR 3 | ⭐⭐⭐ | Very Fast | Low | Markdown output, cost-sensitive |
| AWS Textract | ⭐⭐⭐⭐ (88.4%) | Fast | Medium | AWS ecosystem, reliability |
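For budgeting a batch job, the one concrete price in the table (Tensorlake at $0.01 per page) allows a rough estimate. The helper below is hypothetical and only multiplies a page count by that listed rate. Actual billing may differ, so check the provider's current pricing.

```r
# Per-page price for Tensorlake as listed in the comparison table above.
tensorlake_price_per_page <- 0.01

# Hypothetical helper: estimated cost in dollars for a number of pages.
estimate_cost <- function(n_pages, price_per_page = tensorlake_price_per_page) {
  n_pages * price_per_page
}

# Example: a batch of 40 papers averaging 12 pages each.
batch_pages <- 40 * 12
batch_cost <- estimate_cost(batch_pages)
print(batch_cost)
```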
Advanced Usage
For provider-specific functions and advanced features, see:
- Complete Function Reference
- Unified Interface Guide
- Provider guides: Tensorlake | Mistral | Claude