Multi-Provider OCR API Interface for R • ohseer

A unified R interface to multiple OCR (Optical Character Recognition) APIs. Process documents with Claude (Opus 4.6/Sonnet 4.5), Mistral OCR 3, Tensorlake, or AWS Textract using a single, consistent function.

Documentation

📚 Full documentation: https://n8layman.github.io/ohseer/

Part of the EcoExtract Suite

OhSeeR is the foundational first step in the EcoExtract Suite, a collection of R packages designed for extracting and structuring ecological data from academic literature.

Workflow: Source PDF Documents → OhSeeR (OCR) → sanitizeR (text cleaning) → whispeR (prompts) → LLM API → structuR (structured data) → auditR (validation) → Structured Dataset

Features

Unified interface: Use ohseer_ocr() with any provider
Provider fallback: Automatic failover if one provider fails
Multiple OCR providers:
- Claude Opus 4.6: #1 on OCR Arena leaderboards, structured outputs with JSON schemas
- Tensorlake: Highest accuracy (91.7%), best for tables and forms
- Mistral OCR 3: Native markdown output, cost-effective
- AWS Textract: Reliable option for structured data extraction
Consistent output: Same top-level result structure (provider, pages, raw, error_log) from every provider
Lightweight: No heavy dependencies, uses httr2 for all API calls

Installation

# Using pak (recommended)
pak::pak("n8layman/ohseer")

# Using devtools
devtools::install_github("n8layman/ohseer")

# Using remotes
remotes::install_github("n8layman/ohseer")

Authentication

Set up API keys as environment variables:

# Set for the current session
Sys.setenv(
  ANTHROPIC_API_KEY = "your-claude-key",        # For Claude
  TENSORLAKE_API_KEY = "your-tensorlake-key",   # For Tensorlake
  MISTRAL_API_KEY = "your-mistral-key",         # For Mistral
  AWS_ACCESS_KEY_ID = "your-aws-key",           # For AWS Textract
  AWS_SECRET_ACCESS_KEY = "your-aws-secret"     # For AWS Textract
)

Or create a .env file in your project directory:

# .env
ANTHROPIC_API_KEY=your-claude-key
TENSORLAKE_API_KEY=your-tensorlake-key
MISTRAL_API_KEY=your-mistral-key
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret

⚠️ Security: Never commit .env files to version control. Add .env to your .gitignore.
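R does not read a .env file on its own, and this page does not say whether ohseer loads it for you. A minimal sketch, assuming you load the file yourself with base R's readRenviron() and then check which keys are present. missing_keys() is a hypothetical helper, not part of ohseer:

```r
# Hypothetical helper (not part of ohseer): report which of the
# expected environment variables are unset or empty.
missing_keys <- function(keys) {
  values <- Sys.getenv(keys, unset = "")
  keys[values == ""]
}

# Load KEY=value pairs from .env into the current session, if the file exists.
# readRenviron() is base R and reads the KEY=value format shown above.
if (file.exists(".env")) {
  readRenviron(".env")
}

# Only the keys for the providers you plan to use need to be set.
missing_keys(c("ANTHROPIC_API_KEY", "TENSORLAKE_API_KEY", "MISTRAL_API_KEY"))
```

An empty result means every listed key is available to the session.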

Getting API Keys

Claude: console.anthropic.com → API Keys
Tensorlake: cloud.tensorlake.ai → Dashboard → API Key
Mistral: mistral.ai → Try the API → API keys
AWS Textract: aws.amazon.com → IAM → Create an access key for a user with the AmazonTextractFullAccess policy

Quick Start

Basic Usage

library(ohseer)

# Process with default provider (Tensorlake)
result <- ohseer_ocr("document.pdf")

# Access extracted pages
pages <- result$pages
provider_used <- result$provider

Choose a Specific Provider

# Use Claude for structured outputs and custom schemas
result <- ohseer_ocr("document.pdf", provider = "claude")

# Use Mistral for cost-effectiveness
result <- ohseer_ocr("document.pdf", provider = "mistral")

# Use Tensorlake for best table extraction
result <- ohseer_ocr("document.pdf", provider = "tensorlake")

Provider Fallback

Automatically try multiple providers in order until one succeeds:

# Try Tensorlake first, then fall back to Mistral, then Claude
result <- ohseer_ocr(
  "document.pdf",
  provider = c("tensorlake", "mistral", "claude")
)

# Check which provider succeeded
message("Used provider: ", result$provider)

# Check if any providers failed
if (!is.na(result$error_log)) {
  errors <- jsonlite::fromJSON(result$error_log)
  print(errors)
}
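To make the fallback behavior concrete, here is a minimal sketch of trying providers in order and recording why each one failed. It is not ohseer's actual implementation: try_providers() and the stand-in handlers are hypothetical, and errors are kept as a named list rather than JSON so that the example needs only base R.

```r
# Hypothetical sketch of ordered fallback (not ohseer's actual code).
# `handlers` is a named list of functions, one per provider; each takes a
# file path and either returns page data or signals an error.
try_providers <- function(path, handlers) {
  errors <- list()
  for (name in names(handlers)) {
    outcome <- tryCatch(
      list(ok = TRUE, pages = handlers[[name]](path)),
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
    if (outcome$ok) {
      return(list(provider = name, pages = outcome$pages, errors = errors))
    }
    errors[[name]] <- outcome$message  # remember why this provider failed
  }
  stop("All providers failed: ", paste(names(errors), collapse = ", "))
}

# Stand-in handlers so the sketch runs without any API keys.
handlers <- list(
  tensorlake = function(path) stop("simulated timeout"),
  mistral    = function(path) list(list(page = 1, text = "Hello"))
)

result <- try_providers("document.pdf", handlers)
result$provider  # "mistral"
result$errors    # list(tensorlake = "simulated timeout")
```

The first provider that returns without error wins, and earlier failures travel with the result so they can be inspected afterwards.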

Select Specific Pages

# Process only the first 2 pages
result <- ohseer_ocr("document.pdf", pages = c(1, 2))

# Process specific pages
result <- ohseer_ocr("document.pdf", pages = c(1, 5, 10))

Provider-Specific Options

Each provider accepts its own custom parameters via ...:

# Mistral: extract headers and footers separately
result <- ohseer_ocr(
  "document.pdf",
  provider = "mistral",
  extract_header = TRUE,
  extract_footer = TRUE
)

# Claude: use Sonnet instead of Opus, custom schema
result <- ohseer_ocr(
  "document.pdf",
  provider = "claude",
  model = "claude-sonnet-4-5",
  schema = my_custom_schema
)

# Tensorlake: use different model
result <- ohseer_ocr(
  "document.pdf",
  provider = "tensorlake",
  model = "high-quality-v1"
)
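The Claude example above passes my_custom_schema without defining it, and this page does not specify what form schema must take. A minimal sketch, assuming it accepts a JSON Schema written as a nested R list (an assumption; check the Claude provider guide for the actual format). The field names are invented for illustration:

```r
# Assumed format: a JSON Schema expressed as a nested R list.
# Field names (pages, page_number, text) are illustrative only.
my_custom_schema <- list(
  type = "object",
  properties = list(
    pages = list(
      type = "array",
      items = list(
        type = "object",
        properties = list(
          page_number = list(type = "integer"),
          text        = list(type = "string")
        ),
        required = list("page_number", "text")
      )
    )
  ),
  required = list("pages")
)
```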

Output Format

All providers return a consistent structure when using ohseer_ocr():

result <- ohseer_ocr("document.pdf")

# Result structure:
# $provider  - Character: which provider was used
# $pages     - List: extracted page data (format varies by provider)
# $raw       - List: raw API response
# $error_log - Character (JSON): errors from failed providers, or NA

Note: Only the top-level fields are shared. Each provider returns pages in its own native format; see the provider-specific vignettes for details.
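Because only the top-level fields are shared, code that must work across providers is safest when it touches just those fields. A minimal sketch using a mock result; summarise_result() is a hypothetical helper, not part of ohseer, and the mock page contents are invented:

```r
# Hypothetical helper (not part of ohseer): summarise the documented
# top-level fields without assuming anything about the page format.
summarise_result <- function(result) {
  log <- result$error_log
  list(
    provider     = result$provider,
    n_pages      = length(result$pages),
    had_fallback = !is.null(log) && length(log) == 1 && !is.na(log)
  )
}

# Mock result with the documented shape; the page contents are invented.
mock <- list(
  provider  = "mistral",
  pages     = list(list(text = "Page one"), list(text = "Page two")),
  raw       = list(),
  error_log = '{"tensorlake":"simulated timeout"}'
)

summarise_result(mock)
```

Anything that reads inside an individual page should branch on result$provider first.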

Provider Comparison

Provider	Accuracy	Speed	Cost	Best For
Claude Opus 4.6	⭐⭐⭐⭐⭐ (#1 OCR Arena)	Medium	High	Structured outputs, custom schemas
Tensorlake	⭐⭐⭐⭐⭐ (91.7%)	Fast	$0.01/page	Tables, forms, batch processing
Mistral OCR 3	⭐⭐⭐	Very Fast	Low	Markdown output, cost-sensitive
AWS Textract	⭐⭐⭐⭐ (88.4%)	Fast	Medium	AWS ecosystem, reliability

Advanced Usage

For provider-specific functions and advanced features, see:

Complete Function Reference
Unified Interface Guide
Provider guides: Tensorlake | Mistral | Claude

Notes

This package is experimental and the API may change
Large files may take time to process depending on provider
Check each provider's documentation for pricing and rate limits

License

MIT License