# EcoExtract

> Structured ecological data extraction and refinement from scientific literature.

EcoExtract automates the extraction of structured data from PDFs using OCR and LLMs. It's domain-agnostic and works with any JSON schema you define.
## Pipeline

```mermaid
graph LR
    A[PDF Papers] -->|ohseer| B[OCR]
    B -->|ecoextract| C[Metadata]
    B -->|ecoextract + Claude| D[Data Extraction]
    C --> E[SQLite Database]
    D --> E
    E -.->|optional| F[Refinement]
    F -.-> E
    E -->|ecoreview| G[Human Review]
    G --> H[Validated Data]
    style A fill:#e1f5ff
    style H fill:#c8e6c9
    style G fill:#fff9c4
    style E fill:#f0f0f0
```
## Packages

| Package | Purpose | Links |
|---|---|---|
| ohseer | OCR processing via Tensorlake | GitHub |
| ecoextract | AI-powered extraction pipeline | Docs, GitHub |
| ecoreview | Interactive Shiny review app | GitHub |
**Quick Links:**
- Complete Guide (Installation, workflow, review, accuracy)
- Configuration Guide (Custom schemas and prompts)
- Accuracy Metrics Guide (Understanding accuracy calculations)
## Installation

```r
# Install the ecosystem (pak recommended)
pak::pak("n8layman/ohseer")     # OCR processing
pak::pak("n8layman/ecoextract") # Data extraction
pak::pak("n8layman/ecoreview")  # Review app (optional)
```

See the Complete Guide for alternative installation methods and troubleshooting.
## API Key Setup
EcoExtract uses ellmer for LLM interactions and ohseer for OCR.
Required API keys:
- Tensorlake (OCR, default): https://www.tensorlake.ai/
- Anthropic Claude (extraction): https://console.anthropic.com/
Optional OCR providers (via ohseer):
- Mistral: https://console.mistral.ai/
- Claude: https://console.anthropic.com/ (uses same key as extraction)
Create a `.env` file in your project root (make sure it's in `.gitignore` first!):

```sh
ANTHROPIC_API_KEY=your_anthropic_api_key_here
TENSORLAKE_API_KEY=your_tensorlake_api_key_here

# Optional for alternative OCR providers
MISTRAL_API_KEY=your_mistral_api_key_here
```

The `.env` file is automatically loaded when R starts in the project directory. See the Complete Guide for detailed setup instructions.
By default, ecoextract uses `anthropic/claude-sonnet-4-5` for extraction and `tensorlake` for OCR. To use different providers, pass the `model` or `ocr_provider` parameters to `process_documents()`. Supported LLM providers: Anthropic (`anthropic/`), Google Gemini (`google_gemini/`), OpenAI (`openai/`), Mistral (`mistral/`), and Groq (`groq/`).
## Quick Start
```r
library(ecoextract)

# Process all PDFs in a folder through the 4-step pipeline:
# OCR -> Metadata -> Extraction -> Refinement (optional)
results <- process_documents(
  pdf_path = "path/to/pdfs/",
  db_conn = "ecoextract_records.db"
)

# Retrieve your data
records <- get_records()
export_db(filename = "extracted_data.csv")
```

## Model Fallback
EcoExtract supports tiered model fallback to handle content refusals (e.g., Claude refusing disease/biosecurity papers). Provide a vector of models to try sequentially:
```r
# Single model (default)
process_documents(
  pdf_path = "papers/",
  model = "anthropic/claude-sonnet-4-5"
)

# Tiered fallback: try Claude, then Gemini (1M context), then Mistral
process_documents(
  pdf_path = "papers/",
  model = c(
    "anthropic/claude-sonnet-4-5",
    "google_gemini/gemini-2.5-flash",
    "mistral/mistral-large-latest"
  )
)
```

**Audit logging:** The database tracks which model succeeded for each step (metadata, extraction, refinement) in `*_llm_model` columns. All failed attempts, with error messages and timestamps, are logged in `*_log` columns for debugging.
**API keys:** Remember to add keys for any fallback providers to your `.env` file.
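For example, a `.env` covering the tiered fallback above might look like this (the exact variable name expected for the Gemini key is an assumption here; check your provider's documentation):

```sh
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here    # assumed variable name
MISTRAL_API_KEY=your_mistral_api_key_here
```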
## OCR Provider Selection
EcoExtract supports multiple OCR providers through ohseer. By default it uses Tensorlake, but you can switch to Mistral or Claude:
```r
# Use default provider (Tensorlake)
process_documents("papers/")

# Use Mistral OCR
process_documents(
  pdf_path = "papers/",
  ocr_provider = "mistral"
)

# Use Claude OCR
process_documents(
  pdf_path = "papers/",
  ocr_provider = "claude"
)

# OCR fallback: try Mistral, then Tensorlake if Mistral fails
process_documents(
  pdf_path = "papers/",
  ocr_provider = c("mistral", "tensorlake")
)

# Increase OCR timeout for large documents
process_documents(
  pdf_path = "papers/",
  ocr_timeout = 300 # 5 minutes
)
```

**Provider fallback:** When multiple providers are specified, ohseer automatically tries each in order until one succeeds. The OCR provider used for each document is tracked in the `ocr_provider` column of the `documents` table.
## Key Features
- **Smart skip logic** – Re-running `process_documents()` skips completed steps. Forced re-runs automatically invalidate downstream steps.
- **Parallel processing** – Process multiple documents simultaneously with `workers = 4` (requires the `crew` package).
- **Deduplication** – Three methods: `"llm"` (default), `"embedding"`, or `"jaccard"`.
- **Human review** – Edit, add, and delete records in the ecoreview Shiny app with full audit trail.
- **Accuracy metrics** – Calculate detection recall, field precision, F1, and edit severity after review.
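A minimal sketch of enabling parallel processing, assuming `workers` is passed directly to `process_documents()` as the feature list describes:

```r
# Sketch: process documents in parallel (requires the crew package)
process_documents(
  pdf_path = "papers/",
  db_conn = "ecoextract_records.db",
  workers = 4
)
```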
See the Complete Guide for details on all features.
## Custom Schemas
EcoExtract is domain-agnostic and works with any JSON schema:
```r
# Create custom config directory with templates
init_ecoextract()

# Edit the generated files:
# - ecoextract/schema.json          # Define your data structure
# - ecoextract/extraction_prompt.md # Describe what to extract

# The package automatically uses these files
process_documents("pdfs/", "records.db")
```

**Schema requirements:** a top-level `records` property (an array of objects), each field with a `type` and `description`, in JSON Schema draft-07 format.
See the Configuration Guide for complete details and examples.
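As an illustration only (the field names here are invented, not part of the package), a minimal `ecoextract/schema.json` meeting those requirements might look like:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "records": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "species": {
            "type": "string",
            "description": "Latin binomial of the species observed"
          },
          "abundance": {
            "type": "number",
            "description": "Reported count or density"
          }
        }
      }
    }
  },
  "required": ["records"]
}
```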
## Package Functions

### Workflow

- `process_documents()` – Complete 4-step workflow (OCR -> Metadata -> Extract -> Refine)

### Database Setup

- `init_ecoextract_database()` – Initialize database with schema
- `init_ecoextract()` – Create project config directory with template schema and prompts

### Data Access

- `get_documents()` – Query documents and their metadata from the database
- `get_records()` – Query extracted records from the database
- `get_ocr_markdown()` – Get OCR markdown text for a document
- `get_ocr_html_preview()` – Render OCR output with embedded images as HTML
- `get_db_stats()` – Get document and record counts from the database
- `export_db()` – Export records with metadata to a tibble or CSV file
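A sketch of a typical post-run inspection using the getters above (the `document_id` argument to `get_ocr_markdown()` is an assumption, not a documented signature):

```r
# Sketch: inspect a finished run
get_db_stats()                           # document and record counts
recs <- get_records()                    # all extracted records
md <- get_ocr_markdown(document_id = 1)  # assumed argument name
cat(substr(md, 1, 300))                  # peek at the OCR text
```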
## Testing
Integration tests require API keys in a .env file. See CONTRIBUTING.md for details.
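Once keys are in place, the suite can be run from a local checkout in the usual way for an R package (assuming `devtools` is installed):

```r
# Sketch: run the testthat suite from the package root
devtools::test()
```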
## File Structure

```
ecoextract/
├── R/
│   ├── workflow.R              # Main process_documents() workflow + skip/cascade logic
│   ├── ocr.R                   # OCR processing
│   ├── metadata.R              # Publication metadata extraction
│   ├── extraction.R            # Data extraction functions
│   ├── refinement.R            # Data refinement functions
│   ├── deduplication.R         # Record deduplication (LLM, embedding, Jaccard)
│   ├── database.R              # Database operations
│   ├── getters.R               # Data access functions (get_*, export_db)
│   ├── config_loader.R         # Configuration file loading + init_ecoextract()
│   ├── prompts.R               # Prompt loading
│   ├── utils.R                 # Utilities
│   ├── config.R                # Package configuration
│   └── ecoextract-package.R    # Package metadata
├── inst/
│   ├── extdata/                # Schema files
│   │   ├── schema.json
│   │   └── metadata_schema.json
│   └── prompts/                # System prompts
│       ├── extraction_prompt.md
│       ├── extraction_context.md
│       ├── metadata_prompt.md
│       ├── metadata_context.md
│       ├── refinement_prompt.md
│       ├── refinement_context.md
│       └── deduplication_prompt.md
├── tests/testthat/             # Tests
├── vignettes/                  # Package vignettes
├── DESCRIPTION
├── NAMESPACE
├── CONTRIBUTING.md             # Development guide
└── README.md
```