
Structured ecological data extraction and refinement from scientific literature.

EcoExtract automates the extraction of structured data from PDFs using OCR and LLMs. It’s domain-agnostic and works with any JSON schema you define.

Pipeline

graph LR
    A[PDF Papers] -->|ohseer| B[OCR]
    B -->|ecoextract| C[Metadata]
    B -->|ecoextract + Claude| D[Data Extraction]
    C --> E[SQLite Database]
    D --> E
    E -.->|optional| F[Refinement]
    F -.-> E
    E -->|ecoreview| G[Human Review]
    G --> H[Validated Data]

    style A fill:#e1f5ff
    style H fill:#c8e6c9
    style G fill:#fff9c4
    style E fill:#f0f0f0


| Package | Purpose | Links |
|---|---|---|
| ohseer | OCR processing via Tensorlake | GitHub |
| ecoextract | AI-powered extraction pipeline | Docs \| GitHub |
| ecoreview | Interactive Shiny review app | GitHub |


Installation

# Install the ecosystem (pak recommended)
pak::pak("n8layman/ohseer")      # OCR processing
pak::pak("n8layman/ecoextract")  # Data extraction
pak::pak("n8layman/ecoreview")   # Review app (optional)

See the Complete Guide for alternative installation methods and troubleshooting.

API Key Setup

EcoExtract uses ellmer for LLM interactions and ohseer for OCR.

Required API keys:

  • ANTHROPIC_API_KEY – Claude models for metadata extraction, data extraction, and refinement (via ellmer)
  • TENSORLAKE_API_KEY – default OCR provider (via ohseer)

Optional OCR providers (via ohseer):

  • MISTRAL_API_KEY – Mistral OCR

Create a .env file in your project root (make sure it’s in .gitignore first!):

ANTHROPIC_API_KEY=your_anthropic_api_key_here
TENSORLAKE_API_KEY=your_tensorlake_api_key_here
# Optional for alternative OCR providers
MISTRAL_API_KEY=your_mistral_api_key_here

The .env file is automatically loaded when R starts in the project directory. See the Complete Guide for detailed setup instructions.
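If your environment does not load `.env` files automatically (for example, when running scripts outside the project), the dotenv package can load them explicitly; a minimal sketch:

```r
# Load variables from .env into the current R session
# (only needed when the file is not loaded automatically at startup)
dotenv::load_dot_env(".env")

# Confirm the key is now visible to ellmer/ohseer
nzchar(Sys.getenv("ANTHROPIC_API_KEY"))
```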

By default, ecoextract uses anthropic/claude-sonnet-4-5 for extraction and tensorlake for OCR. To use different providers, pass the model or ocr_provider parameters to process_documents(). Supported LLM providers: Anthropic (anthropic/), Google Gemini (google_gemini/), OpenAI (openai/), Mistral (mistral/), Groq (groq/).

Quick Start

library(ecoextract)

# Process all PDFs in a folder through the 4-step pipeline:
# OCR -> Metadata -> Extraction -> Refinement (optional)
results <- process_documents(
  pdf_path = "path/to/pdfs/",
  db_conn = "ecoextract_records.db"
)

# Retrieve your data
records <- get_records()
export_db(filename = "extracted_data.csv")

Model Fallback

EcoExtract supports tiered model fallback to handle content refusals (e.g., Claude refusing disease/biosecurity papers). Provide a vector of models to try sequentially:

# Single model (default)
process_documents(
  pdf_path = "papers/",
  model = "anthropic/claude-sonnet-4-5"
)

# Tiered fallback: try Claude, then Gemini (1M context), then Mistral
process_documents(
  pdf_path = "papers/",
  model = c(
    "anthropic/claude-sonnet-4-5",
    "google_gemini/gemini-2.5-flash",
    "mistral/mistral-large-latest"
  )
)

Audit logging: The database tracks which model succeeded for each step (metadata, extraction, refinement) in *_llm_model columns. All failed attempts with error messages and timestamps are logged in *_log columns for debugging.
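These columns can be inspected directly with DBI; a sketch assuming the per-step audit columns live in the documents table (check your database schema for the exact names):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ecoextract_records.db")

# Which model succeeded at extraction, plus any logged failed attempts
audit <- dbGetQuery(con, "
  SELECT extraction_llm_model, extraction_log
  FROM documents
")

dbDisconnect(con)
```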

API keys: Add keys for fallback providers to your .env:

ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key

OCR Provider Selection

EcoExtract supports multiple OCR providers through ohseer. By default it uses Tensorlake, but you can switch to Mistral or Claude:

# Use default provider (Tensorlake)
process_documents("papers/")

# Use Mistral OCR
process_documents(
  pdf_path = "papers/",
  ocr_provider = "mistral"
)

# Use Claude OCR
process_documents(
  pdf_path = "papers/",
  ocr_provider = "claude"
)

# OCR fallback: try Mistral, then Tensorlake if Mistral fails
process_documents(
  pdf_path = "papers/",
  ocr_provider = c("mistral", "tensorlake")
)

# Increase OCR timeout for large documents
process_documents(
  pdf_path = "papers/",
  ocr_timeout = 300  # 5 minutes
)

Provider fallback: When multiple providers are specified, ohseer automatically tries each in order until one succeeds. The OCR provider used for each document is tracked in the ocr_provider column of the documents table.
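To see how documents were distributed across OCR providers, you can query the documents table; a quick sketch with DBI:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "ecoextract_records.db")

# Count processed documents per OCR provider
dbGetQuery(con, "
  SELECT ocr_provider, COUNT(*) AS n_docs
  FROM documents
  GROUP BY ocr_provider
")

dbDisconnect(con)
```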

Key Features

  • Smart skip logic – Re-running process_documents() skips completed steps. Forced re-runs automatically invalidate downstream steps.
  • Parallel processing – Process multiple documents simultaneously with workers = 4 (requires the crew package).
  • Deduplication – Three methods: "llm" (default), "embedding", or "jaccard".
  • Human review – Edit, add, and delete records in the ecoreview Shiny app with full audit trail.
  • Accuracy metrics – Calculate detection recall, field precision, F1, and edit severity after review.
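The parallel-processing option above, sketched as a call (workers is the parameter named in the feature list; the crew package must be installed):

```r
# Process up to four documents at a time
results <- process_documents(
  pdf_path = "papers/",
  db_conn = "ecoextract_records.db",
  workers = 4
)
```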

See the Complete Guide for details on all features.

Custom Schemas

EcoExtract is domain-agnostic and works with any JSON schema:

# Create custom config directory with templates
init_ecoextract()

# Edit the generated files:
# - ecoextract/schema.json          # Define your data structure
# - ecoextract/extraction_prompt.md # Describe what to extract

# The package automatically uses these files
process_documents("pdfs/", "records.db")

Schema requirements:

  • A top-level records property containing an array of objects
  • Every field defined with a type and a description
  • JSON Schema draft-07 format
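An illustrative schema.json meeting these requirements (the field names here are invented for the example; define your own):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "records": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "species": {
            "type": "string",
            "description": "Scientific name of the organism observed"
          },
          "abundance": {
            "type": "number",
            "description": "Reported count or density"
          }
        }
      }
    }
  },
  "required": ["records"]
}
```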

See the Configuration Guide for complete details and examples.

Package Functions

Exported functions fall into three groups: Workflow, Database Setup, and Data Access. See the function reference for the full list.

Testing

devtools::test()   # Run all tests
devtools::check()  # Run package checks

Integration tests require API keys in a .env file. See CONTRIBUTING.md for details.

File Structure

ecoextract/
├── R/
│   ├── workflow.R          # Main process_documents() workflow + skip/cascade logic
│   ├── ocr.R               # OCR processing
│   ├── metadata.R          # Publication metadata extraction
│   ├── extraction.R        # Data extraction functions
│   ├── refinement.R        # Data refinement functions
│   ├── deduplication.R     # Record deduplication (LLM, embedding, Jaccard)
│   ├── database.R          # Database operations
│   ├── getters.R           # Data access functions (get_*, export_db)
│   ├── config_loader.R     # Configuration file loading + init_ecoextract()
│   ├── prompts.R           # Prompt loading
│   ├── utils.R             # Utilities
│   ├── config.R            # Package configuration
│   └── ecoextract-package.R # Package metadata
├── inst/
│   ├── extdata/            # Schema files
│   │   ├── schema.json
│   │   └── metadata_schema.json
│   └── prompts/            # System prompts
│       ├── extraction_prompt.md
│       ├── extraction_context.md
│       ├── metadata_prompt.md
│       ├── metadata_context.md
│       ├── refinement_prompt.md
│       ├── refinement_context.md
│       └── deduplication_prompt.md
├── tests/testthat/         # Tests
├── vignettes/              # Package vignettes
├── DESCRIPTION
├── NAMESPACE
├── CONTRIBUTING.md         # Development guide
└── README.md

Tech Stack

R Packages

  • ellmer - Structured LLM outputs
  • ohseer - OCR processing
  • dplyr - Data manipulation
  • DBI & RSQLite - Database operations
  • jsonlite - JSON handling
  • glue - String interpolation
  • stringr & stringi - String manipulation
  • digest - Hashing
  • tidyllm - LLM deduplication

External APIs

  • Tensorlake - OCR processing (via ohseer)
  • Anthropic Claude / OpenAI / other LLM providers - Data extraction and refinement (via ellmer)

License

GPL (>= 3)