Process PDFs through the complete pipeline: OCR → Metadata → Extract → Refine
Usage
process_documents(
pdf_path,
db_conn = "ecoextract_records.db",
schema_file = NULL,
extraction_prompt_file = NULL,
refinement_prompt_file = NULL,
model = "anthropic/claude-sonnet-4-5",
ocr_provider = "tensorlake",
ocr_timeout = 60,
force_reprocess_ocr = NULL,
force_reprocess_metadata = NULL,
force_reprocess_extraction = NULL,
run_extraction = TRUE,
run_refinement = NULL,
min_similarity = 0.9,
embedding_provider = "openai",
similarity_method = "llm",
recursive = FALSE,
workers = NULL,
log = FALSE,
...
)
Arguments
- pdf_path
Path to a single PDF file or a directory containing PDFs
- db_conn
Database connection (any DBI backend) or path to a SQLite database file. If a path is provided, the SQLite database is created if it doesn't exist. If a connection is provided, the tables must already exist (run init_ecoextract_database() first).
- schema_file
Optional custom schema file
- extraction_prompt_file
Optional custom extraction prompt
- refinement_prompt_file
Optional custom refinement prompt
- model
LLM model(s) to use for metadata extraction, record extraction, and refinement. Can be a single model name (character string) or a vector of models for tiered fallback. When a vector is provided, models are tried sequentially until one succeeds. Default: "anthropic/claude-sonnet-4-5". Examples: "openai/gpt-4o", c("anthropic/claude-sonnet-4-5", "mistral/mistral-large-latest")
- ocr_provider
OCR provider to use (default: "tensorlake"). Options: "tensorlake", "mistral", "claude"
- ocr_timeout
Maximum seconds to wait for OCR completion (default: 60)
- force_reprocess_ocr
Controls OCR reprocessing. NULL (default) uses normal skip logic, TRUE forces all documents, or an integer vector of document_ids to force specific documents.
- force_reprocess_metadata
Controls metadata reprocessing. NULL (default) uses normal skip logic, TRUE forces all documents, or an integer vector of document_ids to force specific documents.
- force_reprocess_extraction
Controls extraction reprocessing. NULL (default) uses normal skip logic, TRUE forces all documents, or an integer vector of document_ids to force specific documents.
- run_extraction
If TRUE, run extraction step to find new records. Default TRUE.
- run_refinement
Controls refinement step. NULL (default) skips refinement, TRUE runs on all documents with records, or an integer vector of document_ids to refine only specific documents.
- min_similarity
Minimum similarity threshold above which two records are treated as duplicates during deduplication (default: 0.9)
- embedding_provider
Provider for embeddings when similarity_method = "embedding" (default: "openai")
- similarity_method
Method for deduplication similarity: "embedding", "jaccard", or "llm" (default: "llm")
- recursive
If TRUE and pdf_path is a directory, search for PDFs in all subdirectories. Default FALSE.
- workers
Number of parallel workers. NULL (default) or 1 for sequential processing. Values > 1 require the crew package and db_conn must be a file path (not a connection object).
- log
If TRUE and using parallel processing (workers > 1), write detailed output to an auto-generated log file (e.g., ecoextract_20240129_143052.log). Default FALSE. Ignored for sequential processing. Useful for troubleshooting errors.
- ...
Additional arguments (deprecated: use explicit parameters instead)
Examples
if (FALSE) { # \dontrun{
# Basic usage - process new PDFs
process_documents("pdfs/")
process_documents("paper.pdf", "my_interactions.db")
# Remote database (Supabase, PostgreSQL, etc.)
library(RPostgres)
con <- dbConnect(Postgres(),
dbname = "your_db",
host = "db.xxx.supabase.co",
user = "postgres",
password = Sys.getenv("SUPABASE_PASSWORD")
)
# Initialize schema first
init_ecoextract_database(con)
# Then process documents
process_documents("pdfs/", db_conn = con)
dbDisconnect(con)
# Force re-run OCR for all documents (cascades to metadata and extraction)
process_documents("pdfs/", force_reprocess_ocr = TRUE)
# Force re-run OCR for specific documents only
process_documents("pdfs/", force_reprocess_ocr = c(5L, 12L))
# Force re-run metadata only (cascades to extraction)
process_documents("pdfs/", force_reprocess_metadata = TRUE)
# With custom schema and prompts
process_documents("pdfs/", "interactions.db",
schema_file = "ecoextract/schema.json",
extraction_prompt_file = "ecoextract/extraction_prompt.md")
# With refinement for all documents
process_documents("pdfs/", run_refinement = TRUE)
# Refinement for specific documents only
process_documents("pdfs/", run_refinement = c(5L, 12L))
# Skip extraction, refinement only on existing records
process_documents("pdfs/", run_extraction = FALSE, run_refinement = TRUE)
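# Tune deduplication with embedding-based similarity
# (sketch; the min_similarity value below is illustrative, not a recommendation)
process_documents("pdfs/",
  similarity_method = "embedding",
  embedding_provider = "openai",
  min_similarity = 0.85)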
# Search for PDFs in all subdirectories
process_documents("research_papers/", recursive = TRUE)
# Process in parallel with 4 workers (requires crew package)
process_documents("pdfs/", workers = 4)
# Parallel with logging for troubleshooting
process_documents("pdfs/", workers = 4, log = TRUE)
# Use different OCR provider
process_documents("pdfs/", ocr_provider = "mistral")
# Increase OCR timeout to 5 minutes for large documents
process_documents("pdfs/", ocr_timeout = 300)
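# Tiered model fallback: models are tried in order until one succeeds
process_documents("pdfs/",
  model = c("anthropic/claude-sonnet-4-5", "mistral/mistral-large-latest"))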
} # }