Understanding Accuracy Metrics in EcoExtract
Source:ACCURACY.md
Overview
EcoExtract calculates accuracy by analyzing human edits to model-extracted records. When you review documents using ecoreview, all edits are tracked in an audit table (record_edits). The calculate_accuracy() function analyzes these edits to provide comprehensive accuracy metrics.
Philosophy: Two Separate Questions
Traditional accuracy metrics treat extraction as all-or-nothing: a record is either “correct” or “incorrect.” But in practice, extraction quality has two independent dimensions:
- Record Detection: Did the model find the record, or did it miss it? Did it hallucinate records that don’t exist?
- Field Accuracy: Of the records the model found, how accurate were the individual fields?
EcoExtract separates these concerns to give you a more nuanced understanding of model performance.
Core Concepts
Ground Truth
Ground truth is established through human review: - Model extractions (records with added_by_user = 0 or NULL) - Human additions (records with added_by_user = 1) indicate records the model missed - Soft deletes (records with deleted_by_user set) indicate hallucinations - Field edits (tracked in record_edits table) indicate which specific fields were wrong
Only documents marked as reviewed (reviewed_at IS NOT NULL) are included in accuracy calculations.
Metrics Explained
Raw Counts
These are the fundamental measurements from which all other metrics derive:
- verified_documents: Number of documents that have been reviewed
- verified_records: Total records across all verified documents
- model_extracted: Records the model extracted (may include hallucinations)
- human_added: Records humans added because model missed them (false negatives)
- deleted: Records humans soft-deleted because model hallucinated them (false positives)
- records_with_edits: Records that had at least one field corrected
- column_edits: Count of edits per column (named vector)
Record Detection Metrics
These answer: “Did the model find the records?”
Records Classification
-
records_found:
model_extracted - deleted- True positives: Records the model found (even if imperfect)
-
records_missed:
human_added- False negatives: Records that exist but model didn’t find
-
records_hallucinated:
deleted- False positives: Records model made up that don’t exist
Detection Performance
-
detection_precision:
records_found / model_extracted- Of everything the model extracted, what fraction was real (not hallucinated)?
- High precision means few hallucinations
-
detection_recall:
records_found / (records_found + records_missed)- Of all true records, what fraction did the model find?
- High recall means few missed records
-
perfect_record_rate:
(records_found - records_with_edits) / records_found- Of records the model found, what fraction had zero errors?
- Measures extraction quality for found records
Field-Level Metrics
These answer: “How accurate were the individual fields?”
Field-level metrics provide partial credit. If a record has 10 fields and the model got 9 correct, it’s credited with 9 correct fields rather than being treated as completely wrong.
Field Calculations
-
total_fields:
model_extracted × num_fields- Total fields the model attempted to extract
-
correct_fields:
total_fields - deleted_fields - edited_fields- Fields that were extracted correctly
- deleted_fields =
deleted × num_fields(all fields in deleted records are wrong) - edited_fields =
sum(column_edits)(fields that humans corrected)
Field Performance
-
field_precision:
correct_fields / total_fields- Of all fields the model extracted, what fraction was correct?
-
field_recall:
correct_fields / true_fields- true_fields =
(records_found × num_fields) + (human_added × num_fields) - Of all true fields that should exist, what fraction did we extract correctly?
- true_fields =
-
field_f1: Harmonic mean of field precision and recall
2 × precision × recall / (precision + recall)- Balanced measure combining both metrics
Per-Column Accuracy
-
column_accuracy: Named vector showing accuracy for each column
- Formula:
1 - (column_edits / model_extracted) - Example: If 100 records extracted and “species” field edited 15 times, species accuracy = 0.85
- Shows which fields are hardest for the model to extract
- Helps identify where to improve prompts or schemas
- Formula:
Edit Severity Metrics
Not all edits are equal. EcoExtract classifies edits based on the schema:
Major vs Minor Edits
Major edits affect fields that identify or are required for the record: - Unique fields (x-unique-fields in schema): Fields that identify the record - Example: species names, dates, locations that distinguish one record from another - Required fields (required in schema): Fields mandatory for record validity - Example: supporting sentences, organisms_identifiable
Minor edits affect optional descriptive fields: - Fields that provide additional detail but aren’t core to record identity - Example: publication_year, page_number
Severity Metrics
- major_edits: Count of edits to unique or required fields
- minor_edits: Count of edits to other fields
-
major_edit_rate:
major_edits / (major_edits + minor_edits)- Closer to 1.0 means most errors are in critical fields
- Closer to 0.0 means most errors are in minor details
-
avg_edits_per_document:
total_edits / verified_documents- On average, how many field corrections per document?
- Lower is better
Example Scenario
Say you review a document with 10 true records:
Model Performance: - Extracts 12 records - 7 are perfect (no edits needed) - 1 has 2 minor field edits - 1 has 1 major field edit (wrong species name) - 1 is missing (you add it) - 2 are hallucinations (you delete them)
Each record has 8 fields total: - 4 unique fields (species names, dates) - 2 required fields (supporting sentences) - 2 optional fields (page number, publication year)
Calculated Metrics
Raw Counts: - model_extracted = 12 - human_added = 1 - deleted = 2 - records_with_edits = 2 - total_edits = 3 (2 minor + 1 major)
Record Detection: - records_found = 12 - 2 = 10 - records_missed = 1 - records_hallucinated = 2 - detection_precision = 10/12 = 0.833 (83.3%) - detection_recall = 10/11 = 0.909 (90.9%) - perfect_record_rate = 7/10 = 0.70 (70%)
Field-Level: - total_fields = 12 × 8 = 96 fields attempted - deleted_fields = 2 × 8 = 16 fields (in hallucinated records) - edited_fields = 3 fields corrected - correct_fields = 96 - 16 - 3 = 77 fields - field_precision = 77/96 = 0.802 (80.2%) - true_fields = (10 × 8) + (1 × 8) = 88 fields should exist - field_recall = 77/88 = 0.875 (87.5%)
Edit Severity: - major_edits = 1 (the species name) - minor_edits = 2 (the two optional field edits) - major_edit_rate = 1/3 = 0.333 (33.3% of errors are major) - avg_edits_per_document = 3/1 = 3.0
Interpretation
This model: - Good detection: Found 91% of records, only hallucinated 17% - Decent extraction: 80% of fields correct, giving partial credit for mostly-correct records - Room for improvement: 30% of found records need corrections - Encouraging error profile: Only 33% of errors are in critical identifying fields
Usage
library(ecoextract)
# Calculate accuracy for all verified documents
accuracy <- calculate_accuracy("my_database.db")
# View key metrics
accuracy$detection_recall # Did we find the records?
accuracy$field_precision # How accurate were the fields?
accuracy$major_edit_rate # How serious were the errors?
# Calculate accuracy for specific documents
accuracy <- calculate_accuracy("my_database.db", document_ids = c(1, 2, 3))
# Use custom schema location
accuracy <- calculate_accuracy("my_database.db", schema_file = "custom_schema.json")Visualizing Accuracy
Accuracy visualizations (confusion matrices, field accuracy heatmaps) are available in the ecoreview Shiny app, which provides an interactive interface for reviewing extractions and viewing accuracy metrics.
Design Decisions
Why Field-Level Instead of Record-Level?
Problem with record-level: A record with 1 wrong field out of 10 is treated the same as a completely wrong record (0% vs 90% correct).
Field-level solution: Gives partial credit. The 90%-correct record contributes 9 correct fields and 1 incorrect field to the metrics.
This is more informative and better reflects the actual quality of extraction.
Why Separate Detection from Accuracy?
Finding a record is different from extracting it correctly: - A model might find all records but extract many fields incorrectly (high recall, low field precision) - A model might only extract records it’s very confident about (low recall, high field precision)
Separating these dimensions helps diagnose where to improve.
Technical Notes
Assumptions
- All records have the same number of fields (defined by schema)
- Nullable fields still count as fields for accuracy calculation
- Edit severity requires schema with
x-unique-fieldsandrequiredspecifications
Related Documentation
- ecoreview README - Review and edit records
- SCHEMA_GUIDE.md - Define custom schemas
- Vignette - Complete workflow tutorial
Questions?
Open an issue on GitHub.