This convenience function parses AWS Textract output to extract citation metadata and other structured information from document headers. It is useful for pulling titles, authors, DOIs, journal names, and dates from academic papers.

Usage

textract_extract_metadata(textract_response)

Arguments

textract_response

List. Response from textract_ocr().

Value

List with the following structure:

text

Character string. Full document text with line breaks.

key_value_pairs

Data frame with columns key, value, and confidence. Holds the extracted key-value metadata, for example rows whose key is "Title:" or "Author:".

tables

List of data frames, one per table.

pages

Integer. Number of pages processed.

Author

Nathan C. Layman

Examples

if (FALSE) { # \dontrun{
# Process document with Textract
result <- textract_ocr("paper.pdf")

# Extract citation metadata
metadata <- textract_extract_metadata(result)

# Access extracted key-value pairs (e.g., Title, Authors, DOI)
metadata$key_value_pairs
} # }
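
The key_value_pairs data frame can be searched like any other data frame. The sketch below shows one possible way to look up a single field by name. It does not call AWS: the metadata list is a hand-built stand-in that only mirrors the structure documented under Value, and all values in it are made up. The helper get_field() and its min_confidence argument are hypothetical and not part of this package. The 0-100 confidence scale is an assumption as well.

```r
# Hypothetical stand-in for the return value of textract_extract_metadata(),
# built by hand to match the documented structure. All values are illustrative.
metadata <- list(
  text = "Example Title\nA. Author",
  key_value_pairs = data.frame(
    key = c("Title:", "Author:", "DOI:"),
    value = c("Example Title", "A. Author", "10.0000/example"),
    confidence = c(98.5, 91.2, 60.3),  # assumed to be on a 0-100 scale
    stringsAsFactors = FALSE
  ),
  tables = list(),
  pages = 1L
)

# Hypothetical helper (not part of the package): return the value for a key,
# ignoring case, surrounding whitespace, and a trailing colon.
get_field <- function(kv, field, min_confidence = 0) {
  keys <- tolower(sub(":\\s*$", "", trimws(kv$key)))
  hit <- keys == tolower(field) & kv$confidence >= min_confidence
  if (!any(hit)) {
    return(NA_character_)
  }
  # If several rows match, keep the one with the highest confidence
  idx <- which(hit)
  kv$value[idx[which.max(kv$confidence[idx])]]
}

title <- get_field(metadata$key_value_pairs, "title")
doi <- get_field(metadata$key_value_pairs, "doi", min_confidence = 80)
```

Here title holds the matched value, while doi is NA because the only DOI row falls below the requested confidence threshold. Returning NA rather than raising an error keeps the helper convenient when looping over many documents, though a stricter variant could stop on missing fields.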