Extract Metadata from AWS Textract Response
This convenience function parses AWS Textract output to extract citation metadata and other structured information from document headers. It is useful for pulling titles, authors, DOIs, journal names, dates, and similar fields from academic papers.
Value
List with the following structure:
- text
Character string. Full document text with line breaks.
- key_value_pairs
Data frame with columns: key, value, confidence. Contains extracted metadata like "Title:", "Author:", etc.
- tables
List of data frames, one per table.
- pages
Integer. Number of pages processed.
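To make the structure above concrete, here is a minimal sketch that builds a mock object with the same four elements and keeps only the key-value pairs above a confidence threshold. The field values are invented for illustration, and the 0-100 confidence scale is an assumption based on how the Textract API reports confidence; the function may rescale it.

```r
# Mock object mirroring the documented return structure.
# All values are invented; this is not real Textract output.
metadata <- list(
  text = "Title: An Example Paper\nAuthor: A. Researcher",
  key_value_pairs = data.frame(
    key = c("Title:", "Author:", "DOI:"),
    value = c("An Example Paper", "A. Researcher", "10.0000/example"),
    confidence = c(98.2, 91.5, 42.0),  # assumed 0-100 scale
    stringsAsFactors = FALSE
  ),
  tables = list(),  # no tables in this mock document
  pages = 1L
)

# Keep only pairs at or above a chosen confidence threshold.
min_confidence <- 80
reliable <- metadata$key_value_pairs[
  metadata$key_value_pairs$confidence >= min_confidence, , drop = FALSE
]
reliable$key
```

The threshold of 80 is arbitrary; a suitable cutoff depends on scan quality and on how costly a wrong field is downstream.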
Examples
if (FALSE) { # \dontrun{
# Process document with Textract
result <- textract_ocr("paper.pdf")
# Extract citation metadata
metadata <- textract_extract_metadata(result)
# Access extracted key-value pairs (e.g., Title, Authors, DOI)
metadata$key_value_pairs
} # }
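Because extracted keys often carry trailing colons or inconsistent capitalisation, a small lookup helper can make individual fields easier to retrieve. The sketch below is hypothetical: `get_field()` is not part of the package, and the sample data frame only imitates the documented `key_value_pairs` columns.

```r
# Hypothetical helper (not part of the package): look up one metadata field
# by name, ignoring case, surrounding whitespace, and a trailing colon.
get_field <- function(key_value_pairs, name) {
  keys <- tolower(trimws(sub(":\\s*$", "", key_value_pairs$key)))
  hits <- key_value_pairs$value[keys == tolower(name)]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Sample data imitating the documented key_value_pairs columns (invented values).
kv <- data.frame(
  key = c("Title:", "DOI: "),
  value = c("An Example Paper", "10.0000/example"),
  confidence = c(98.2, 95.0),
  stringsAsFactors = FALSE
)

get_field(kv, "doi")      # value stored under "DOI: "
get_field(kv, "journal")  # NA, because no such key was extracted
```

Returning `NA_character_` for a missing key, rather than raising an error, keeps the helper convenient when processing many documents whose headers differ.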