Single Vector Element Standardization

Finding harmony within oneself

Standardizing data vectors is an important step in data science. For example, different data sources often represent the same entity, such as country names, in different ways or with different spellings. In addition, small errors or formatting inconsistencies can complicate joins and lead to difficulties in downstream analysis. While excellent resources do exist to solve this problem for individual use cases, such as the countries package in R, it can be useful to understand general methods to self-standardize categorical vectors. There are three approaches that I’ve used to standardize character vectors. I’ve written scripts below to illustrate each method.

The vector to standardize

Note: this data is fictional!

| disease_name       | outbreaks |
|:-------------------|----------:|
| Influenza          |        15 |
| Inflenza           |        10 |
| COVID-19           |        95 |
| sars-covid-19      |        90 |
| Malaria            |        20 |
| Maleria            |        22 |
| malaria            |        20 |
| Diabetes           |        30 |
| Diabetis           |        28 |
| HIV/AIDS           |        75 |
| HIV                |        70 |
| AIDS               |        65 |
| Tuberculosis       |        40 |
| Tuberclosis        |        38 |
| Alzheimers         |        23 |
| Alzheimers Disease |        27 |
| Heart Disease      |        60 |
| Heart Diease       |        58 |
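
Throughout this post the data lives in a data frame called outbreak_data. If you want to follow along, here is a minimal setup sketch (my assumed construction, built from the table above, along with the packages the later chunks lean on):

library(tidyverse)   # dplyr, purrr, tibble, ggplot2
library(stringdist)

# The fictional outbreak data shown above
outbreak_data <- tribble(
  ~disease_name,        ~outbreaks,
  "Influenza",          15,
  "Inflenza",           10,
  "COVID-19",           95,
  "sars-covid-19",      90,
  "Malaria",            20,
  "Maleria",            22,
  "malaria",            20,
  "Diabetes",           30,
  "Diabetis",           28,
  "HIV/AIDS",           75,
  "HIV",                70,
  "AIDS",               65,
  "Tuberculosis",       40,
  "Tuberclosis",        38,
  "Alzheimers",         23,
  "Alzheimers Disease", 27,
  "Heart Disease",      60,
  "Heart Diease",       58
)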

Standardization by similarity matching

The most straightforward way to standardize a character vector is to use string distance, a measure of how different two strings are. The method I tend to use is Jaro–Winkler distance. The distance returned is normalized so that a score of 0 represents an exact match between two strings and a 1 represents maximal difference. In R, the main package for calculating string distances is the stringdist package.
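To get a feel for the measure, here is a quick check on a couple of the disease names (the values agree with the full distance matrix shown later in this post):

# Jaro–Winkler distance: 0 = identical, 1 = maximally different
round(stringdist::stringdist("Influenza", "Inflenza", method = "jw"), 3)
## [1] 0.037
round(stringdist::stringdist("Influenza", "Malaria", method = "jw"), 3)
## [1] 0.497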

For this use case we need an algorithm that moves sequentially down the vector, finds the first earlier element within a pre-specified string distance, and adopts it as the new standard; graphically, it walks the vector from top to bottom, folding each element into the set of standards seen so far. To walk down the vector we could either use for loops or some kind of recursive function. Fortunately, there is a function in the purrr package that is perfect for the job: reduce! To use either approach, we first need to define the function to apply between the accumulated vector and each individual element. I'm going to call the function fuzzy_match, though that name may already be in use elsewhere.

# This is pretty cool!
fuzzy_match <- function(vector, element, max_dist = 0.1) {

  # Candidate standards: everything accumulated so far that isn't
  # an exact copy of the incoming element
  key <- vector[vector != element]
  dist <- stringdist::stringdist(element, key, method = "jw")

  # Guard against NULL/NA elements and empty candidate sets; if the
  # closest candidate is within max_dist, adopt it as the standard
  if (!is.null(element) && !is.na(element) &&
      length(dist) > 0 && any(!is.na(dist)) &&
      min(dist, na.rm = TRUE) < max_dist) {
    return(c(vector, key[which.min(dist)]))
  }

  # If no match within max_dist is found, keep the original element
  c(vector, element)
}
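
To see what a single step does, here are a couple of calls by hand; the accumulated vector is the first argument and the incoming element is the second:

# "Inflenza" is within 0.1 of "Influenza", so the standard spelling is adopted
fuzzy_match(c("Influenza"), "Inflenza")
## [1] "Influenza" "Influenza"

# Nothing accumulated so far is close to "COVID-19", so it is kept as-is
fuzzy_match(c("Influenza", "Influenza"), "COVID-19")
## [1] "Influenza" "Influenza" "COVID-19"

reduce simply chains these steps across the whole vector.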

Now we just need to invoke it with the reduce function, like so:

# Slick, right?
outbreak_data <- outbreak_data |> 
  mutate(Similarity = reduce(disease_name, fuzzy_match, max_dist = 0.1))

which produces the following:

| disease_name       | Similarity         | outbreaks |
|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          |        10 |
| COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            |        20 |
| Maleria            | Malaria            |        22 |
| malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                |        70 |
| AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease |        27 |
| Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      |        58 |

Notice how not everything was standardized? That's because we chose a relatively stringent maximum string distance. Let's loosen it to max_dist = 0.3 and compare.
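A sketch of that second pass, storing the result in its own column:

outbreak_data <- outbreak_data |>
  mutate(Similarity_0.3 = reduce(disease_name, fuzzy_match, max_dist = 0.3))

which produces the following: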

| disease_name       | Similarity         | Similarity_0.3     | outbreaks |
|:-------------------|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          | Influenza          |        10 |
| COVID-19           | COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            | Malaria            |        20 |
| Maleria            | Malaria            | Malaria            |        22 |
| malaria            | Malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                | HIV/AIDS           |        70 |
| AIDS               | AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers         |        27 |
| Heart Disease      | Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      | Heart Disease      |        58 |

The weakness of this approach is that it is not always clear what the best maximum distance is. If you set it high enough you might match difficult cases, such as COVID-19 and sars-covid-19, but you run the risk of changing something too far, like matching Heart Disease to Malaria. It isn't always clear where that border should be drawn. Another problem is that this algorithm simply adopts the first match it encounters, not the best match. It could just as easily settle on converting everything to Maleria instead of Malaria; it doesn't know which of two strings is better, only how different they are from each other.
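One hypothetical mitigation, which I won't pursue further in this post: instead of adopting the first match encountered, standardize each element to the most frequent spelling among all variants within max_dist, so the data itself votes on the standard. A rough sketch (fuzzy_match_freq is my own illustrative name, not something from stringdist):

# Hypothetical variant: map each element to the most common spelling
# found within max_dist of it (ties go to the first name in the table)
fuzzy_match_freq <- function(x, max_dist = 0.1) {
  counts <- table(x)
  vapply(x, function(el) {
    d <- stringdist::stringdist(el, names(counts), method = "jw")
    close <- names(counts)[d < max_dist]  # always includes el itself (d = 0)
    close[which.max(counts[close])]
  }, character(1), USE.NAMES = FALSE)
}

Ties still have to break somehow, so this doesn't fully solve the problem, but at least a misspelling can no longer win just by appearing first.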

Standardization by hierarchical clustering

Another way to use string distances to group like elements is hierarchical clustering. Agglomerative hierarchical clustering, the kind implemented by R's hclust(), iteratively merges elements into progressively larger clusters based on some measure of distance. The method operates on a matrix of pairwise distances, and looking at that matrix can give you a sense of why some of the disease names were or were not standardized by the approach above.

dist_matrix <- stringdist::stringdistmatrix(outbreak_data$disease_name, method = "jw")
mat <- dist_matrix |> as.matrix()
rownames(mat) <- outbreak_data$disease_name
colnames(mat) <- outbreak_data$disease_name
knitr::kable(round(mat, 3))

|                    | Influenza | Inflenza | COVID-19 | sars-covid-19 | Malaria | Maleria | malaria | Diabetes | Diabetis | HIV/AIDS | HIV | AIDS | Tuberculosis | Tuberclosis | Alzheimers | Alzheimers Disease | Heart Disease | Heart Diease |
|:-------------------|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Influenza          | 0.000 | 0.037 | 0.588 | 1.000 | 0.497 | 0.413 | 0.497 | 0.588 | 0.588 | 0.588 | 0.519 | 0.546 | 0.639 | 0.631 | 0.526 | 0.528 | 0.541 | 0.537 |
| Inflenza           | 0.037 | 0.000 | 0.583 | 1.000 | 0.488 | 0.399 | 0.488 | 0.583 | 0.583 | 0.583 | 0.514 | 0.542 | 0.694 | 0.689 | 0.553 | 0.509 | 0.532 | 0.528 |
| COVID-19           | 0.588 | 0.583 | 0.000 | 0.465 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.528 | 0.528 | 0.417 | 1.000 | 1.000 | 1.000 | 0.606 | 0.599 | 0.597 |
| sars-covid-19      | 1.000 | 1.000 | 0.465 | 0.000 | 0.447 | 0.447 | 0.447 | 0.532 | 0.576 | 1.000 | 1.000 | 1.000 | 0.453 | 0.487 | 0.608 | 0.701 | 0.545 | 0.506 |
| Malaria            | 0.497 | 0.488 | 1.000 | 0.447 | 0.000 | 0.095 | 0.095 | 0.577 | 0.488 | 1.000 | 1.000 | 1.000 | 0.552 | 0.544 | 0.535 | 0.579 | 0.374 | 0.365 |
| Maleria            | 0.413 | 0.399 | 1.000 | 0.447 | 0.095 | 0.000 | 0.190 | 0.488 | 0.399 | 1.000 | 1.000 | 1.000 | 0.490 | 0.480 | 0.426 | 0.485 | 0.367 | 0.356 |
| malaria            | 0.497 | 0.488 | 1.000 | 0.447 | 0.095 | 0.190 | 0.000 | 0.577 | 0.488 | 1.000 | 1.000 | 1.000 | 0.552 | 0.544 | 0.535 | 0.569 | 0.374 | 0.365 |
| Diabetes           | 0.588 | 0.583 | 1.000 | 0.532 | 0.577 | 0.488 | 0.577 | 0.000 | 0.083 | 1.000 | 1.000 | 0.542 | 0.458 | 0.451 | 0.450 | 0.509 | 0.463 | 0.386 |
| Diabetis           | 0.588 | 0.583 | 1.000 | 0.576 | 0.488 | 0.399 | 0.488 | 0.083 | 0.000 | 1.000 | 1.000 | 0.542 | 0.472 | 0.462 | 0.553 | 0.593 | 0.397 | 0.386 |
| HIV/AIDS           | 0.588 | 0.583 | 0.528 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.208 | 0.542 | 1.000 | 1.000 | 0.592 | 0.546 | 0.532 | 0.528 |
| HIV                | 0.519 | 0.514 | 0.528 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.208 | 0.000 | 0.472 | 1.000 | 1.000 | 1.000 | 1.000 | 0.530 | 0.528 |
| AIDS               | 0.546 | 0.542 | 0.417 | 1.000 | 1.000 | 1.000 | 1.000 | 0.542 | 0.542 | 0.542 | 0.472 | 0.000 | 1.000 | 1.000 | 0.550 | 0.565 | 0.558 | 0.556 |
| Tuberculosis       | 0.639 | 0.694 | 1.000 | 0.453 | 0.552 | 0.490 | 0.552 | 0.458 | 0.472 | 1.000 | 1.000 | 1.000 | 0.000 | 0.028 | 0.547 | 0.528 | 0.466 | 0.528 |
| Tuberclosis        | 0.631 | 0.689 | 1.000 | 0.487 | 0.544 | 0.480 | 0.544 | 0.451 | 0.462 | 1.000 | 1.000 | 1.000 | 0.028 | 0.000 | 0.537 | 0.513 | 0.454 | 0.518 |
| Alzheimers         | 0.526 | 0.553 | 1.000 | 0.608 | 0.535 | 0.426 | 0.535 | 0.450 | 0.553 | 0.592 | 1.000 | 0.550 | 0.547 | 0.537 | 0.000 | 0.148 | 0.505 | 0.461 |
| Alzheimers Disease | 0.528 | 0.509 | 0.606 | 0.701 | 0.579 | 0.485 | 0.569 | 0.509 | 0.593 | 0.546 | 1.000 | 0.565 | 0.528 | 0.513 | 0.148 | 0.000 | 0.375 | 0.398 |
| Heart Disease      | 0.541 | 0.532 | 0.599 | 0.545 | 0.374 | 0.367 | 0.374 | 0.463 | 0.397 | 0.532 | 0.530 | 0.558 | 0.466 | 0.454 | 0.505 | 0.375 | 0.000 | 0.067 |
| Heart Diease       | 0.537 | 0.528 | 0.597 | 0.506 | 0.365 | 0.356 | 0.365 | 0.386 | 0.386 | 0.528 | 0.528 | 0.556 | 0.528 | 0.518 | 0.461 | 0.398 | 0.067 | 0.000 |
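
Pulling a couple of entries out of the matrix shows the earlier results directly: COVID-19 and sars-covid-19 sit too far apart even for max_dist = 0.3, while the Heart Disease misspelling is an easy match.

round(mat["COVID-19", "sars-covid-19"], 3)
## [1] 0.465
round(mat["Heart Disease", "Heart Diease"], 3)
## [1] 0.067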

Now that we have a matrix containing the distances between every pair of elements, we need to group them. For that we can use the hclust() function from R's built-in stats package.

hc <- hclust(dist_matrix, "ward.D2")
hc$labels <- outbreak_data$disease_name
ggdendro::ggdendrogram(hc)

There are several agglomeration methods to choose from. The one used here is ward.D2, which minimizes the total within-cluster variance. Another alternative for this task might be average linkage, which merges clusters based on the average distance between their points. Unfortunately, as with the similarity method, a choice must be made for a cutoff threshold, this time based on tree height instead of maximum string distance.
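Since the right height isn't obvious up front, one quick probe is to count how many clusters a few candidate heights produce and pick one that lands near the number of distinct diseases you'd expect by eye (roughly a dozen here). A sketch:

# How many clusters does each candidate cut height yield?
sapply(c(0.1, 0.2, 0.3, 0.4), function(h) length(unique(cutree(hc, h = h))))

Here I'll cut at h = 0.2 and take the first name in each cluster as its standard.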

clusters <- cutree(hc, h = 0.2) |> 
  enframe(name = "disease_name", value = "cluster") |> 
  group_by(cluster) |>
  mutate(HC = disease_name[1]) |> 
  ungroup()

outbreak_data <- outbreak_data |> left_join(clusters |> select(-cluster))
## Joining with `by = join_by(disease_name)`
knitr::kable(outbreak_data |> select(-outbreaks, everything(), outbreaks))

| disease_name       | Similarity         | Similarity_0.3     | HC                 | outbreaks |
|:-------------------|:-------------------|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          | Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          | Influenza          | Influenza          |        10 |
| COVID-19           | COVID-19           | COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            | Malaria            | Malaria            |        20 |
| Maleria            | Malaria            | Malaria            | Malaria            |        22 |
| malaria            | Malaria            | Malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           | Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           | Diabetes           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                | HIV/AIDS           | HIV                |        70 |
| AIDS               | AIDS               | AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis       | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         | Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers         | Alzheimers         |        27 |
| Heart Disease      | Heart Disease      | Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      | Heart Disease      | Heart Disease      |        58 |

Standardization by natural language processing (NLP)

Finally, we can turn to natural language processing to standardize things for us. This approach has both advantages and disadvantages. One advantage is that large language models like ChatGPT may have an idea of which of two strings is better, given the context. That's a pretty big bonus. A disadvantage is that the results you get from these models are not as reproducible as those from a purely rules-based, algorithmic approach. These pros and cons represent a trade-off: in order to benefit from the model's knowledge of proper nomenclature, you give up a clear understanding of why each element was standardized the way it was.

There is also an element of trust that the results returned were not altered or biased in any way by the model. This isn't an issue with small datasets, but it could be a real concern when dealing with tables with millions of rows, where it can be very difficult to tell whether the model happens to be biased against some of the results. Perhaps it is much more stringent about matching names starting with the letter 'J' than those that start with 'S'. Who knows what evil lurks in the hearts of LLMs? The Shadow knows.

To perform vector harmonization using NLP in R we can turn to the openai package, which is a wrapper around OpenAI's API. The next step is to acquire an OpenAI API key, available from your OpenAI account settings. The key needs to be loaded as an environment variable, either with Sys.setenv(OPENAI_API_KEY=…) or via a project .env file. If you go the .env route, it is worth looking into git-crypt to secure the file, and make sure to do so before you add your key!
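For example (the key below is a placeholder, and dotenv is one package that can load a .env file into the environment):

# Register the key for the current R session
Sys.setenv(OPENAI_API_KEY = "<your-api-key>")

# Or load it from a project .env file
dotenv::load_dot_env(file = ".env")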

Once the API key is set, we still have a few more things to set up: choosing which model to use, telling the model how it should act, and formatting our query appropriately. In R this involves generating a nested list of messages.

messages <- list(
  list("role" = "system",
       "content" = "you act as a function that standardizes a provided vector and returns a vector of equal length formatted as a JSON object."),
  list("role" = "user",
       "content" = "The following vector contains disease names. Please alter the following disease names to remove minor errors and formatting inconsistencies and to standardize on the appropriate disease name."),
  list("role" = "user",
       "content" = paste(outbreak_data$disease_name, collapse = ","))
)

Now we can submit our query, extract the data from the response, and add a column of standardized disease names. Sometimes, however, the model will return malformed JSON, or will reply with a paragraph of text instead of a structured object, and then the whole thing won't work. It's also not easy to tell the model how aggressive it should be in standardizing. But hey, if it doesn't work we can always re-submit the query until it does!

response <- create_chat_completion(
    model = "gpt-3.5-turbo",
    messages = messages) |>
  bind_cols() 

# If the model returns a malformed response, just keep trying until it gets it right.
for (try in 1:100) {
  tryCatch({
    cat(paste("Attempt:", try, "\n"))
    nlp_standardized <- response$message.content |>
      jsonlite::fromJSON() |>
      unlist()
    outbreak_data <- outbreak_data |> mutate(NLP = nlp_standardized)
    cat("Success!\n")
    break
  }, error = function(e) {
    cat("Error encountered:", conditionMessage(e), "\nRetrying...\n")
    # <<- so the fresh response is visible on the next loop iteration
    response <<- create_chat_completion(
      model = "gpt-3.5-turbo",
      messages = messages) |>
      bind_cols()
  })
}
## Attempt: 1 
## Success!
knitr::kable(outbreak_data |> select(-outbreaks, everything(), outbreaks))

| disease_name       | Similarity         | Similarity_0.3 | HC            | NLP                 | outbreaks |
|:-------------------|:-------------------|:---------------|:--------------|:--------------------|----------:|
| Influenza          | Influenza          | Influenza      | Influenza     | Influenza           |        15 |
| Inflenza           | Influenza          | Influenza      | Influenza     | Influenza           |        10 |
| COVID-19           | COVID-19           | COVID-19       | COVID-19      | COVID-19            |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19  | sars-covid-19 | COVID-19            |        90 |
| Malaria            | Malaria            | Malaria        | Malaria       | Malaria             |        20 |
| Maleria            | Malaria            | Malaria        | Malaria       | Malaria             |        22 |
| malaria            | Malaria            | Malaria        | Malaria       | Malaria             |        20 |
| Diabetes           | Diabetes           | Diabetes       | Diabetes      | Diabetes            |        30 |
| Diabetis           | Diabetes           | Diabetes       | Diabetes      | Diabetes            |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS       | HIV/AIDS      | HIV/AIDS            |        75 |
| HIV                | HIV                | HIV/AIDS       | HIV           | HIV/AIDS            |        70 |
| AIDS               | AIDS               | AIDS           | AIDS          | HIV/AIDS            |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis   | Tuberculosis  | Tuberculosis        |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis   | Tuberculosis  | Tuberculosis        |        38 |
| Alzheimers         | Alzheimers         | Alzheimers     | Alzheimers    | Alzheimer’s Disease |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers     | Alzheimers    | Alzheimer’s Disease |        27 |
| Heart Disease      | Heart Disease      | Heart Disease  | Heart Disease | Heart Disease       |        60 |
| Heart Diease       | Heart Disease      | Heart Disease  | Heart Disease | Heart Disease       |        58 |