Single Vector Element Standardization

Finding harmony within oneself

Standardizing data vectors is an important step in data science. For example, different data sources often represent the same entity, such as country names, in different ways or with different spellings. In addition, small errors or formatting inconsistencies can complicate joins and lead to difficulties in downstream analysis. While excellent resources do exist to solve this problem for individual use cases, such as the countries package in R, it can be useful to understand general methods to self-standardize categorical vectors. There are three approaches that I’ve used to standardize character vectors. I’ve written scripts below to illustrate each method.

The vector to standardize

Note: this data is fictional!

| disease_name       | outbreaks |
|:-------------------|----------:|
| Influenza          |        15 |
| Inflenza           |        10 |
| COVID-19           |        95 |
| sars-covid-19      |        90 |
| Malaria            |        20 |
| Maleria            |        22 |
| malaria            |        20 |
| Diabetes           |        30 |
| Diabetis           |        28 |
| HIV/AIDS           |        75 |
| HIV                |        70 |
| AIDS               |        65 |
| Tuberculosis       |        40 |
| Tuberclosis        |        38 |
| Alzheimers         |        23 |
| Alzheimers Disease |        27 |
| Heart Disease      |        60 |
| Heart Diease       |        58 |
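
Throughout this post the data lives in a data frame called outbreak_data. If you want to follow along, here is a minimal setup sketch (my assumed construction, built from the table above, along with the packages the later chunks lean on):

library(tidyverse)   # dplyr, purrr, tibble, ggplot2
library(stringdist)

# The fictional outbreak data shown above
outbreak_data <- tribble(
  ~disease_name,        ~outbreaks,
  "Influenza",          15,
  "Inflenza",           10,
  "COVID-19",           95,
  "sars-covid-19",      90,
  "Malaria",            20,
  "Maleria",            22,
  "malaria",            20,
  "Diabetes",           30,
  "Diabetis",           28,
  "HIV/AIDS",           75,
  "HIV",                70,
  "AIDS",               65,
  "Tuberculosis",       40,
  "Tuberclosis",        38,
  "Alzheimers",         23,
  "Alzheimers Disease", 27,
  "Heart Disease",      60,
  "Heart Diease",       58
)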

Standardization by similarity matching

The most straightforward way to standardize a character vector is to use string distance, a measure of how different two strings are. The method I tend to use is Jaro–Winkler distance. The distance returned is normalized so that a score of 0 represents an exact match between two strings and a 1 represents maximal difference. In R, the main package for calculating string distances is the stringdist package.
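To get a feel for the measure, here is a quick check on a couple of the disease names (the values agree with the full distance matrix shown later in this post):

# Jaro–Winkler distance: 0 = identical, 1 = maximally different
round(stringdist::stringdist("Influenza", "Inflenza", method = "jw"), 3)
## [1] 0.037
round(stringdist::stringdist("Influenza", "Malaria", method = "jw"), 3)
## [1] 0.497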

For this use case we need an algorithm that moves sequentially down the vector, finds the first earlier element within a pre-specified string distance, and adopts it as the new standard; graphically, it walks the vector from top to bottom, folding each element into the set of standards seen so far. To walk down the vector we could either use for loops or some kind of recursive function. Fortunately, there is a function in the purrr package that is perfect for the job: reduce! To use either approach, we first need to define the function to apply between the accumulated vector and each individual element. I'm going to call the function fuzzy_match, though that name may already be in use elsewhere.

# This is pretty cool!
fuzzy_match <- function(vector, element, max_dist = 0.1) {

  # Candidate standards: everything accumulated so far that isn't
  # an exact copy of the incoming element
  key <- vector[vector != element]
  dist <- stringdist::stringdist(element, key, method = "jw")

  # Guard against NULL/NA elements and empty candidate sets; if the
  # closest candidate is within max_dist, adopt it as the standard
  if (!is.null(element) && !is.na(element) &&
      length(dist) > 0 && any(!is.na(dist)) &&
      min(dist, na.rm = TRUE) < max_dist) {
    return(c(vector, key[which.min(dist)]))
  }

  # If no match within max_dist is found, keep the original element
  c(vector, element)
}
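
To see what a single step does, here are a couple of calls by hand; the accumulated vector is the first argument and the incoming element is the second:

# "Inflenza" is within 0.1 of "Influenza", so the standard spelling is adopted
fuzzy_match(c("Influenza"), "Inflenza")
## [1] "Influenza" "Influenza"

# Nothing accumulated so far is close to "COVID-19", so it is kept as-is
fuzzy_match(c("Influenza", "Influenza"), "COVID-19")
## [1] "Influenza" "Influenza" "COVID-19"

reduce simply chains these steps across the whole vector.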

Now we just need to invoke it with the reduce function, like so:

# Slick, right?
outbreak_data <- outbreak_data |> 
  mutate(Similarity = reduce(disease_name, fuzzy_match, max_dist = 0.1))

which produces the following:

| disease_name       | Similarity         | outbreaks |
|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          |        10 |
| COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            |        20 |
| Maleria            | Malaria            |        22 |
| malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                |        70 |
| AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease |        27 |
| Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      |        58 |

Notice how not everything was standardized? That's because we chose a relatively stringent maximum string distance. Let's loosen it to max_dist = 0.3 and compare.
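A sketch of that second pass, storing the result in its own column:

outbreak_data <- outbreak_data |>
  mutate(Similarity_0.3 = reduce(disease_name, fuzzy_match, max_dist = 0.3))

which produces the following: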

| disease_name       | Similarity         | Similarity_0.3     | outbreaks |
|:-------------------|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          | Influenza          |        10 |
| COVID-19           | COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            | Malaria            |        20 |
| Maleria            | Malaria            | Malaria            |        22 |
| malaria            | Malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                | HIV/AIDS           |        70 |
| AIDS               | AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers         |        27 |
| Heart Disease      | Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      | Heart Disease      |        58 |

The weakness of this approach is that it is not always clear what the best maximum distance is. If you set it high enough you might match difficult cases, such as COVID-19 and sars-covid-19, but you run the risk of changing something too far, like matching Heart Disease to Malaria. It isn't always clear where that border should be drawn. Another problem is that this algorithm simply adopts the first match it encounters, not the best match. It could just as easily settle on converting everything to Maleria instead of Malaria; it doesn't know which of two strings is better, only how different they are from each other.
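One hypothetical mitigation, which I won't pursue further in this post: instead of adopting the first match encountered, standardize each element to the most frequent spelling among all variants within max_dist, so the data itself votes on the standard. A rough sketch (fuzzy_match_freq is my own illustrative name, not something from stringdist):

# Hypothetical variant: map each element to the most common spelling
# found within max_dist of it (ties go to the first name in the table)
fuzzy_match_freq <- function(x, max_dist = 0.1) {
  counts <- table(x)
  vapply(x, function(el) {
    d <- stringdist::stringdist(el, names(counts), method = "jw")
    close <- names(counts)[d < max_dist]  # always includes el itself (d = 0)
    close[which.max(counts[close])]
  }, character(1), USE.NAMES = FALSE)
}

Ties still have to break somehow, so this doesn't fully solve the problem, but at least a misspelling can no longer win just by appearing first.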

Standardization by hierarchical clustering

Another way to use string distances to group like elements is hierarchical clustering. Agglomerative hierarchical clustering, the kind implemented by R's hclust(), iteratively merges elements into progressively larger clusters based on some measure of distance. The method operates on a matrix of pairwise distances, and looking at that matrix can give you a sense of why some of the disease names were or were not standardized by the approach above.

dist_matrix <- stringdist::stringdistmatrix(outbreak_data$disease_name, method = "jw")
mat <- dist_matrix |> as.matrix()
rownames(mat) <- outbreak_data$disease_name
colnames(mat) <- outbreak_data$disease_name
knitr::kable(round(mat, 3))

|                    | Influenza | Inflenza | COVID-19 | sars-covid-19 | Malaria | Maleria | malaria | Diabetes | Diabetis | HIV/AIDS | HIV | AIDS | Tuberculosis | Tuberclosis | Alzheimers | Alzheimers Disease | Heart Disease | Heart Diease |
|:-------------------|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Influenza          | 0.000 | 0.037 | 0.588 | 1.000 | 0.497 | 0.413 | 0.497 | 0.588 | 0.588 | 0.588 | 0.519 | 0.546 | 0.639 | 0.631 | 0.526 | 0.528 | 0.541 | 0.537 |
| Inflenza           | 0.037 | 0.000 | 0.583 | 1.000 | 0.488 | 0.399 | 0.488 | 0.583 | 0.583 | 0.583 | 0.514 | 0.542 | 0.694 | 0.689 | 0.553 | 0.509 | 0.532 | 0.528 |
| COVID-19           | 0.588 | 0.583 | 0.000 | 0.465 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.528 | 0.528 | 0.417 | 1.000 | 1.000 | 1.000 | 0.606 | 0.599 | 0.597 |
| sars-covid-19      | 1.000 | 1.000 | 0.465 | 0.000 | 0.447 | 0.447 | 0.447 | 0.532 | 0.576 | 1.000 | 1.000 | 1.000 | 0.453 | 0.487 | 0.608 | 0.701 | 0.545 | 0.506 |
| Malaria            | 0.497 | 0.488 | 1.000 | 0.447 | 0.000 | 0.095 | 0.095 | 0.577 | 0.488 | 1.000 | 1.000 | 1.000 | 0.552 | 0.544 | 0.535 | 0.579 | 0.374 | 0.365 |
| Maleria            | 0.413 | 0.399 | 1.000 | 0.447 | 0.095 | 0.000 | 0.190 | 0.488 | 0.399 | 1.000 | 1.000 | 1.000 | 0.490 | 0.480 | 0.426 | 0.485 | 0.367 | 0.356 |
| malaria            | 0.497 | 0.488 | 1.000 | 0.447 | 0.095 | 0.190 | 0.000 | 0.577 | 0.488 | 1.000 | 1.000 | 1.000 | 0.552 | 0.544 | 0.535 | 0.569 | 0.374 | 0.365 |
| Diabetes           | 0.588 | 0.583 | 1.000 | 0.532 | 0.577 | 0.488 | 0.577 | 0.000 | 0.083 | 1.000 | 1.000 | 0.542 | 0.458 | 0.451 | 0.450 | 0.509 | 0.463 | 0.386 |
| Diabetis           | 0.588 | 0.583 | 1.000 | 0.576 | 0.488 | 0.399 | 0.488 | 0.083 | 0.000 | 1.000 | 1.000 | 0.542 | 0.472 | 0.462 | 0.553 | 0.593 | 0.397 | 0.386 |
| HIV/AIDS           | 0.588 | 0.583 | 0.528 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.208 | 0.542 | 1.000 | 1.000 | 0.592 | 0.546 | 0.532 | 0.528 |
| HIV                | 0.519 | 0.514 | 0.528 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.208 | 0.000 | 0.472 | 1.000 | 1.000 | 1.000 | 1.000 | 0.530 | 0.528 |
| AIDS               | 0.546 | 0.542 | 0.417 | 1.000 | 1.000 | 1.000 | 1.000 | 0.542 | 0.542 | 0.542 | 0.472 | 0.000 | 1.000 | 1.000 | 0.550 | 0.565 | 0.558 | 0.556 |
| Tuberculosis       | 0.639 | 0.694 | 1.000 | 0.453 | 0.552 | 0.490 | 0.552 | 0.458 | 0.472 | 1.000 | 1.000 | 1.000 | 0.000 | 0.028 | 0.547 | 0.528 | 0.466 | 0.528 |
| Tuberclosis        | 0.631 | 0.689 | 1.000 | 0.487 | 0.544 | 0.480 | 0.544 | 0.451 | 0.462 | 1.000 | 1.000 | 1.000 | 0.028 | 0.000 | 0.537 | 0.513 | 0.454 | 0.518 |
| Alzheimers         | 0.526 | 0.553 | 1.000 | 0.608 | 0.535 | 0.426 | 0.535 | 0.450 | 0.553 | 0.592 | 1.000 | 0.550 | 0.547 | 0.537 | 0.000 | 0.148 | 0.505 | 0.461 |
| Alzheimers Disease | 0.528 | 0.509 | 0.606 | 0.701 | 0.579 | 0.485 | 0.569 | 0.509 | 0.593 | 0.546 | 1.000 | 0.565 | 0.528 | 0.513 | 0.148 | 0.000 | 0.375 | 0.398 |
| Heart Disease      | 0.541 | 0.532 | 0.599 | 0.545 | 0.374 | 0.367 | 0.374 | 0.463 | 0.397 | 0.532 | 0.530 | 0.558 | 0.466 | 0.454 | 0.505 | 0.375 | 0.000 | 0.067 |
| Heart Diease       | 0.537 | 0.528 | 0.597 | 0.506 | 0.365 | 0.356 | 0.365 | 0.386 | 0.386 | 0.528 | 0.528 | 0.556 | 0.528 | 0.518 | 0.461 | 0.398 | 0.067 | 0.000 |
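
Pulling a couple of entries out of the matrix shows the earlier results directly: COVID-19 and sars-covid-19 sit too far apart even for max_dist = 0.3, while the Heart Disease misspelling is an easy match.

round(mat["COVID-19", "sars-covid-19"], 3)
## [1] 0.465
round(mat["Heart Disease", "Heart Diease"], 3)
## [1] 0.067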

Now that we have a matrix containing the distances between every pair of elements, we need to group them. For that we can use the hclust() function from R's built-in stats package.

hc <- hclust(dist_matrix, "ward.D2")
hc$labels <- outbreak_data$disease_name
ggdendro::ggdendrogram(hc)

There are several agglomeration methods to choose from. The one used here is ward.D2, which minimizes the total within-cluster variance. Another alternative for this task might be average linkage, which merges clusters based on the average distance between their points. Unfortunately, as with the similarity method, a choice must be made for a cutoff threshold, this time based on tree height instead of maximum string distance.
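Since the right height isn't obvious up front, one quick probe is to count how many clusters a few candidate heights produce and pick one that lands near the number of distinct diseases you'd expect by eye (roughly a dozen here). A sketch:

# How many clusters does each candidate cut height yield?
sapply(c(0.1, 0.2, 0.3, 0.4), function(h) length(unique(cutree(hc, h = h))))

Here I'll cut at h = 0.2 and take the first name in each cluster as its standard.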

clusters <- cutree(hc, h = 0.2) |> 
  enframe(name = "disease_name", value = "cluster") |> 
  group_by(cluster) |>
  mutate(HC = disease_name[1]) |> 
  ungroup()

outbreak_data <- outbreak_data |> left_join(clusters |> select(-cluster))
## Joining with `by = join_by(disease_name)`
knitr::kable(outbreak_data |> select(-outbreaks, everything(), outbreaks))

| disease_name       | Similarity         | Similarity_0.3     | HC                 | outbreaks |
|:-------------------|:-------------------|:-------------------|:-------------------|----------:|
| Influenza          | Influenza          | Influenza          | Influenza          |        15 |
| Inflenza           | Influenza          | Influenza          | Influenza          |        10 |
| COVID-19           | COVID-19           | COVID-19           | COVID-19           |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19      | sars-covid-19      |        90 |
| Malaria            | Malaria            | Malaria            | Malaria            |        20 |
| Maleria            | Malaria            | Malaria            | Malaria            |        22 |
| malaria            | Malaria            | Malaria            | Malaria            |        20 |
| Diabetes           | Diabetes           | Diabetes           | Diabetes           |        30 |
| Diabetis           | Diabetes           | Diabetes           | Diabetes           |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS           | HIV/AIDS           |        75 |
| HIV                | HIV                | HIV/AIDS           | HIV                |        70 |
| AIDS               | AIDS               | AIDS               | AIDS               |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis       | Tuberculosis       |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis       | Tuberculosis       |        38 |
| Alzheimers         | Alzheimers         | Alzheimers         | Alzheimers         |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers         | Alzheimers         |        27 |
| Heart Disease      | Heart Disease      | Heart Disease      | Heart Disease      |        60 |
| Heart Diease       | Heart Disease      | Heart Disease      | Heart Disease      |        58 |

Standardization by natural language processing (NLP)

Finally, we can turn to natural language processing to standardize things for us. This approach has both advantages and disadvantages. One advantage is that large language models like ChatGPT may have an idea of which of two strings is better, given the context. That's a pretty big bonus. A disadvantage is that the results you get from these models are not as reproducible as those from a purely rules-based, algorithmic approach. These pros and cons represent a trade-off: in order to benefit from the model's knowledge of proper nomenclature, you give up a clear understanding of why each element was standardized the way it was.

There is also an element of trust that the results returned were not altered or biased in any way by the model. This isn't an issue with small datasets, but it could be a real concern when dealing with tables with millions of rows, where it can be very difficult to tell whether the model happens to be biased against some of the results. Perhaps it is much more stringent about matching names starting with the letter 'J' than those that start with 'S'. Who knows what evil lurks in the hearts of LLMs? The Shadow knows.

To perform vector harmonization using NLP in R we can turn to the openai package, which is a wrapper around OpenAI's API. The next step is to acquire an OpenAI API key, available from your OpenAI account settings. The key needs to be loaded as an environment variable, either with Sys.setenv(OPENAI_API_KEY=…) or via a project .env file. If you go the .env route, it is worth looking into git-crypt to secure the file, and make sure to do so before you add your key!
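For example (the key below is a placeholder, and dotenv is one package that can load a .env file into the environment):

# Register the key for the current R session
Sys.setenv(OPENAI_API_KEY = "<your-api-key>")

# Or load it from a project .env file
dotenv::load_dot_env(file = ".env")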

Once the API key is set, we still have a few more things to set up: choosing which model to use, telling the model how it should act, and formatting our query appropriately. In R this involves generating a nested list of messages.

messages <- list(
  list("role" = "system",
       "content" = "you act as a function that standardizes a provided vector and returns a vector of equal length formatted as a JSON object."),
  list("role" = "user",
       "content" = "The following vector contains disease names. Please alter the following disease names to remove minor errors and formatting inconsistencies and to standardize on the appropriate disease name."),
  list("role" = "user",
       "content" = paste(outbreak_data$disease_name, collapse = ","))
)

Now we can submit our query, extract the data from the response, and add a column of standardized disease names. Sometimes, however, the model will return malformed JSON, or will reply with a paragraph of text instead of a structured object, and then the whole thing won't work. It's also not easy to tell the model how aggressive it should be in standardizing. But hey, if it doesn't work we can always re-submit the query until it does!

response <- create_chat_completion(
    model = "gpt-3.5-turbo",
    messages = messages) |>
  bind_cols() 

# If the model returns a malformed response, just keep trying until it gets it right.
for (try in 1:100) {
  tryCatch({
    cat(paste("Attempt:", try, "\n"))
    nlp_standardized <- response$message.content |>
      jsonlite::fromJSON() |>
      unlist()
    outbreak_data <- outbreak_data |> mutate(NLP = nlp_standardized)
    cat("Success!\n")
    break
  }, error = function(e) {
    cat("Error encountered:", conditionMessage(e), "\nRetrying...\n")
    # <<- so the fresh response is visible on the next loop iteration
    response <<- create_chat_completion(
      model = "gpt-3.5-turbo",
      messages = messages) |>
      bind_cols()
  })
}
## Attempt: 1 
## Success!
knitr::kable(outbreak_data |> select(-outbreaks, everything(), outbreaks))

| disease_name       | Similarity         | Similarity_0.3 | HC            | NLP                 | outbreaks |
|:-------------------|:-------------------|:---------------|:--------------|:--------------------|----------:|
| Influenza          | Influenza          | Influenza      | Influenza     | Influenza           |        15 |
| Inflenza           | Influenza          | Influenza      | Influenza     | Influenza           |        10 |
| COVID-19           | COVID-19           | COVID-19       | COVID-19      | COVID-19            |        95 |
| sars-covid-19      | sars-covid-19      | sars-covid-19  | sars-covid-19 | COVID-19            |        90 |
| Malaria            | Malaria            | Malaria        | Malaria       | Malaria             |        20 |
| Maleria            | Malaria            | Malaria        | Malaria       | Malaria             |        22 |
| malaria            | Malaria            | Malaria        | Malaria       | Malaria             |        20 |
| Diabetes           | Diabetes           | Diabetes       | Diabetes      | Diabetes            |        30 |
| Diabetis           | Diabetes           | Diabetes       | Diabetes      | Diabetes            |        28 |
| HIV/AIDS           | HIV/AIDS           | HIV/AIDS       | HIV/AIDS      | HIV/AIDS            |        75 |
| HIV                | HIV                | HIV/AIDS       | HIV           | HIV/AIDS            |        70 |
| AIDS               | AIDS               | AIDS           | AIDS          | HIV/AIDS            |        65 |
| Tuberculosis       | Tuberculosis       | Tuberculosis   | Tuberculosis  | Tuberculosis        |        40 |
| Tuberclosis        | Tuberculosis       | Tuberculosis   | Tuberculosis  | Tuberculosis        |        38 |
| Alzheimers         | Alzheimers         | Alzheimers     | Alzheimers    | Alzheimer’s Disease |        23 |
| Alzheimers Disease | Alzheimers Disease | Alzheimers     | Alzheimers    | Alzheimer’s Disease |        27 |
| Heart Disease      | Heart Disease      | Heart Disease  | Heart Disease | Heart Disease       |        60 |
| Heart Diease       | Heart Disease      | Heart Disease  | Heart Disease | Heart Disease       |        58 |