How to Calculate Levenshtein Distance in R (With Examples)


The Levenshtein distance between two strings is the minimum number of single-character edits required to turn one word into the other.

The word “edits” includes substitutions, insertions, and deletions.

For example, suppose we have the following two words:

  • PARTY
  • PARK

The Levenshtein distance between the two words (i.e. the number of edits needed to turn one word into the other) is 2: substitute K for T, then delete Y.

[Figure: Levenshtein distance example]
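To make the definition concrete, here is a minimal base R sketch of the standard dynamic-programming approach to this distance. The function name lev_dp is hypothetical and used only for illustration. It is not part of any package, and it is not necessarily how stringdist computes the result internally.

```r
#hypothetical helper: Levenshtein distance via dynamic programming
lev_dp <- function(s, t) {
  #split each string into a vector of single characters
  s_chars <- strsplit(s, "")[[1]]
  t_chars <- strsplit(t, "")[[1]]
  n <- length(s_chars)
  m <- length(t_chars)

  #d[i + 1, j + 1] holds the distance between the first i characters of s
  #and the first j characters of t
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n   #turn i characters into an empty string: i deletions
  d[1, ] <- 0:m   #turn an empty string into j characters: j insertions

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      #a substitution costs nothing when the characters already match
      cost <- if (s_chars[i] == t_chars[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   #deletion
                             d[i + 1, j] + 1,   #insertion
                             d[i, j] + cost)    #substitution or match
    }
  }

  #bottom-right cell is the distance between the full strings
  d[n + 1, m + 1]
}

#distance between the two example words
lev_dp("PARTY", "PARK")

[1] 2
```

The result agrees with the two edits counted by hand for PARTY and PARK.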

In practice, the Levenshtein distance is used in many different applications including approximate string matching, spell-checking, and natural language processing.

This tutorial explains how to calculate the Levenshtein distance between strings in R by using the stringdist() function from the stringdist package.

This function uses the following basic syntax:

#load stringdist package
library(stringdist)

#calculate Levenshtein distance between two strings
stringdist("string1", "string2", method = "lv")

Note that this function can calculate many different distance metrics. By specifying method = "lv", we tell the function to calculate the Levenshtein distance.
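As a brief illustration of the method argument, the sketch below runs the same pair of strings through a few of the other method codes that stringdist documents ("osa", "dl" and "lcs"). It assumes the stringdist package is installed. The variable name methods is chosen here for illustration.

```r
#load stringdist package
library(stringdist)

#a few of the method codes accepted by stringdist()
methods <- c("lv", "osa", "dl", "lcs")

#compute the distance between the same two strings under each method
sapply(methods, function(m) stringdist("party", "park", method = m))
```

The result is a named vector with one distance per method, which makes it easy to see where the metrics agree and where they differ for a given pair of strings.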

Example 1: Levenshtein Distance Between Two Strings

The following code shows how to calculate the Levenshtein distance between the two strings “party” and “park” using the stringdist() function:

#load stringdist package
library(stringdist)

#calculate Levenshtein distance between two strings
stringdist('party', 'park', method = 'lv')

[1] 2

The Levenshtein distance turns out to be 2.
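As a cross-check that does not depend on the stringdist package, base R's adist() function (from the utils package, which is attached by default) computes a generalized Levenshtein distance and returns it as a matrix. The sketch below also shows that comparisons are case-sensitive unless ignore.case = TRUE is set.

```r
#base R alternative: returns a 1 x 1 matrix containing 2
adist("party", "park")

#case matters by default: 'P' and 'p' count as different characters
adist("Party", "party")

#ignore case when comparing
adist("Party", "party", ignore.case = TRUE)
```

The first call matches the result from stringdist(), the second returns 1, and the third returns 0.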

Example 2: Levenshtein Distance Between Two Vectors

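The following code shows how to calculate the Levenshtein distance between corresponding elements of two vectors (the first element of one vector is compared with the first element of the other, and so on). To compare every element of one vector with every element of the other, use the stringdistmatrix() function from the same package instead.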

#load stringdist package
library(stringdist)

#define vectors
a <- c('Mavs', 'Spurs', 'Lakers', 'Cavs')
b <- c('Rockets', 'Pacers', 'Warriors', 'Celtics')

#calculate Levenshtein distance between two vectors
stringdist(a, b, method='lv')

[1] 6 4 5 5

The way to interpret the output is as follows:

  • The Levenshtein distance between ‘Mavs’ and ‘Rockets’ is 6.
  • The Levenshtein distance between ‘Spurs’ and ‘Pacers’ is 4.
  • The Levenshtein distance between ‘Lakers’ and ‘Warriors’ is 5.
  • The Levenshtein distance between ‘Cavs’ and ‘Celtics’ is 5.
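Because stringdist() compares the vectors element by element, it never compares a pairing such as ‘Mavs’ and ‘Pacers’. A minimal sketch of computing all 16 pairings with stringdistmatrix(), assuming the stringdist package is installed (the variable name d is chosen here for illustration):

```r
#load stringdist package
library(stringdist)

#define vectors
a <- c('Mavs', 'Spurs', 'Lakers', 'Cavs')
b <- c('Rockets', 'Pacers', 'Warriors', 'Celtics')

#calculate Levenshtein distance for every pairing of a (rows) and b (columns)
d <- stringdistmatrix(a, b, method='lv', useNames='strings')
d

#the diagonal holds the element-wise distances returned by stringdist(a, b)
diag(d)
```

The diagonal of this 4 x 4 matrix reproduces the element-wise result 6 4 5 5 shown above.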

Example 3: Levenshtein Distance Between Data Frame Columns

The following code shows how to calculate the Levenshtein distance between the strings in two columns of a data frame, row by row:

#load stringdist package
library(stringdist)

#define data
data <- data.frame(a = c('Mavs', 'Spurs', 'Lakers', 'Cavs'),
                   b = c('Rockets', 'Pacers', 'Warriors', 'Celtics'))

#calculate Levenshtein distance
stringdist(data$a, data$b, method='lv')

[1] 6 4 5 5

We could then append the Levenshtein distance as a new column in the data frame if we’d like:

#save Levenshtein distance as vector
lev <- stringdist(data$a, data$b, method='lv')

#append Levenshtein distance as new column 
data$lev <- lev

#view data frame
data

       a        b lev
1   Mavs  Rockets   6
2  Spurs   Pacers   4
3 Lakers Warriors   5
4   Cavs  Celtics   5
