This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English version of the page.

Note: This page has been translated by MathWorks. Click here to see
To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

editDistance

Find edit distance between two strings or documents

Syntax

d = editDistance(str1,str2)
d = editDistance(document1,document2)
d = editDistance(___,Name,Value)

Description

example

d = editDistance(str1,str2) returns the lowest number of grapheme (Unicode term for human-perceived characters) insertions, deletions, and substitutions required to convert str1 to str2.

example

d = editDistance(document1,document2) returns the lowest number of token insertions, deletions, and substitutions required to convert document1 to document2.

example

d = editDistance(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

collapse all

Find the edit distance between the strings "Text analytics" and "Text analysis". The edit distance, by default, is the total number of grapheme insertions, deletions, and substitutions required to change one string to another.

str1 = "Text analytics";
str2 = "Text analysis";

Find the edit distance.

d = editDistance(str1,str2)
d = 2

This means changing the first string to the second requires two edits. For example:

  1. Substitution – Substitute the character "t" to an "s": "Text analytics" to "Text analysics".

  2. Deletion – Delete the character "c": "Text analysics" to "Text analysis".

Find the edit distance between two tokenized documents. For tokenized document input, the edit distance, by default, is the total number of token insertions, deletions, and substitutions required to change one document to another.

str1 = "It's time for breakfast.";
document1 = tokenizedDocument(str1);

str2 = "It's now time to sleep.";
document2 = tokenizedDocument(str2);

Find the edit distance.

d = editDistance(document1,document2)
d = 3

This means changing the first document to the second requires three edits. For example:

  1. Insertion – Insert the word "now".

  2. Substitution – Substitute the word "for" with "to".

  3. Substitution – Substitute the word "breakfast" with "sleep".

The editDistance function, by default, returns the lowest number of grapheme insertions, deletions, and substitutions required to change one string to another. To also include the swap action in the calculation, use the 'SwapCost' option.

First, find the edit distance between the strings "MATALB" and "MATLAB".

str1 = "MATALB";
str2 = "MATLAB";
d = editDistance(str1,str2)
d = 2

One possible edit is:

  1. Substitute the second "A" with "L": ("MATALB" to "MATLLB").

  2. Substitute the second "L" with "A": ("MATLLB" to "MATLAB").

The default value for the swap cost (the cost of swapping two adjacent graphemes) is Inf. This means that swaps do not count towards the edit distance. To include swaps, set the 'SwapCost' option to 1.

d = editDistance(str1,str2,'SwapCost',1)
d = 1

This means there is one action. For example, swap the adjacent characters "A" and "L".

To compute the edit distance between two words and specify that the edits are case-insensitive, specify a custom substitute cost function.

First, compute the edit distance between the strings "MATLAB" and "MathWorks".

d = editDistance("MATLAB","MathWorks")
d = 8

This means changing the first string to the second requires eight edits. For example:

  1. Substitution – Substitute the character "A" with "a". ("MATLAB" to "MaTLAB")

  2. Substitution – Substitute the character "T" with "t". ("MaTLAB" to "MatLAB")

  3. Substitution – Substitute the character "L" with "h". ("MatLAB" to "MathAB")

  4. Substitution – Substitute the character "A" with "W". ("MathAB" to "MathWB")

  5. Substitution – Substitute the character "B" with "o". ("MathWB" to "MathWo")

  6. Insertion – Insert the character "r". ("MathWo" to "MathWor")

  7. Insertion – Insert the character "k". ("MathWor" to "MathWork")

  8. Insertion – Insert the character "s". ("MathWork" to "MathWorks")

Compute the edit distance and specify the custom substitution cost function caseInsensitiveSubstituteCost, listed at the end of the example. The custom function caseInsensitiveSubstituteCost returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.

d = editDistance("MATLAB","MathWorks",'SubstituteCost',@caseInsensitiveSubstituteCost)
d = 6

This means the total cost for changing the first string to the second is 6. For example:

  1. Substitution (cost 0) – Substitute the character "A" with "a". ("MATLAB" to "MaTLAB")

  2. Substitution (cost 0) – Substitute the character "T" with "t". ("MaTLAB" to "MatLAB")

  3. Substitution (cost 1) – Substitute the character "L" with "h". ("MatLAB" to "MathAB")

  4. Substitution (cost 1) – Substitute the character "A" with "W". ("MathAB" to "MathWB")

  5. Substitution (cost 1) – Substitute the character "B" with "o". ("MathWB" to "MathWo")

  6. Insert (cost 1) – Insert the character "r". ("MathWo" to "MathWor")

  7. Insert (cost 1) – Insert the character "k". ("MathWor" to "MathWork")

  8. Insert (cost 1) – Insert the character "s". ("MathWork" to "MathWorks")

Custom Cost Function

The custom function caseInsensitiveSubstituteCost returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.

function cost = caseInsensitiveSubstituteCost(grapheme1,grapheme2)

if lower(grapheme1) == lower(grapheme2)
    cost = 0;
else
    cost = 1;
end

end

Input Arguments

collapse all

Source string, specified as a string array, character vector, or a cell array of character vectors.

If str1 contains multiple strings, then str2 must be the same size as str1 or scalar.

Data Types: char | string | cell

Target string, specified as a string array, character vector, or a cell array of character vectors.

If str2 contains multiple strings, then str1 must be the same size as str2 or scalar.

Data Types: char | string | cell

Source document, specified as a tokenizedDocument array.

If document1 contains multiple documents, then document2 must be the same size as document1 or scalar.

Target document, specified as a tokenizedDocument array.

If document2 contains multiple documents, then document1 must be the same size as document2 or scalar.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: editDistance("MATALB","MATLAB",'SwapCost',1) returns the edit distance between the strings "MATALB" and "MATLAB" and sets the cost to swap two adjacent graphemes to 1.

Cost to insert a grapheme or token, specified as the comma-separated pair consisting of 'InsertCost' and a nonnegative scalar or a function handle.

If 'InsertCost' is a function handle, then the function must accept a single input and return the cost of inserting the input to the source. For example:

  • For string input to editDistance, the cost function must have the form cost = func(grapheme), where the function returns the cost of inserting grapheme into str1.

  • For document input to editDistance, the cost function must have the form cost = func(token), where the function returns the cost of inserting token into document1.

Example: 'InsertCost',2

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle

Cost to delete grapheme or token, specified as the comma-separated pair consisting of 'DeleteCost' and a nonnegative scalar or a function handle.

If 'DeleteCost' is a function handle, then the function must accept a single input and return the cost of deleting the input from the source. For example:

  • For string input to editDistance, the cost function must have the form cost = func(grapheme), where the function returns the cost of deleting grapheme from str1.

  • For document input to editDistance, the cost function must have the form cost = func(token), where the function returns the cost of deleting token from document1.

Example: 'DeleteCost',2

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle

Cost to substitute a grapheme or token, specified as the comma-separated pair consisting of 'SubstituteCost' and a nonnegative scalar or a function handle.

If 'SubstituteCost' is a function handle, then the function must accept exactly two inputs and return the cost of substituting the first input with the second in the source. For example:

  • For string input to editDistance, the cost function must have the form cost = func(grapheme1,grapheme2), where the function returns the cost of substituting grapheme1 with grapheme2 in str1.

  • For document input to editDistance, the cost function must have the form cost = func(token1,token2), where the function returns the cost of substituting token1 with token2 in document1.

Example: 'SubstituteCost',2

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle

Cost to swap two adjacent graphemes or tokens, specified as the comma-separated pair consisting of 'SwapCost' and a nonnegative scalar or a function handle.

If 'SwapCost' is a function handle, then the function must accept exactly two inputs and return the cost of swapping the first input with the second in the source. For example:

  • For string input to editDistance, the cost function must have the form cost = func(grapheme1,grapheme2), where the function returns the cost of swapping the adjacent graphemes grapheme1 and grapheme2 in str1.

  • For document input to editDistance, the cost function must have the form cost = func(token1,token2), where the function returns the cost of swapping the adjacent tokens token1 and token2 in document1.

Example: 'SwapCost',2

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | function_handle

Output Arguments

collapse all

Edit distance, returned as a nonnegative scalar.

Algorithms

collapse all

Edit Distance

The function, by default, uses the Levenshtein distance: the lowest number of insertions, deletions, and substitutions required to convert one string to another.

For other commonly used edit distances, use these options:

DistanceDescriptionOptions
Levenshtein (default)lowest number of insertions, deletions, and substitutionsDefault
Damerau-Levenshteinlowest number of insertions, deletions, substitutions, and swaps'SwapCost',1
Hamminglowest number of substitutions only'InsertCost',Inf,'DeleteCost',Inf

Introduced in R2019a