replaceNgrams

Replace n-grams in documents

Syntax

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)

Description

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams) updates the specified documents by replacing the n-grams in oldNgrams with the corresponding n-grams in newNgrams.

Examples

Use the replaceNgrams function to replace abbreviations with their corresponding expanded forms.

Create an array of tokenized documents.

str = [ ...
    "Currently in Cambridge, MA."
    "Next stop, NY!"];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    6 tokens: Currently in Cambridge , MA .
    5 tokens: Next stop , NY !

Replace the tokens "MA" and "NY" with "Massachusetts" and ["New" "York"], respectively. If the n-grams have different lengths, you must pad the shorter rows with the empty string "". In this case, you must pad "Massachusetts" with a single empty string "".

oldNgrams = [
    "MA"
    "NY"];
newNgrams = [
    "Massachusetts" ""
    "New" "York"];
documents = replaceNgrams(documents,oldNgrams,newNgrams)
documents = 
  2x1 tokenizedDocument:

    6 tokens: Currently in Cambridge , Massachusetts .
    6 tokens: Next stop , New York !
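
The same padding rule applies when the n-grams to be replaced have different lengths. The following is a minimal sketch of the reverse operation, collapsing expanded names back into abbreviations. It assumes Text Analytics Toolbox is installed; the variable name abbreviated is illustrative only.

```matlab
% Sketch: replace a unigram and a bigram with single-token abbreviations.
str = [ ...
    "Currently in Cambridge, Massachusetts."
    "Next stop, New York!"];
documents = tokenizedDocument(str);

% The old n-grams have different lengths, so the shorter row is padded with "".
oldNgrams = [
    "Massachusetts" ""
    "New" "York"];

% The replacements are all unigrams, so no padding is needed here.
newNgrams = [
    "MA"
    "NY"];

abbreviated = replaceNgrams(documents,oldNgrams,newNgrams);

% Inspect the tokens of each document as string arrays.
tokens1 = string(abbreviated(1))
tokens2 = string(abbreviated(2))
```

Note that oldNgrams and newNgrams need the same number of rows here, but not the same number of columns.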

Input Arguments

documents

Input documents, specified as a tokenizedDocument array.

oldNgrams

N-grams to replace, specified as a string array, character vector, or cell array of character vectors.

If oldNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If oldNgrams is a character vector, then it represents a single word (unigram).

The value of oldNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of oldNgrams must be padded with the empty string "".

For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".

Data Types: string | char | cell
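
When the n-grams come from a list of varying lengths, the padded NumNgrams-by-maxN array can be built programmatically rather than typed by hand. The following is one possible sketch using only base MATLAB functions; the names ngramList and padded are hypothetical and not part of the replaceNgrams interface.

```matlab
% Hypothetical input: each cell holds one n-gram as a string row vector.
ngramList = {"Massachusetts", ["New" "York"], ["New" "York" "City"]};

% maxN is the length of the longest n-gram.
maxN = max(cellfun(@numel,ngramList));

% strings(m,n) preallocates an m-by-n array in which every entry is "".
padded = strings(numel(ngramList),maxN);

for i = 1:numel(ngramList)
    ngram = ngramList{i};
    % Fill the leading entries of row i; the remaining entries stay "".
    padded(i,1:numel(ngram)) = ngram;
end

padded
```

The resulting 3-by-3 string array has the layout described above, so it can be passed as oldNgrams or newNgrams.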

newNgrams

New n-grams, specified as a string array, character vector, or cell array of character vectors.

If newNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If newNgrams is a character vector, then it represents a single word (unigram).

The value of newNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of newNgrams must be padded with the empty string "".

newNgrams must have one row, or the same number of rows as oldNgrams.

For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".

Data Types: string | char | cell
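
The one-row form of newNgrams can be sketched as follows. This example assumes that a single-row newNgrams is used as the replacement for every row of oldNgrams; the page states only that one row is allowed, so treat that behavior as an assumption and check the result on your own data. The sample sentence is made up for illustration.

```matlab
% Sketch: two different old n-grams, one shared replacement n-gram.
documents = tokenizedDocument("We flew from NY to NYC.");

oldNgrams = [
    "NY"
    "NYC"];

% A single row (one bigram). Assumed to apply to both rows of oldNgrams.
newNgrams = ["New" "York"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams);

% Inspect the resulting tokens.
tokens = string(newDocuments)
```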

Output Arguments

newDocuments

Output documents, returned as a tokenizedDocument array.
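
Because the result is returned as a new array, the input is left unchanged unless you assign the output back to the same variable. A short sketch of checking this, assuming Text Analytics Toolbox is installed (doclength returns the number of tokens in each document):

```matlab
% Sketch: compare the input and output arrays after a replacement.
str = [ ...
    "Currently in Cambridge, MA."
    "Next stop, NY!"];
documents = tokenizedDocument(str);

newDocuments = replaceNgrams(documents,"NY",["New" "York"]);

% The output has the same size as the input array.
sameSize = isequal(size(newDocuments),size(documents))

% Token counts per document: the second document gains one token
% because the unigram "NY" becomes the bigram "New" "York".
lengthsBefore = doclength(documents)
lengthsAfter = doclength(newDocuments)
```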

Introduced in R2019a