normalizeWords

Stem or lemmatize words

Description

Use normalizeWords to reduce words to a root form. To lemmatize English words (reduce them to their dictionary forms), set the 'Style' option to 'lemma'.

The function supports English, Japanese, German, and Korean text.

example

updatedDocuments = normalizeWords(documents) reduces the words in documents to a root form. For English and German text, the function, by default, stems the words using the Porter stemmer for English and German text respectively. For Japanese and Korean text, the function, by default, lemmatizes the words using the MeCab tokenizer.

example

updatedWords = normalizeWords(words) reduces each word in the string array words to a root form.

updatedWords = normalizeWords(words,'Language',language) reduces the words and also specifies the word language.

example

___ = normalizeWords(___,'Style',style) also specifies normalization style. For example, normalizeWords(documents,'Style','lemma') lemmatizes the words in the input documents.

Examples

collapse all

Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument([
    "a strongly worded collection of words"
    "another collection of words"]);
newDocuments = normalizeWords(documents)
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: a strongli word collect of word
    4 tokens: anoth collect of word

Stem the words in a string array using the Porter stemmer. Each element of the string array must be a single word.

words = ["a" "strongly" "worded" "collection" "of" "words"];
newWords = normalizeWords(words)
newWords = 1x6 string array
    "a"    "strongli"    "word"    "collect"    "of"    "word"

Lemmatize the words in a document array.

documents = tokenizedDocument([
    "I am building a house."
    "The building has two floors."]);
newDocuments = normalizeWords(documents,'Style','lemma')
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the build have two floor .

To improve the lemmatization, first add part-of-speech details to the documents using the addPartOfSpeechDetails function. For example, if the documents contain part-of-speech details, then normalizeWords reduces the only verb "building" and not the noun "building".

documents = addPartOfSpeechDetails(documents);
newDocuments = normalizeWords(documents,'Style','lemma')
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the building have two floor .

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = [
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"];
documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .

Input Arguments

collapse all

Input documents, specified as a tokenizedDocument array.

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell

Normalization style, specified as one of the following:

  • 'stem' – Stem words using the Porter stemmer. This option supports English and German text only. For English and German text, this value is the default.

  • 'lemma' – Extract the dictionary form of each word. This option supports English, Japanese, and Korean text only. If a word is not in the internal dictionary, then the function outputs the word unchanged. For English text, the output is lowercase. For Japanese and Korean text, this value is the default.

The function only normalizes tokens with type 'letters' and 'other'. For more information on token types, see tokenDetails.

Tip

For English text, to improve lemmatization of words in documents, first add part-of-speech details using the addPartOfSpeechDetails function.

Word language, specified as one of the following:

  • 'en' – English language

  • 'de' – German language

If you do not specify language, then the software detects the language automatically. To lemmatize Japanese or Korean text, use tokenizedDocument input.

Data Types: char | string

Output Arguments

collapse all

Updated documents, returned as a tokenizedDocument array.

Updated words, returned as a string array, character vector, or cell array of character vectors. words and updatedWords have the same data type.

Algorithms

collapse all

Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of normalizeWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' name-value pair argument of tokenizedDocument. To view the token details, use the tokenDetails function.

Compatibility Considerations

expand all

Behavior changed in R2018b

Introduced in R2017b