removeShortWords

Remove short words from documents or bag-of-words model

Description

newDocuments = removeShortWords(documents,len) removes words with len or fewer characters from documents.

newBag = removeShortWords(bag,len) removes words with len or fewer characters from the bagOfWords object bag.

Examples

Remove the words with two or fewer characters from a document.

document = tokenizedDocument("An example of a short sentence");
newDocument = removeShortWords(document,2)
newDocument = 
  tokenizedDocument:

   3 tokens: example short sentence

Remove the words with two or fewer characters from a bag-of-words model.

documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeShortWords(bag,2)
newBag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
        NumWords: 4
    NumDocuments: 2
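
One way to sanity-check a result like this is to compare the vocabulary before and after the call. The following is a minimal sketch that rebuilds the bag from the example above so it runs on its own; the variable name removedWords is illustrative and not part of the function's interface.

```matlab
% Rebuild the bag from the example above so this snippet is self-contained.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeShortWords(bag,2);

% Every remaining vocabulary entry should be longer than two characters.
assert(all(strlength(newBag.Vocabulary) > 2))

% List the words that were dropped from the vocabulary (illustrative name).
removedWords = setdiff(bag.Vocabulary,newBag.Vocabulary)
```

The original bag is left unchanged, which is why both vocabularies are still available for comparison.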

Input Arguments

documents: Input documents, specified as a tokenizedDocument array.

bag: Input bag-of-words model, specified as a bagOfWords object.

len: Maximum length of words to remove, specified as a positive integer. The function removes words with len or fewer characters.
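
The threshold is inclusive: a word with exactly len characters is removed. As a rough illustration of that rule only (this is not how removeShortWords is implemented, and the variable names words and kept are hypothetical), the same filter can be sketched on a plain string array with core MATLAB functions:

```matlab
% Illustrative only: mimic the "len or fewer characters" rule on a plain
% string array. This does not reflect the internals of removeShortWords.
words = ["An" "example" "of" "a" "short" "sentence"];
len = 2;

% Keep only the words that are strictly longer than len characters.
kept = words(strlength(words) > len)
```

Here "An" and "of" have exactly two characters, so they are dropped along with "a".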

Output Arguments

newDocuments: Output documents, returned as a tokenizedDocument array.

newBag: Output bag-of-words model, returned as a bagOfWords object.

Version History

Introduced in R2017b