Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and can contain substantial noise, which negatively affects statistical analysis. For example, text data can contain the following:

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words which add noise, for example stop words such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags
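
As a minimal sketch, the first and fourth kinds of noise can be removed directly from string data using the lower and erasePunctuation functions (the example string here is made up for illustration):

% Illustrative string, not from the weather reports data set.
str = "The tree is Down, the Road is Closed!";
str = lower(str);            % normalize case
str = erasePunctuation(str)  % "the tree is down the road is closed"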

These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and a preprocessed version of the same text data.

Load and Extract Text Data

Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event.

filename = "weatherReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the field event_narrative, and the label data from the field event_type.

textData = data.event_narrative;
labels = data.event_type;
textData(1:10)
ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."

Create Tokenized Documents

Create an array of tokenized documents.

cleanedDocuments = tokenizedDocument(textData);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     8 tokens: Large tree down between Plantersville and Nettleton .
    39 tokens: One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour . One vehicle was stalled in the water .
    14 tokens: NWS Columbia relayed a report of trees blown down along Tom Hall St .
    14 tokens: Media reported two trees blown down along I-40 in the Old Fort area .
     0 tokens:
    15 tokens: A few tree limbs greater than 6 inches down on HWY 18 in Roseland .
    20 tokens: Awning blown off a building on Lamar Avenue . Multiple trees down near the intersection of Winchester and Perkins .
     6 tokens: Quarter size hail near Rosemark .
    21 tokens: Tin roof ripped off house on Old Memphis Road near Billings Drive . Several large trees down in the area .
    10 tokens: Powerlines down at Walnut Grove and Cherry Lane roads .

To improve lemmatization, add part-of-speech details to the documents using addPartOfSpeechDetails. Use the addPartOfSpeechDetails function before removing stop words and lemmatizing.

cleanedDocuments = addPartOfSpeechDetails(cleanedDocuments);

Words like "a", "and", "to", and "the", known as stop words, can add noise to data. Remove them using the removeStopWords function. Use removeStopWords before using the normalizeWords function.

cleanedDocuments = removeStopWords(cleanedDocuments);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     6 tokens: Large tree down Plantersville Nettleton .
    20 tokens: two feet deep standing water developed street Winthrop University campus inch rain fell less hour . vehicle stalled water .
    11 tokens: NWS Columbia relayed report trees blown down Tom Hall St .
    11 tokens: Media reported two trees blown down I-40 Old Fort area .
     0 tokens:
    11 tokens: few tree limbs greater 6 inches down HWY 18 Roseland .
    15 tokens: Awning blown off building Lamar Avenue . Multiple trees down near intersection Winchester Perkins .
     6 tokens: Quarter size hail near Rosemark .
    18 tokens: Tin roof ripped off house Old Memphis Road near Billings Drive . Several large trees down area .
     8 tokens: Powerlines down Walnut Grove Cherry Lane roads .

Lemmatize the words using normalizeWords.

cleanedDocuments = normalizeWords(cleanedDocuments,'Style','lemma');
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     6 tokens: large tree down plantersville nettleton .
    20 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour . vehicle stall water .
    11 tokens: nws columbia relay report tree blow down tom hall st .
    11 tokens: medium report two tree blow down i-40 old fort area .
     0 tokens:
    11 tokens: few tree limb great 6 inch down hwy 18 roseland .
    15 tokens: awning blow off building lamar avenue . multiple tree down near intersection winchester perkins .
     6 tokens: quarter size hail near rosemark .
    18 tokens: tin roof rip off house old memphis road near billings drive . several large tree down area .
     8 tokens: powerlines down walnut grove cherry lane road .

Erase the punctuation from the documents.

cleanedDocuments = erasePunctuation(cleanedDocuments);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
    10 tokens: nws columbia relay report tree blow down tom hall st
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
    10 tokens: few tree limb great 6 inch down hwy 18 roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road

Remove words with 2 or fewer characters, and words with 15 or more characters.

cleanedDocuments = removeShortWords(cleanedDocuments,2);
cleanedDocuments = removeLongWords(cleanedDocuments,15);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
     9 tokens: nws columbia relay report tree blow down tom hall
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
     8 tokens: few tree limb great inch down hwy roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road

Create Bag-of-Words Model

Create a bag-of-words model.

cleanedBag = bagOfWords(cleanedDocuments)
cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×18478 double]
      Vocabulary: [1×18478 string]
        NumWords: 18478
    NumDocuments: 36176

Remove words that appear two or fewer times in the bag-of-words model.

cleanedBag = removeInfrequentWords(cleanedBag,2)
cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×6978 double]
      Vocabulary: [1×6978 string]
        NumWords: 6978
    NumDocuments: 36176

Some preprocessing steps, such as removeInfrequentWords, leave empty documents in the bag-of-words model. To ensure that no empty documents remain in the bag-of-words model after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and the corresponding labels from labels.

[cleanedBag,idx] = removeEmptyDocuments(cleanedBag);
labels(idx) = [];
cleanedBag
cleanedBag = 
  bagOfWords with properties:

          Counts: [28137×6978 double]
      Vocabulary: [1×6978 string]
        NumWords: 6978
    NumDocuments: 28137

Create a Preprocessing Function

It can be useful to create a function that performs preprocessing, so that you can prepare different collections of text data in the same way. For example, you can use a function to preprocess new data using the same steps as the training data.
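
For instance, assuming a later-collected string array of reports (here named newReports for illustration), calling the example function preprocessWeatherNarratives (defined below) on both collections keeps them consistent:

% newReports is a hypothetical string array of new weather reports.
documentsTrain = preprocessWeatherNarratives(textData);
documentsNew = preprocessWeatherNarratives(newReports);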

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessWeatherNarratives performs the following steps:

  1. Tokenize the text using tokenizedDocument.

  2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  3. Lemmatize the words using normalizeWords.

  4. Erase punctuation using erasePunctuation.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessWeatherNarratives to prepare the text data.

newText = "A tree is downed outside Apple Hill Drive, Natick";
newDocuments = preprocessWeatherNarratives(newText)
newDocuments = 
  tokenizedDocument:

   7 tokens: tree down outside apple hill drive natick

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)
rawBag = 
  bagOfWords with properties:

          Counts: [36176×23302 double]
      Vocabulary: [1×23302 string]
        NumWords: 23302
    NumDocuments: 36176

Calculate the reduction in data.

numWordsCleaned = cleanedBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsCleaned/numWordsRaw
reduction = 0.7005

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanedBag);
title("Cleaned Data")

Preprocessing Function

The function preprocessWeatherNarratives performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  3. Lemmatize the words using normalizeWords.

  4. Erase punctuation using erasePunctuation.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

function documents = preprocessWeatherNarratives(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
