Preprocess Text Data in Live Editor

Text data can be large and can contain a lot of noise that negatively affects statistical analysis. For example, text data can contain the following:

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words that add noise, for example "stop words" such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags
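As a rough illustration of how case variations and punctuation inflate the number of distinct words, the following sketch counts unique whitespace-separated words before and after lowercasing the text and erasing punctuation. The sample sentence is made up for illustration and is not taken from the factory reports data.

```matlab
% Hypothetical sample text (an assumption, not from the example data).
str = "The New mixer is new. The mixer walks; the mixers are walking!";

% Naive tokenization: split on whitespace only.
rawWords = split(str);
numRaw = numel(unique(rawWords));

% Light cleanup: convert to lowercase and erase punctuation, then split.
cleanWords = split(erasePunctuation(lower(str)));
numClean = numel(unique(cleanWords));

% Compare the number of distinct words before and after the cleanup.
fprintf("Distinct words: %d raw, %d after cleanup\n",numRaw,numClean)
```

Word forms such as "walks" and "walking" still count separately after this cleanup. Merging them requires word normalization, which the rest of this example covers.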

These word clouds illustrate word frequency analysis applied to some raw text data from factory reports, and a preprocessed version of the same text data.

Two word clouds showing words in different font sizes. Larger font sizes indicate more frequent words in the data. The word cloud on the left has title "Raw Data" and highlights punctuation and words like "the" and "in". Words like "assembler" and "mixer" have relatively small font size. The word cloud on the right has title "Clean Data" and highlights words like "assembler" and "mixer". Words like "the" and "in" do not appear in the word cloud.

Most workflows benefit from a preprocessing function that prepares different collections of text data in the same way. For example, when you train a model, you can use the same function to preprocess both the training data and any new data.

You can interactively preprocess text data using the Preprocess Text Data Live Editor task and visualize the results. This example uses the Preprocess Text Data Live Editor task to generate code that preprocesses text data and creates a function that you can reuse. For more information on Live Editor tasks, see Add Interactive Tasks to a Live Script.

Insert Task

The Preprocess Text Data Live Editor task supports preprocessing string arrays, character vectors, and tables.

First, load the factory reports data. The data contains textual descriptions of factory failure events.

tbl = readtable("factoryReports.csv")

A table with variables "Description", "Category", and "Urgency". The "Description" variable contains textual descriptions such as "Items are occasionally getting stuck in the scanner spools". The "Category" variable contains categorical labels such as "Mechanical Failure", and the "Urgency" variable contains categorical labels such as "Medium".
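If you prefer to work with string and categorical data, one possible variation is to import the text as strings and convert the label variables. This is a sketch that assumes factoryReports.csv is on the MATLAB path. The conversion is optional and is not required by the Live Editor task.

```matlab
% Import text as string arrays instead of cell arrays of character vectors.
tbl = readtable("factoryReports.csv",TextType="string");

% Treat the label variables as categorical data.
tbl.Category = categorical(tbl.Category);
tbl.Urgency = categorical(tbl.Urgency);

% Preview the first few rows and count the reports in each category.
head(tbl)
summary(tbl.Category)
```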

Open the Preprocess Text Data Live Editor task. To open the task, begin typing the task name and select Preprocess Text Data from the suggested command completions. Alternatively, on the Live Editor tab, select Task > Preprocess Text Data.

Drop down list showing suggested command completions. The only suggestion in the list is for the Preprocess Text Data task, and is selected.

In the task, set the Data option to tbl. Because the input is a table, set the second drop down box to Description.

Select Data

Select data using the Data option.

In the Data option, select tbl as the input data and select the table variable Description. To ensure that the task includes tbl in the drop down, run the script so that tbl is in the MATLAB® workspace.

Data options with drop down lists with "tbl" and "Description" selected

By default, the Preprocess Text Data task does not automatically run when you modify the task parameters. To have the task run automatically after any change, select the Autorun checkbox at the top-right of the task. If your data set is large, do not enable this option.

Preprocess the text data using the default options by running the live script.

Preprocess Text Data Task in Live Editor with the default options

These word clouds show the effect of the preprocessing options.

Two word clouds showing words in different font sizes. Larger font sizes indicate more frequent words in the data. The word cloud on the left has title "Raw Data" and highlights punctuation and words like "the" and "in". Words like "assembler" and "mixer" have relatively small font size. The word cloud on the right has title "Clean Data" and highlights words like "assembler" and "mixer". Words like "the" and "in" do not appear in the word cloud.
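A comparison like this one can also be produced programmatically. The following sketch applies only some of the default steps, so the resulting word clouds may differ from the ones the task displays. The variable names rawDocuments and cleanDocuments are illustrative.

```matlab
% Load the reports and tokenize the raw descriptions.
tbl = readtable("factoryReports.csv",TextType="string");
rawDocuments = tokenizedDocument(tbl.Description);

% Apply a subset of the preprocessing steps.
cleanDocuments = addPartOfSpeechDetails(rawDocuments);
cleanDocuments = normalizeWords(cleanDocuments,Style="lemma");
cleanDocuments = removeStopWords(cleanDocuments);
cleanDocuments = erasePunctuation(cleanDocuments);

% Show the raw and cleaned word clouds side by side.
figure
subplot(1,2,1)
wordcloud(rawDocuments);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanDocuments);
title("Clean Data")
```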

Clean Up HTML

Use the Clean up HTML options to preprocess HTML code.

Read HTML code from a web page using the webread function and view the first 300 characters.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
code(1:300)
ans =

    '<!DOCTYPE HTML>
     <html lang="en">
     <head>
     <title>Text Analytics Toolbox
     Documentation</title>
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
     <meta http-equiv="X-UA-Compatible" content="IE=edge">
     <link rel="canonical" href="https://www.mathworks.com/help/t'

Extract the text from the HTML by setting the Data option to code and selecting the Extract HTML text option. To ensure that the task includes code in the drop down, run the script so that code is in the MATLAB workspace.

"Data" option with "code" selected and "Extract HTML text" check box selected

These word clouds show the effect of the preprocessing options.

Two word clouds showing words in different font sizes. Larger font sizes indicate more frequent words in the data. The word cloud on the left has title "Raw Data" and highlights special characters like ">" and "=". Words like "text" and "model" have relatively small font size. The word cloud on the right has title "Clean Data" and highlights words like "text" and "model". Special characters like ">" and "=" do not appear in the word cloud.
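Outside the task, the extractHTMLText function performs a comparable extraction. This sketch assumes network access and reuses the URL from above. The task may apply additional steps beyond the ones shown here.

```matlab
% Read the raw HTML code from the web page.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

% Extract the text content and discard the HTML markup.
str = extractHTMLText(code);

% Tokenize the extracted text for further preprocessing.
documents = tokenizedDocument(str);
```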

Tokenize

Use the Tokenize options to manually specify the text language or to split the text into paragraphs or sentences.

Read the text data sonnets.txt using the extractFileText function. The output is a string scalar.

str = extractFileText("sonnets.txt");

Split the text into sentences by setting the Data option to str and setting the Split option to Sentences. To display the split text, select the Show tokenized text option.

"Data" option with "str" selected, "Split" option with "Sentences" selected, and "Show tokenized text" check box selected

Inspect the variables str and preprocessedText in the workspace. The variable str is a string scalar. The variable preprocessedText is a 558-by-1 tokenizedDocument array of sentences.

Workspace showing sizes and data type of the variables "str" and "preprocessedText".
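A similar split can be sketched in code with the splitSentences function (or splitParagraphs to split into paragraphs instead). This is an approximation of what the task does, so the number of documents may not match the task output exactly.

```matlab
% Read the text as a single string scalar.
str = extractFileText("sonnets.txt");

% Split the text into sentences. The result is a string column vector.
sentences = splitSentences(str);

% Create one tokenized document per sentence.
documents = tokenizedDocument(sentences);
size(documents)
```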

Add Token Details

tokenizedDocument objects store information about the tokens such as named entities and part-of-speech tags. You can specify which details to add using the Preprocess Text Data Live Editor task.

Create a string scalar that contains names and locations.

str = "William Shakespeare was born in Stratford-upon-Avon, England.";

Set the Data option to str. To ensure that the task includes str in the drop down, run the script so that str is in the MATLAB workspace. To detect named entities such as person names and locations, select the Detect named entities option.

"Data" option with "str" selected, and "Add part-of-speech" and "Detect named entities" check boxes selected

To view the entities in context, clear the preprocessing options that change and remove words.

"Data" option with "str" selected, and "Add part-of-speech" and "Detect named entities" check boxes selected

To view the table of token details, under Display results, select Show token details. The tokens and the detected named entities are in the Token and Entity variables of the table, respectively.

"Show token details" check box selected and table of token details. The token "William Shakespeare" has entity "person" and the tokens "Stratford-upon-Avon" and "England" have entity "location".

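The same details can be added in code. The following sketch uses addPartOfSpeechDetails, addEntityDetails, and tokenDetails with their default options. It is one way to obtain such a table and is not necessarily the exact code that the task generates.

```matlab
% Tokenize the text without changing or removing any words.
str = "William Shakespeare was born in Stratford-upon-Avon, England.";
documents = tokenizedDocument(str);

% Add part-of-speech tags and named entity tags to the tokens.
documents = addPartOfSpeechDetails(documents);
documents = addEntityDetails(documents);

% View the tokens alongside their part-of-speech and entity tags.
details = tokenDetails(documents);
details(:,["Token" "PartOfSpeech" "Entity"])
```
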
Change and Remove Words

To replace or remove words that the word normalization and stop word removal options do not cover, specify custom word lists using the Replace words and Remove words options.

Load the factory reports data.

tbl = readtable("factoryReports.csv")

In the Data option, select tbl as the input data and select the table variable Description. To ensure that the task includes tbl in the drop down, run the script so that tbl is in the MATLAB workspace.

Data options with drop down lists with "tbl" and "Description" selected

Replace the word "mixer" with "blender" using the Replace words option. Remove the word "scanner" using the Remove words option.

"Replace words" option with "mixer" and "blender" specified in the "Source" and "Target" options, respectively. "Remove words" option with "scanner" specified.

These word clouds show the effect of the preprocessing options.

Two word clouds showing words in different font sizes. Larger font sizes indicate more frequent words in the data. The word cloud on the left has title "Without Replacement and Removal" and highlights words like "mixer" and "scanner". The word "blender" is relatively small. The word cloud on the right has title "With Replacement and Removal" and highlights the word "blender". The words "mixer" and "scanner" do not appear in the word cloud.
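In code, the replaceWords and removeWords functions provide equivalent operations. This minimal sketch uses two made-up sentences rather than the factory reports data so that the effect is easy to check.

```matlab
% Hypothetical sample documents (an assumption, not from the example data).
documents = tokenizedDocument([
    "The mixer is jammed."
    "The scanner near the mixer is offline."]);

% Replace every occurrence of "mixer" with "blender".
documents = replaceWords(documents,"mixer","blender");

% Remove every occurrence of "scanner".
documents = removeWords(documents,"scanner");

% List the unique words that remain in the documents.
documents.Vocabulary
```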

Generate Code

Use the Preprocess Text Data Live Editor task to generate code to use in text preprocessing functions.

The Preprocess Text Data Live Editor task generates code in your live script. The generated code reflects the options that you select and includes code to generate the display. To see the generated code, click Show code at the bottom of the task parameter area. The task expands to display the generated code.

MATLAB code generated by Preprocess Text Data task

To reuse the same steps in your code, create a function that takes as input the text data and outputs the preprocessed text data. You can include the function at the end of a script or as a separate file. The preprocessTextData function listed at the end of the example uses the code generated by the Preprocess Text Data Live Editor task.

To use the function, specify the table as input to the preprocessTextData function.

documents = preprocessTextData(tbl);
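As a sketch of how the preprocessed documents might be used downstream, the following code builds a bag-of-words model and applies the same function to new data. It assumes that preprocessTextData is defined at the end of the script or in a file on the path. The new report text and the variable names newTbl and newDocuments are made up for illustration.

```matlab
% Assumes preprocessTextData is available, for example at the end of this script.
tbl = readtable("factoryReports.csv",TextType="string");
documents = preprocessTextData(tbl);

% Count word frequencies and list the most frequent words.
bag = bagOfWords(documents);
topWords = topkwords(bag,10)

% Preprocess a new (hypothetical) report with exactly the same steps.
newTbl = table("Coolant is pooling underneath sorter.",VariableNames="Description");
newDocuments = preprocessTextData(newTbl);
```

Because the function reads the Description variable, any new data must be passed as a table that contains a variable with that name.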

Preprocess Text Data Function

The preprocessTextData function uses the code generated by the Preprocess Text Data Live Editor task. The function takes as input the table tbl and returns the preprocessed text preprocessedText. The function performs these steps:

  1. Extract the text data from the Description variable of the input table.

  2. Tokenize the text using tokenizedDocument.

  3. Add part-of-speech details using addPartOfSpeechDetails.

  4. Lemmatize the words using normalizeWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Remove stop words (such as "and", "of", and "the") using removeStopWords.

  8. Erase punctuation using erasePunctuation.

function preprocessedText = preprocessTextData(tbl)

%% Preprocess Text
preprocessedText = tbl.Description;

% Tokenize
preprocessedText = tokenizedDocument(preprocessedText);

% Add token details
preprocessedText = addPartOfSpeechDetails(preprocessedText);

% Change and remove words
preprocessedText = normalizeWords(preprocessedText,Style="lemma");
preprocessedText = removeShortWords(preprocessedText,2);
preprocessedText = removeLongWords(preprocessedText,15);
preprocessedText = removeStopWords(preprocessedText,IgnoreCase=false);
preprocessedText = erasePunctuation(preprocessedText);

end
