erasePunctuation

Erase punctuation from text and documents

Syntax

newStr = erasePunctuation(str)
newDocuments = erasePunctuation(documents)
newDocuments = erasePunctuation(documents,'TokenTypes',types)

Description

newStr = erasePunctuation(str) erases punctuation and symbols from the elements of str. The function removes characters that belong to the Unicode punctuation or symbol classes.

newDocuments = erasePunctuation(documents) erases punctuation and symbols from documents. If a word is empty after removing punctuation and symbol characters, then the function removes it. For tokenized document input, the function erases punctuation only from tokens with type 'punctuation' and 'other'. As a result, the function does not erase punctuation and symbol characters from URLs and email addresses.

newDocuments = erasePunctuation(documents,'TokenTypes',types) erases punctuation and symbols from only the specified token types.

Examples

Erase the punctuation from the text in str.

str = "it's one and/or two.";
newStr = erasePunctuation(str)
newStr = 
"its one andor two"

To insert a space where the "/" symbol is, first use the replace function.

newStr = replace(str,"/"," ")
newStr = 
"it's one and or two."
newStr = erasePunctuation(newStr)
newStr = 
"its one and or two"

Erase the punctuation from an array of documents.

documents = tokenizedDocument([ ...
    "An example of a short sentence." 
    "Another example... with a URL: https://www.mathworks.com"])
documents = 
  2x1 tokenizedDocument:

     7 tokens: An example of a short sentence .
    10 tokens: Another example . . . with a URL : https://www.mathworks.com

newDocuments = erasePunctuation(documents)
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: An example of a short sentence
    6 tokens: Another example with a URL https://www.mathworks.com

Here, the function does not erase the punctuation symbols from the URL.
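The 'TokenTypes' syntax has no worked example above, so the following is a minimal sketch rather than an official example. It restricts erasure to tokens of type 'punctuation' and reuses the second document from the previous example. The variable names types and tdetails are illustrative only.

```matlab
% Sketch: erase punctuation only from tokens whose type is 'punctuation'.
documents = tokenizedDocument("Another example... with a URL: https://www.mathworks.com");

% Illustrative choice of token types; any of the types listed under
% Input Arguments can be supplied instead.
types = "punctuation";
newDocuments = erasePunctuation(documents,'TokenTypes',types)

% tokenDetails lists each remaining token with its detected type, which
% helps confirm which tokens were affected.
tdetails = tokenDetails(newDocuments)
```

Because the URL is detected as a 'web-address' token, it should be left unchanged by this call. Including 'web-address' in types would target the URL instead.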

Input Arguments

str – Input text, specified as a string array, character vector, or cell array of character vectors.

Example: ["An example of a short sentence."; "A second short sentence."]

Data Types: string | char | cell

documents – Input documents, specified as a tokenizedDocument array.

types – Token types to erase punctuation from, specified as a character vector, string array, or cell array of character vectors containing one or more of the following token types:

  • 'letters' – string of letter characters only

  • 'digits' – string of digits only

  • 'punctuation' – string of punctuation and symbol characters only

  • 'email-address' – detected email address

  • 'web-address' – detected web address

  • 'hashtag' – detected hashtag (starts with "#" character followed by a letter)

  • 'at-mention' – detected at-mention (starts with "@" character)

  • 'emoticon' – detected emoticon

  • 'emoji' – detected emoji

  • 'other' – does not belong to previous types

Data Types: string | char | cell

Output Arguments

newStr – Output text, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.

newDocuments – Output documents, returned as a tokenizedDocument array.
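The statement that the text output has the same data type as the text input can be checked with a short sketch. The inputs and the variable names strOut, charOut, and cellOut are made up for illustration.

```matlab
% Sketch: the class of the output follows the class of the input.
strOut  = erasePunctuation("it's one and/or two.");    % string input
charOut = erasePunctuation('it''s one and/or two.');   % character vector input
cellOut = erasePunctuation({'one, two.','three!'});    % cell array input

% Display the class of each result.
class(strOut)
class(charOut)
class(cellOut)
```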

More About

Unicode Character Categories

Each Unicode character is assigned a category. The following table summarizes the Unicode punctuation and symbol categories and provides an example character from each category:

Category                      Category Code   Number of Characters   Example Character
Punctuation, Connector        [Pc]            10                     _
Punctuation, Dash             [Pd]            24                     -
Punctuation, Close            [Pe]            73                     )
Punctuation, Final quote      [Pf]            10
Punctuation, Initial quote    [Pi]            12
Punctuation, Other            [Po]            566                    !
Punctuation, Open             [Ps]            75                     (
Symbol, Currency              [Sc]            54                     $
Symbol, Modifier              [Sk]            121                    ^
Symbol, Math                  [Sm]            948                    +
Symbol, Other                 [So]            5855                   ¦

For more information, see [1].
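As a quick illustration of these categories, the sketch below builds a string that interleaves letters with one ASCII example character from several rows of the table. The test string is invented for this purpose. If the function behaves as described, only the letters should remain.

```matlab
% Sketch: letters separated by example characters from the table above.
% _ [Pc], - [Pd], ( [Ps], ) [Pe], ! [Po], $ [Sc], ^ [Sk], + [Sm]
str = "a_b-c(d)e!f$g^h+i";

% Characters in the punctuation and symbol categories are erased.
newStr = erasePunctuation(str)
```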

Tips

  • For string input, erasePunctuation also removes punctuation characters from URLs and HTML tags. This behavior can prevent the functions eraseTags, eraseURLs, and decodeHTMLEntities from working as expected. To preprocess text with these functions, call them before erasePunctuation.
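The ordering described in this tip can be sketched as follows. The input string is hypothetical, and the order of the first three calls is one reasonable choice, not a documented requirement. The only point taken from the tip is that erasePunctuation runs last.

```matlab
% Hypothetical raw text containing HTML tags, an HTML entity, and a URL.
str = "<b>Read more</b> at https://www.mathworks.com &amp; elsewhere.";

% Run the markup-aware functions first, while the tags, URL, and
% entity are still intact.
str = eraseTags(str);
str = eraseURLs(str);
str = decodeHTMLEntities(str);

% Erase the remaining punctuation last.
newStr = erasePunctuation(str)
```

Calling erasePunctuation first would strip characters such as "<", ">", "/", "&", and ";", so the later functions would no longer recognize the tags, URL, or entity.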

Compatibility Considerations

Behavior changed in R2018b

References

Introduced in R2017b