wordTokenize

Tokenize text into words using tokenizer

Since R2023b

    Description

    words = wordTokenize(tokenizer,str) tokenizes the text in str into words using the specified tokenizer.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Tokenize the text "Bidirectional Encoder Representations from Transformers" into words using the wordTokenize function.

    str = "Bidirectional Encoder Representations from Transformers";
    words = wordTokenize(tokenizer,str)
    words = 1×1 cell array
        {["Bidirectional"    "Encoder"    "Representations"    "from"    "Transformers"]}
    
    

    Input Arguments

    tokenizer: Tokenizer, specified as a bertTokenizer or bpeTokenizer object.

    str: Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell
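As a minimal sketch of the accepted input types, the following passes a string array, a character vector, and a cell array of character vectors to wordTokenize. It assumes the tokenizer is obtained from the bert function, as in the example above, and that the pretrained BERT-Base model is installed. The variable names are illustrative only, and no output is shown because the exact word boundaries depend on the tokenizer.

```matlab
% Load the tokenizer that accompanies the pretrained BERT-Base network.
% Assumption: the BERT-Base support files are available on this machine.
[~,tokenizer] = bert;

% String array input: one element per piece of text.
str = ["An example of a short sentence."; "A second short sentence."];
wordsFromString = wordTokenize(tokenizer,str);

% Character vector input: a single piece of text.
wordsFromChar = wordTokenize(tokenizer,'A second short sentence.');

% Cell array of character vectors.
txt = {'An example of a short sentence.','A second short sentence.'};
wordsFromCell = wordTokenize(tokenizer,txt);
```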

    Output Arguments

    words: Tokenized words, returned as a cell array of string arrays.

    Data Types: cell
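Because the output is a cell array of string arrays, each cell can be indexed with braces to get the words for one input. The sketch below is one possible way to inspect the result. It assumes the tokenizer comes from the bert function, and the variable names firstWords and numWords are illustrative, not part of the documented interface.

```matlab
% Assumption: the pretrained BERT-Base support files are installed.
[~,tokenizer] = bert;

str = ["An example of a short sentence."; "A second short sentence."];
words = wordTokenize(tokenizer,str);

% Brace indexing extracts the string array of words for the first input.
firstWords = words{1};

% Count the number of words in each cell.
numWords = cellfun(@numel,words);

% Print the words of the first input, one per line.
fprintf("%s\n",firstWords);
```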

    Algorithms

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b