mecabOptions

Options for MeCab tokenization

Since R2019b

Description

A mecabOptions object specifies additional options for tokenizing Japanese and Korean text.

To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.

Creation

Description

options = mecabOptions creates a MeCab tokenization option set with the default values for tokenizing Japanese.

options = mecabOptions(Name,Value) sets additional Properties using one or more name-value pair arguments.

Properties

Model – Path to trained model (MeCab dictionary), specified as a string scalar or a character vector.

The default value is a path to the internal dictionary for Japanese tokenization.

Example: "C:\myDict"

Data Types: char | string

UserModel – Files containing model extensions (MeCab user dictionary .dic files), specified as a string array, a character vector, or a cell array of character vectors.

Example: "C:\myFile.dic"

Data Types: char | string | cell

LemmaExtractor – Function extracting lemmata from the MeCab reply, specified as a function handle.

The function must have the form lemmata = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

  • Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.

  • PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output lemmata is a string array of the same size as words containing the extracted lemmata.

The default lemma extractor is the textanalytics.ja.mecabToLemma function.

Data Types: function_handle
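
As a minimal sketch of the function form described above, the following handle returns each token unchanged as its own lemma. The name identityLemma is hypothetical, and the info argument is ignored here; a practical extractor would typically parse info.Feature instead.

```matlab
% Hypothetical extractor: returns each token unchanged as its lemma.
% The info argument (fields Feature and PartOfSpeech) is ignored.
% The output is a string array of the same size as words, as required.
identityLemma = @(words,info) words;

% Pass the handle using the 'LemmaExtractor' name-value argument.
options = mecabOptions('LemmaExtractor',identityLemma);
```

Because the output has the same size and type as words, it satisfies the documented contract, although it performs no actual lemmatization.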

POSExtractor – Function extracting part-of-speech information from the MeCab reply, specified as a function handle.

The function must have the form posTags = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

  • Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.

  • PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output posTags is a categorical array of the same size as words containing the extracted part-of-speech tags from the following categories:

  • adjective

  • adposition

  • adverb

  • auxiliary-verb

  • coord-conjunction

  • determiner

  • interjection

  • noun

  • numeral

  • pronoun

  • proper-noun

  • punctuation

  • symbol

  • verb

  • other

The default part-of-speech information extractor is the textanalytics.ja.mecabToPOS function.

Data Types: function_handle
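
The output contract above can be illustrated with a placeholder extractor that tags every token as other. This is only a sketch of the required output type and size; the names posCategories and allOther are hypothetical, and a real extractor would derive the tags from info.Feature or info.PartOfSpeech.

```matlab
% Categories listed above for the part-of-speech output.
posCategories = ["adjective" "adposition" "adverb" "auxiliary-verb" ...
    "coord-conjunction" "determiner" "interjection" "noun" "numeral" ...
    "pronoun" "proper-noun" "punctuation" "symbol" "verb" "other"];

% Hypothetical extractor: tags every token as "other".
% repmat keeps the output the same size as words.
allOther = @(words,info) categorical(repmat("other",size(words)),posCategories);

% Pass the handle using the 'POSExtractor' name-value argument.
options = mecabOptions('POSExtractor',allOther);
```

Supplying the full category list as the value set keeps every documented category defined in the output, even though only one is used.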

NERExtractor – Function extracting named entity information from the MeCab reply, specified as a function handle.

The function must have the form entities = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

  • Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.

  • PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output entities is a categorical array of the same size as words containing the extracted entities from the following categories:

  • non-entity

  • person

  • organization

  • location

  • other

The default named entity information extractor is the textanalytics.ja.mecabToNER function.

Data Types: function_handle
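
As a rough sketch of a custom entity extractor, the following script labels tokens found in a small lookup list as location and everything else as non-entity. The function name lookupEntities and the list knownLocations are hypothetical and exist only for illustration; the info argument is not used.

```matlab
% Pass a handle to the local function using the 'NERExtractor' argument.
options = mecabOptions('NERExtractor',@lookupEntities);

function entities = lookupEntities(words,info)
    % Categories listed above for the named entity output.
    entityCategories = ["non-entity" "person" "organization" "location" "other"];

    % Hypothetical lookup list of location names.
    knownLocations = ["東京" "京都"];

    % Start with "non-entity" for every token, then mark the matches.
    tags = repmat("non-entity",size(words));
    tags(ismember(words,knownLocations)) = "location";

    % Convert to a categorical array of the same size as words.
    entities = categorical(tags,entityCategories);
end
```

A lookup list is used here only to keep the sketch short; it ignores context, so it is not a substitute for the default extractor.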

Examples

Create a MecabOptions object containing the default options for Japanese tokenization.

options = mecabOptions
options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: ""
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER

Tokenize Japanese text using custom MeCab options.

Create a string array of Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];

Create a MecabOptions object and specify a user model as a .dic file using the 'UserModel' option.

options = mecabOptions('UserModel','myFile.dic')
options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: "myFile.dic"
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER

Tokenize the text with the specified options by passing them to the 'TokenizeMethod' option of tokenizedDocument.

documents = tokenizedDocument(str,'TokenizeMethod',options)
documents = 
  4×1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。
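
To inspect what the extractors produced for each token, one option is the tokenDetails function of Text Analytics Toolbox. The sketch below assumes the documents variable from the previous step; the variable name tdetails is arbitrary, and the exact set of table variables returned may depend on the release.

```matlab
% Return a table with one row per token in documents.
tdetails = tokenDetails(documents);

% Show the first few rows of the table.
head(tdetails)
```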

Version History

Introduced in R2019b