# embed

## Description

The embed operation converts numeric indices to numeric vectors, where the indices correspond to discrete data. Use embeddings to map discrete data such as categorical values or words to numeric vectors.
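As a minimal sketch of the basic usage (the weights, index values, and sizes here are illustrative, not taken from the examples below):

```matlab
% Map numeric indices to embedding vectors (illustrative sizes).
embeddingDimension = 8;
vocabularySize = 4;

% One extra column holds the out-of-vocabulary embedding vector.
% Because X below is a plain numeric array, weights must be a dlarray.
weights = dlarray(rand(embeddingDimension,vocabularySize+1));

X = [2 1 3 4 1];  % one numeric index per observation

% 'DataFormat','B' labels the dimension as batch; the result is an
% 8(C)-by-5(B) dlarray containing one embedding vector per observation.
dlY = embed(X,weights,'DataFormat','B');
```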

**Note**

This function applies the embed operation to `dlarray` data. If you want to apply the embed operation within a `dlnetwork` object, use `wordEmbeddingLayer` (Text Analytics Toolbox).

## Examples

### Embed Categorical Data

Embed a mini-batch of categorical features.

Create an array of categorical features containing 5 observations with values `"Male"` or `"Female"`.

```matlab
X = categorical(["Male" "Female" "Male" "Female" "Female"])';
```

Initialize the embedding weights. Specify an embedding dimension of 10, and a vocabulary corresponding to the number of categories of the input data plus one.

```matlab
embeddingDimension = 10;
vocabularySize = numel(categories(X));
weights = rand(embeddingDimension,vocabularySize+1);
```

To embed the categorical data, first convert it to a mini-batch of numeric indices.

```matlab
X = double(X)
```

```
X = 5×1

     2
     1
     2
     1
     1
```

For formatted `dlarray` input, the `embed` function expands into the `'C'` (channel) dimension, which must be a singleton dimension with size 1. Create a formatted `dlarray` object containing the data. To specify that the rows correspond to observations, specify the format `'BC'` (batch, channel).

```matlab
dlX = dlarray(X,'BC')
```

```
dlX = 
  1(C) x 5(B) dlarray

     2     1     2     1     1
```

Embed the numeric indices using the `embed` function. The `embed` function expands into the `'C'` dimension.

```matlab
dlY = embed(dlX,weights)
```

```
dlY = 
  10(C) x 5(B) dlarray

    0.1576    0.8147    0.1576    0.8147    0.8147
    0.9706    0.9058    0.9706    0.9058    0.9058
    0.9572    0.1270    0.9572    0.1270    0.1270
    0.4854    0.9134    0.4854    0.9134    0.9134
    0.8003    0.6324    0.8003    0.6324    0.6324
    0.1419    0.0975    0.1419    0.0975    0.0975
    0.4218    0.2785    0.4218    0.2785    0.2785
    0.9157    0.5469    0.9157    0.5469    0.5469
    0.7922    0.9575    0.7922    0.9575    0.9575
    0.9595    0.9649    0.9595    0.9649    0.9649
```

In this case, the output is an `embeddingDimension`-by-`N` matrix with format `'CB'` (channel, batch), where `N` is the number of observations. Each column contains the embedding vector of one observation.

### Embed Text Data

Embed a mini-batch of text data.

```matlab
textData = [
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."];
```

Create an array of tokenized documents.

```matlab
documents = tokenizedDocument(textData);
```

To encode text data as sequences of numeric indices, create a `wordEncoding` object.

```matlab
enc = wordEncoding(documents);
```

Initialize the embedding weights. Specify an embedding dimension of 100 and a vocabulary size corresponding to the number of words in the word encoding plus one.

```matlab
embeddingDimension = 100;
vocabularySize = enc.NumWords;
weights = rand(embeddingDimension,vocabularySize+1);
```

Convert the tokenized documents to sequences of numeric indices using the `doc2sequence` function. The `doc2sequence` function, by default, discards out-of-vocabulary tokens in the input data. To map out-of-vocabulary tokens to the last vector of the embedding weights instead, set the `'UnknownWord'` option to `'nan'`. The `doc2sequence` function, by default, left-pads the input sequences with zeros so that they have the same length.

```matlab
sequences = doc2sequence(enc,documents,'UnknownWord','nan')
```

```
sequences=2×1 cell array
    {[ 0  1  2  3  4  5  6  7  8  9 10]}
    {[11 12 13 14 15  2 16 17 18 19 10]}
```

The output is a cell array, where each element corresponds to an observation. Each element is a row vector of numeric indices representing the individual tokens in the corresponding observation, including the padding values.

Convert the cell array to a numeric array by vertically concatenating the rows.

```matlab
X = cat(1,sequences{:})
```

```
X = 2×11

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
```

Convert the numeric indices to a `dlarray` object. Because the rows and columns of `X` correspond to observations and time steps, respectively, specify the format `'BT'`.

```matlab
dlX = dlarray(X,'BT')
```

```
dlX = 
  2(B) x 11(T) dlarray

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
```

Embed the numeric indices using the `embed` function. The `embed` function maps the padding tokens (tokens with index 0) and any other out-of-vocabulary tokens to the same out-of-vocabulary embedding vector.

```matlab
dlY = embed(dlX,weights);
```

In this case, the output is an `embeddingDimension`-by-`N`-by-`S` array with format `'CBT'` (channel, batch, time), where `N` and `S` are the number of observations and the number of time steps, respectively. The vector `dlY(:,n,t)` corresponds to the embedding vector of time step `t` of observation `n`.

## Input Arguments

`X` — Input data
`dlarray` object | numeric array

Input data, specified as a formatted `dlarray` object, an unformatted `dlarray` object, or a numeric array. The elements of `X` must be nonnegative integers or `NaN`.

The function returns the embedding vectors in `weights` corresponding to the numeric indices in `X`. If a value in `X` is zero, `NaN`, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.

When `X` is not a formatted `dlarray` object, you must specify the dimension label format using the `'DataFormat'` option. Also, if `X` is a numeric array, then `weights` must be a `dlarray` object.

The embed operation expands into the singleton channel dimension of the input data, specified by the `'C'` dimension label. If the data has no specified channel dimension, then the function assumes an unspecified singleton channel dimension.
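For instance, a sketch of the singleton-channel expansion for unformatted input (the sizes here are illustrative):

```matlab
% Unformatted numeric input with an explicit singleton channel dimension.
weights = dlarray(rand(10,6));   % K = 10, V = 5, plus one OOV column
X = [2 1 2 1 1];                 % 1-by-5: one channel, five observations

% The 'C' dimension of size 1 expands to size K = 10.
dlY = embed(X,weights,'DataFormat','CB');
size(dlY)                        % 10-by-5
```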

`weights` — Embedding weights
`dlarray` object | numeric array

Embedding weights, specified as a formatted `dlarray` object, an unformatted `dlarray` object, or a numeric array.

The matrix `weights` specifies the embedding dimension, the vocabulary size, and the embedding vectors.

The embedding dimension is the number of components `K` of the embedding. That is, the embedding maps numeric indices to vectors of length `K`. The vocabulary size is the number of discrete elements `V` in the embedding, that is, the number of discrete elements of the underlying data that the embedding supports. The embedding maps out-of-vocabulary indices to the same out-of-vocabulary embedding vector.

If `weights` is a formatted `dlarray` object, then it must have format `'CU'` or `'UC'`. The dimensions corresponding to the labels `'C'` and `'U'` must have size `K` and `V`+1, respectively, where `K` and `V` represent the embedding dimension and the vocabulary size, respectively. The extra vector corresponds to the out-of-vocabulary embedding vector.

If `weights` is not a formatted `dlarray` object, then `weights` must be a `K`-by-(`V`+1) matrix, where `K` and `V` represent the embedding dimension and vocabulary size, respectively.
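For example, a sketch of initializing formatted weights (the sizes here are illustrative):

```matlab
% Formatted weights: the 'C' dimension has size K and the 'U' dimension
% has size V+1, where the extra vector is the out-of-vocabulary vector.
K = 10;    % embedding dimension
V = 5;     % vocabulary size
weights = dlarray(rand(K,V+1),'CU');

% Index 6 exceeds the vocabulary size V, so it maps to the OOV vector.
dlX = dlarray([1 5 6],'B');
dlY = embed(dlX,weights);   % 10(C)-by-3(B) dlarray
```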

The function returns the embedding vectors in `weights` corresponding to the numeric indices in `X`. If a value in `X` is zero, `NaN`, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.

`FMT` — Description of data dimensions
character vector | string scalar

Description of the data dimensions, specified as a character vector or string scalar.

A data format is a string of characters, where each character describes the type of the corresponding data dimension.

The characters are:

- `"S"` — Spatial
- `"C"` — Channel
- `"B"` — Batch
- `"T"` — Time
- `"U"` — Unspecified

For example, consider an array containing a batch of sequences where the first, second, and third dimensions correspond to channels, observations, and time steps, respectively. You can specify that this array has the format `"CBT"` (channel, batch, time).

You can specify multiple dimensions labeled `"S"` or `"U"`. You can use each of the labels `"C"`, `"B"`, and `"T"` at most once. The software ignores singleton trailing `"U"` dimensions after the second dimension.

If the input data is not a formatted `dlarray` object, then you must specify the `FMT` option.

For more information, see Deep Learning Data Formats.

**Data Types:** `char` | `string`

## Output Arguments

`Y` — Embedding vectors
`dlarray` object

Embedding vectors, returned as a `dlarray` object. The output `Y` has the same underlying data type as the input `X`.

The function returns the embedding vectors in `weights` corresponding to the numeric indices in `X`. If a value in `X` is zero, `NaN`, or greater than the vocabulary size, then the function returns the out-of-vocabulary vector for that element.

The embedding vectors have `K` elements, where `K` is the embedding dimension. The sizes of the dimensions of `Y` depend on the input data:

- If `X` is a formatted `dlarray` object with a `'C'` dimension label, then the embed operation expands into that dimension. That is, the output has the same dimension format as the input, the `'C'` dimension has size `K`, and the other dimensions have the same size as the corresponding dimensions of the input.
- If `X` is a formatted `dlarray` object without a `'C'` dimension, then the operation assumes a singleton channel dimension. The output has the same format as the input plus a `'C'` dimension of size `K`, and the other dimensions have the same size and labels as the corresponding dimensions of the input.
- If `X` is not a formatted `dlarray` object and `'DataFormat'` contains a `'C'` dimension, then the embed operation expands into that dimension. That is, the output has the same number of dimensions as the input, the dimension corresponding to the `'C'` label has size `K`, and the other dimensions have the same size as the corresponding dimensions of the input.
- If `X` is not a formatted `dlarray` object and `'DataFormat'` does not contain a `'C'` dimension, then the embed operation inserts a new dimension at the beginning. That is, the output has one more dimension than the input, the first dimension (corresponding to `'C'`) has size `K`, and the other dimensions have the same size as the corresponding dimensions of the input.
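As a sketch of the last case, where `'DataFormat'` has no `'C'` label (the sizes here are illustrative):

```matlab
% Without a 'C' label, embed inserts the channel dimension first.
weights = dlarray(rand(10,6));   % K = 10, V + 1 = 6
X = [1 2; 3 4];                  % 2-by-2 indices: batch-by-time

dlY = embed(X,weights,'DataFormat','BT');
size(dlY)                        % 10-by-2-by-2, format 'CBT'
```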

## Algorithms

### Deep Learning Array Formats

Most deep learning networks and functions operate on different dimensions of the input data in different ways.

For example, an LSTM operation iterates over the time dimension of the input data, and a batch normalization operation normalizes over the batch dimension of the input data.

To provide input data with labeled dimensions or input data with additional layout information, you can use *data formats*.

A data format is a string of characters, where each character describes the type of the corresponding data dimension.

The characters are:

- `"S"` — Spatial
- `"C"` — Channel
- `"B"` — Batch
- `"T"` — Time
- `"U"` — Unspecified

For example, consider an array containing a batch of sequences where the first, second, and third dimensions correspond to channels, observations, and time steps, respectively. You can specify that this array has the format `"CBT"` (channel, batch, time).

To create formatted input data, create a `dlarray` object and specify the format using the second argument.

To provide additional layout information with unformatted data, specify the format using the `FMT` argument.

For more information, see Deep Learning Data Formats.

## Extended Capabilities

### GPU Arrays

Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

The `embed` function supports GPU array input with these usage notes and limitations:

- When at least one of the following input arguments is a `gpuArray` or a `dlarray` with underlying data of type `gpuArray`, this function runs on the GPU:
  - `X`
  - `weights`

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
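For example, a sketch of running `embed` on the GPU (assumes a supported GPU and Parallel Computing Toolbox; the sizes are illustrative):

```matlab
% Move the weights to the GPU; embed then executes on the GPU.
weights = dlarray(gpuArray(rand(10,6)));
dlX = dlarray([2 1 2],'B');
dlY = embed(dlX,weights);

% The result has underlying data of type gpuArray.
class(extractdata(dlY))
```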

## Version History

**Introduced in R2020b**
