Embed a mini-batch of text data.
Create an array of tokenized documents.
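For example (the original documents are not shown, so the two sentences below are hypothetical; they are chosen only to yield the 10- and 11-token sequences that appear in the outputs later in this example):

documents = tokenizedDocument([
    "Deep learning models can learn features directly from data."
    "A simple example shows how learning maps words to vectors."]);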
To encode text data as sequences of numeric indices, create a wordEncoding object.
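For example, assuming the hypothetical documents variable from above:

enc = wordEncoding(documents);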
Initialize the embedding weights. Specify an embedding dimension of 100, and a vocabulary size that is consistent with the word encoding: the number of words in the word encoding plus one, where the extra vector corresponds to padding and out-of-vocabulary tokens.
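A minimal sketch, assuming random initialization and the enc variable from above (the extra column holds the vector used for padding and out-of-vocabulary tokens):

embeddingDimension = 100;
numWords = enc.NumWords;
weights = rand(embeddingDimension,numWords+1);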
Convert the tokenized documents to sequences of word indices using the doc2sequence function. By default, the doc2sequence function discards out-of-vocabulary tokens in the input data. To instead map out-of-vocabulary tokens to the last vector of the embedding weights, set the 'UnknownWord' option to 'nan'. Also by default, the doc2sequence function left-pads the input sequences with zeros so that they have the same length.
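Assuming the enc and documents variables sketched above, the call takes this form (the display that follows is from the original example):

sequences = doc2sequence(enc,documents,'UnknownWord','nan')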
sequences=2×1 cell array
    {1x11 double}
    {1x11 double}
The output is a cell array, where each element corresponds to an observation. Each element is a row vector of word indices representing the individual tokens in the corresponding observation, including the padding values.
Convert the cell array to a numeric array by vertically concatenating the rows.
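One way to do this, assuming the sequences variable from above:

X = cat(1,sequences{:})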
X = 2×11

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
Convert the numeric indices to a dlarray object. Because the rows and columns of X correspond to observations and time steps, respectively, specify the format 'BT'.
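For example, assuming the X array from the previous step:

dlX = dlarray(X,'BT')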
dlX =
  2(B) x 11(T) dlarray

     0     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15     2    16    17    18    19    10
Embed the numeric indices using the embed function. The embed function maps the padding tokens (tokens with index 0) and any other out-of-vocabulary tokens to the same out-of-vocabulary embedding vector.
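Assuming the weights matrix sketched earlier:

dlY = embed(dlX,weights);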
In this case, the output is an embeddingDimension-by-N-by-S matrix with format 'CBT', where N and S are the number of observations and the number of time steps, respectively. The vector dlY(:,n,t) corresponds to the embedding vector of time step t of observation n.
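To check the result, you can inspect the size and data format of dlY. With the sketch values above (embedding dimension 100, two observations, and 11 time steps), size returns 100 2 11 and dims returns 'CBT':

size(dlY)
dims(dlY)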