Define Text Encoder Model Function
This example shows how to define a text encoder model function.
In the context of deep learning, an encoder is the part of a deep learning network that maps the input to some latent space. You can use the encoded vectors for various tasks. For example:
Classification, by applying a softmax operation to the encoded data and using cross-entropy loss (see the sketch after this list).
Sequence-to-sequence translation, by using the encoded vector as a context vector.
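For instance, a classification head on top of the encoder output could look like the following minimal sketch. The fcClassifier parameters, the encoded data Z, and the one-hot targets T are hypothetical and are not defined in this example.
% Hypothetical classification head on top of encoded vectors Z
% (latentDimension-by-miniBatchSize dlarray with format "CB").
% parameters.fcClassifier and the one-hot targets T are assumed to exist
% for illustration only.
Y = fullyconnect(Z,parameters.fcClassifier.Weights,parameters.fcClassifier.Bias);
Y = softmax(Y);
loss = crossentropy(Y,T);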
Load Data
The file sonnets.txt contains all of Shakespeare's sonnets in a single text file. Read the sonnets data from the file "sonnets.txt".
filename = "sonnets.txt";
textData = fileread(filename);
The sonnets are indented by two whitespace characters. Remove the indentations using the replace function and split the text into separate lines using the split function. Remove the header (the first nine elements) and the short sonnet titles.
textData = replace(textData,"  ","");
textData = split(textData,newline);
textData(1:9) = [];
textData(strlength(textData)<5) = [];
Prepare Data
Create a function that tokenizes and preprocesses the text data. The function preprocessText, listed at the end of the example, performs these steps:
Prepends and appends each input string with the specified start and stop tokens, respectively.
Tokenizes the text using tokenizedDocument.
Preprocess the text data and specify the start and stop tokens "<start>" and "<stop>", respectively.
startToken = "<start>";
stopToken = "<stop>";
documents = preprocessText(textData,startToken,stopToken);
Create a word encoding object from the tokenized documents.
enc = wordEncoding(documents);
When training a deep learning model, the input data must be a numeric array containing sequences of a fixed length. Because the documents have different lengths, you must pad the shorter sequences with a padding value.
Recreate the word encoding to also include a padding token and determine the index of that token.
paddingToken = "<pad>";
newVocabulary = [enc.Vocabulary paddingToken];
enc = wordEncoding(newVocabulary);
paddingIdx = word2ind(enc,paddingToken)
paddingIdx = 3595
Initialize Model Parameters
The goal of the encoder is to map sequences of word indices to vectors in some latent space.
Initialize the parameters for the following model.
This model uses three operations:
The embedding maps word indices in the range 1 through vocabularySize to vectors of dimension embeddingDimension, where vocabularySize is the number of words in the encoding vocabulary and embeddingDimension is the number of components learned by the embedding.
The LSTM operation takes as input sequences of word vectors and outputs 1-by-numHiddenUnits vectors, where numHiddenUnits is the number of hidden units in the LSTM operation.
The fully connected operation multiplies the input by a weight matrix, adds a bias, and outputs vectors of size latentDimension, where latentDimension is the dimension of the latent space.
Specify the dimensions of the parameters.
embeddingDimension = 100;
numHiddenUnits = 150;
latentDimension = 50;
vocabularySize = enc.NumWords;
Create a struct for the parameters.
parameters = struct;
Initialize the weights of the embedding from a Gaussian distribution using the initializeGaussian function, which is attached to this example as a supporting file. Specify a mean of 0 and a standard deviation of 0.01. To learn more, see Gaussian Initialization.
mu = 0;
sigma = 0.01;
parameters.emb.Weights = initializeGaussian([embeddingDimension vocabularySize],mu,sigma);
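The initializeGaussian supporting file is not listed in this example. A minimal sketch consistent with the call above, assuming the initializer returns a single-precision dlarray of normally distributed values, might look like this:
function weights = initializeGaussian(sz,mu,sigma)
% Sample from a Gaussian distribution with mean mu and standard deviation
% sigma, and return the result as a dlarray.
weights = randn(sz,"single")*sigma + mu;
weights = dlarray(weights);
end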
Initialize the learnable parameters for the encoder LSTM operation (minimal sketches of these initializers appear after the initialization code below):
Initialize the input weights with the Glorot initializer using the initializeGlorot function, which is attached to this example as a supporting file. To learn more, see Glorot Initialization.
Initialize the recurrent weights with the orthogonal initializer using the initializeOrthogonal function, which is attached to this example as a supporting file. To learn more, see Orthogonal Initialization.
Initialize the bias with the unit forget gate initializer using the initializeUnitForgetGate function, which is attached to this example as a supporting file. To learn more, see Unit Forget Gate Initialization.
The sizes of the learnable parameters depend on the size of the input. Because the inputs to the LSTM operation are sequences of word vectors from the embedding operation, the number of input channels is embeddingDimension.
The input weight matrix has size 4*numHiddenUnits-by-inputSize, where inputSize is the dimension of the input data.
The recurrent weight matrix has size 4*numHiddenUnits-by-numHiddenUnits.
The bias vector has size 4*numHiddenUnits-by-1.
sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;

parameters.lstmEncoder.InputWeights = initializeGlorot(sz,numOut,numIn);
parameters.lstmEncoder.RecurrentWeights = initializeOrthogonal([4*numHiddenUnits numHiddenUnits]);
parameters.lstmEncoder.Bias = initializeUnitForgetGate(numHiddenUnits);
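The initializeGlorot, initializeOrthogonal, and initializeUnitForgetGate supporting files are not listed in this example. Minimal sketches consistent with the calls above, assuming each initializer returns a single-precision dlarray and lives in its own supporting file, might look like this:
function weights = initializeGlorot(sz,numOut,numIn)
% Glorot (Xavier) uniform initialization over [-bound, bound].
Z = 2*rand(sz,"single") - 1;
bound = sqrt(6 / (numIn + numOut));
weights = dlarray(bound * Z);
end

function parameter = initializeOrthogonal(sz)
% Orthogonal initialization from the QR decomposition of a random matrix.
Z = randn(sz,"single");
[Q,R] = qr(Z,0);
D = diag(R);
parameter = dlarray(Q * diag(D ./ abs(D)));
end

function bias = initializeUnitForgetGate(numHiddenUnits)
% Zero bias with the forget gate section set to 1, assuming the
% input, forget, cell candidate, output gate order used by the lstm function.
bias = zeros(4*numHiddenUnits,1,"single");
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;
bias = dlarray(bias);
end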
Initialize the learnable parameters for the encoder fully connected operation:
Initialize the weights with the Glorot initializer.
Initialize the bias with zeros using the initializeZeros function, which is attached to this example as a supporting file. To learn more, see Zeros Initialization.
The sizes of the learnable parameters depend on the size of the input. Because the inputs to the fully connected operation are the outputs of the LSTM operation, the number of input channels is numHiddenUnits. To make the fully connected operation output vectors with size latentDimension, specify an output size of latentDimension.
The weight matrix has size outputSize-by-inputSize, where outputSize and inputSize correspond to the output and input dimensions, respectively.
The bias vector has size outputSize-by-1.
sz = [latentDimension numHiddenUnits];
numOut = latentDimension;
numIn = numHiddenUnits;

parameters.fcEncoder.Weights = initializeGlorot(sz,numOut,numIn);
parameters.fcEncoder.Bias = initializeZeros([latentDimension 1]);
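The initializeZeros supporting file is also not listed here. A minimal sketch consistent with the call above might look like this:
function parameter = initializeZeros(sz)
% Return a dlarray of zeros with the specified size.
parameter = zeros(sz,"single");
parameter = dlarray(parameter);
end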
Define Model Encoder Function
Create the function modelEncoder, listed in the Encoder Model Function section of the example, that computes the output of the encoder model. The modelEncoder function takes as input sequences of word indices, the model parameters, and the sequence lengths, and returns the corresponding latent feature vectors.
Prepare Mini-Batch of Data
To train the model using a custom training loop, you must iterate over mini-batches of data and convert each mini-batch into the format required by the encoder model function and the model loss function. This section of the example illustrates the steps needed to prepare a mini-batch of data inside the custom training loop.
Prepare an example mini-batch of data. Select a mini-batch of 32 documents from documents. This represents the mini-batch of data used in an iteration of a custom training loop.
miniBatchSize = 32;
idx = 1:miniBatchSize;
documentsBatch = documents(idx);
Convert the documents to sequences using the doc2sequence function and specify to right-pad the sequences with the word index corresponding to the padding token.
X = doc2sequence(enc,documentsBatch, ...
    PaddingDirection="right", ...
    PaddingValue=paddingIdx);
The output of the doc2sequence function is a cell array, where each element is a row vector of word indices. Because the encoder model function requires numeric input, concatenate the rows of the data using the cat function and specify to concatenate along the first dimension. The output has size miniBatchSize-by-sequenceLength, where sequenceLength is the length of the longest sequence in the mini-batch.
X = cat(1,X{:});
size(X)
ans = 1×2
32 14
Convert the data to a dlarray with format "BTC" (batch, time, channel). The software automatically rearranges the dimensions so that the output has format "CBT", which means the output has size 1-by-miniBatchSize-by-sequenceLength.
X = dlarray(X,"BTC");
size(X)
ans = 1×3
1 32 14
For masking, calculate the unpadded sequence lengths of the input data using the doclength function with the mini-batch of documents as input.
sequenceLengths = doclength(documentsBatch);
This code snippet shows an example of preparing a mini-batch in a custom training loop.
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Read mini-batch.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        documentsBatch = documents(idx);

        % Convert to sequences.
        X = doc2sequence(enc,documentsBatch, ...
            PaddingDirection="right", ...
            PaddingValue=paddingIdx);
        X = cat(1,X{:});

        % Convert to dlarray.
        X = dlarray(X,"BTC");

        % Calculate sequence lengths.
        sequenceLengths = doclength(documentsBatch);

        % Evaluate model gradients.
        % ...

        % Update learnable parameters.
        % ...

    end
end
Use Model Function in Model Loss Function
When training a deep learning model with a custom training loop, you must calculate the loss and the gradients of the loss with respect to the learnable parameters. This calculation depends on the output of a forward pass of the model function.
To perform a forward pass of the encoder, use the modelEncoder function directly with the parameters, data, and sequence lengths as input. The output is a latentDimension-by-miniBatchSize matrix.
Z = modelEncoder(parameters,X,sequenceLengths);
size(Z)
ans = 1×2
50 32
This code snippet shows an example of using the model encoder function inside the model loss function.
function [loss,gradients] = modelLoss(parameters,X,sequenceLengths)

Z = modelEncoder(parameters,X,sequenceLengths);

% Calculate loss.
% ...

% Calculate gradients.
% ...

end
This code snippet shows an example of evaluating the model gradients in a custom training loop.
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Prepare mini-batch.
        % ...

        % Evaluate model gradients.
        [loss,gradients] = dlfeval(@modelLoss,parameters,X,sequenceLengths);

        % Update learnable parameters.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration);

    end
end
Encoder Model Function
The modelEncoder function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vectors.
Because the input data contains padded sequences of different lengths, the padding can have adverse effects on loss calculations. For the LSTM operation, instead of returning the output of the last time step of the sequence (which likely corresponds to the LSTM state after processing lots of padding values), determine the actual last time step given by the sequenceLengths input.
function Z = modelEncoder(parameters,X,sequenceLengths)

% Embedding.
weights = parameters.emb.Weights;
Z = embed(X,weights);

% LSTM.
inputWeights = parameters.lstmEncoder.InputWeights;
recurrentWeights = parameters.lstmEncoder.RecurrentWeights;
bias = parameters.lstmEncoder.Bias;

numHiddenUnits = size(recurrentWeights,2);
hiddenState = zeros(numHiddenUnits,1,"like",X);
cellState = zeros(numHiddenUnits,1,"like",X);

Z1 = lstm(Z,hiddenState,cellState,inputWeights,recurrentWeights,bias);

% Output mode "last" with masking.
miniBatchSize = size(Z1,2);
Z = zeros(numHiddenUnits,miniBatchSize,"like",Z1);
Z = dlarray(Z,"CB");

for n = 1:miniBatchSize
    t = sequenceLengths(n);
    Z(:,n) = Z1(:,n,t);
end

% Fully connect.
weights = parameters.fcEncoder.Weights;
bias = parameters.fcEncoder.Bias;
Z = fullyconnect(Z,weights,bias);

end
Preprocessing Function
The function preprocessText performs these steps:
Prepends and appends each input string with the specified start and stop tokens, respectively.
Tokenizes the text using tokenizedDocument.
function documents = preprocessText(textData,startToken,stopToken)

% Add start and stop tokens.
textData = startToken + textData + stopToken;

% Tokenize the text.
documents = tokenizedDocument(textData,CustomTokens=[startToken stopToken]);

end
See Also
dlfeval | dlgradient | dlarray