How can I make a vision transformer model that receives multiple images as input?

Is it possible to create and train a deep learning model in MATLAB that receives multiple images as input and produces one sequence as output?
For example, I wonder how to receive 20 consecutive images as inputs and output a sequence such as '11153'.
Thanks for reading

Answers (2)

Debraj Maji on 27 Dec 2023
Edited: Debraj Maji on 27 Dec 2023
I understand that you are trying to create a Vision Transformer (ViT) model that takes multiple images as input and generates a sequence.
Creating a Vision Transformer (ViT) model that receives multiple images as input in MATLAB involves adapting the ViT architecture to handle sequences of images. The original ViT architecture is designed for single-image classification tasks. To modify it for sequential multi-image input, you would treat each image as a token in the sequence and process these tokens in a way similar to how transformers process sequential data in natural language processing (NLP).
Here is a conceptual outline of how you could approach this:
Step 1: Preprocessing:
  • Resize all images to a fixed size.
  • Flatten each image into a 1D vector or use patches as tokens, as done in ViT.
  • Optionally, add positional encoding to retain the order of the images.
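As a minimal sketch of the "patches as tokens" option above, assume 20 frames that have already been resized to 64-by-64 RGB and a patch size of 16. These sizes and all variable names are illustrative assumptions, not values from the question. The frames can then be cut into flattened patches with base MATLAB only:

```matlab
% Assumed sizes (illustrative only): 20 RGB frames of 64x64, 16x16 patches.
numFrames = 20; H = 64; W = 64; C = 3; P = 16;
frames = rand(H, W, C, numFrames, "single");   % stand-in for real, resized images

% Number of patches along each image dimension (H and W must be multiples of P).
nH = H / P;
nW = W / P;

% Split rows into (P, nH) and columns into (P, nW), then group each patch together.
X = reshape(frames, P, nH, P, nW, C, numFrames);
X = permute(X, [1 3 5 2 4 6]);                 % P x P x C x nH x nW x numFrames

% One flattened patch per column: features x patches x frames.
tokens = reshape(X, P*P*C, nH*nW, numFrames);
```

Each column of `tokens(:,:,k)` is one flattened patch of frame `k`. In a ViT-style model a learned linear projection would normally map these columns to the embedding dimension before the encoder.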
Step 2: Transformer Encoder:
  • Use a series of transformer encoder layers to process the sequence of image tokens.
  • Each transformer encoder layer would include multi-head self-attention and feedforward neural networks.
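The encoder step could be sketched roughly as follows. This assumes Deep Learning Toolbox R2023b or later, where `selfAttentionLayer` and `positionEmbeddingLayer` are available, and it assumes each frame has already been reduced to one 256-dimensional token. The sizes and layer names are illustrative, and the block is deliberately simplified: it has a single encoder block and omits the residual connections of a full transformer encoder.

```matlab
% Assumed sizes (illustrative only): one 256-dimensional token per frame.
embedDim  = 256;   % token (feature) dimension
numFrames = 20;    % maximum sequence length
numHeads  = 4;     % embedDim must be divisible by numHeads

% Main branch: tokens -> (+ position embedding) -> self-attention -> feedforward.
layers = [
    sequenceInputLayer(embedDim, Name="tokens")
    additionLayer(2, Name="add")
    selfAttentionLayer(numHeads, embedDim, Name="attention")
    layerNormalizationLayer(Name="norm1")
    fullyConnectedLayer(4*embedDim, Name="fc1")
    geluLayer(Name="gelu")
    fullyConnectedLayer(embedDim, Name="fc2")
    layerNormalizationLayer(Name="norm2")];
lgraph = layerGraph(layers);

% Side branch: learned position embedding, added to the tokens at "add/in2".
lgraph = addLayers(lgraph, positionEmbeddingLayer(embedDim, numFrames, Name="pos"));
lgraph = connectLayers(lgraph, "tokens", "pos");
lgraph = connectLayers(lgraph, "pos", "add/in2");

net = dlnetwork(lgraph);   % initializes the learnable parameters
```

The position embedding is what lets the network tell frame 1 from frame 20. Without it, self-attention treats the frames as an unordered set.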
Step 3: Sequence Decoder:
  • After processing the images through the transformer encoder, you need to decode the output into a sequence.
  • You can use an RNN, LSTM, or another transformer decoder to generate the output sequence.
Step 4: Output Layer:
  • The output layer would produce the final sequence, which could be a series of classification layers, one for each position in the output sequence.
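To make the "one classification per output position" idea concrete, here is a small sketch in base MATLAB. It assumes the outputs are digits 0 to 9 and that the network emits `numClasses*outLen` raw scores per input. The random `scores` vector is only a stand-in for a real network output, and all names are hypothetical.

```matlab
% Target sequence from the question, encoded as one class per position.
label      = '11153';
numClasses = 10;                         % digits 0-9 (assumption)
outLen     = numel(label);
digitIdx   = label - '0';                % numeric digits, one per position

% One-hot targets: numClasses x outLen (row d+1 corresponds to digit d).
T = zeros(numClasses, outLen);
T(sub2ind(size(T), digitIdx + 1, 1:outLen)) = 1;

% Stand-in for the raw network output (numClasses*outLen scores).
scores = randn(numClasses*outLen, 1);
scores = reshape(scores, numClasses, outLen);

% Softmax over the class dimension, independently for each position.
probs = exp(scores - max(scores, [], 1));
probs = probs ./ sum(probs, 1);

% Cross-entropy averaged over positions, and the decoded prediction.
loss = -mean(sum(T .* log(probs), 1));
[~, idx]  = max(probs, [], 1);
predicted = char(idx - 1 + '0');         % back to a character sequence
```

This fixed-length scheme only works if every output sequence has the same length. Variable-length outputs would need a decoder, as described in Step 3.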
In MATLAB, you can use Deep Learning Toolbox to create custom layers and models. MATLAB currently does provide a pretrained ViT. However, that model is designed for single-image input, so this scenario would require you to build the network from individual layers yourself. For steps on how to define custom deep learning layers, refer to the Deep Learning Toolbox documentation on custom layers.
For the pretrained ViT available in MATLAB, refer to its documentation page.
For additional information on the predefined deep learning layers in MATLAB, refer to the list of deep learning layers in the documentation.
I hope this resolves your query.
With regards,

Shubham on 27 Dec 2023
Yes, it is possible to create a deep learning model in MATLAB that takes multiple images as input and outputs a sequence of numbers. This can be done by using a Convolutional Neural Network (CNN) for image feature extraction combined with a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network for sequence prediction.
Here's a high-level overview of how you might approach this:
1. Data Preparation:
  • Organize your images and corresponding sequence labels.
  • Preprocess the images (resizing, normalization, etc.).
  • Split the data into training, validation, and test sets.
2. Model Architecture:
  • Use a CNN as the feature extractor for the images. You can use pre-trained networks like VGG, ResNet, or create your own.
  • Flatten the output of the CNN or use global pooling to reduce the dimensionality.
  • Feed the output into an RNN or LSTM layer(s) to handle the sequence prediction.
  • The final output layer should produce one set of class scores for each position of the output sequence (for example, 10 units per position for the digits 0 to 9), with a softmax activation over the classes if you are treating each position as a classification problem.
3. Training:
  • Choose an appropriate loss function (e.g., categorical cross-entropy if you're treating the sequence prediction as a classification problem). MATLAB has no separate compile step; the loss is defined by the output layer or passed to the training function.
  • Train the model using the training data with validation data to monitor performance.
4. Evaluation and Testing:
  • Evaluate the model's performance on the test set.
  • Adjust hyperparameters or model architecture as needed based on performance.
5. Prediction:
  • Use the trained model to predict sequences from new sets of images.
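The architecture in step 2 could be sketched in MATLAB roughly as follows. It uses the sequence folding pattern from Deep Learning Toolbox so that the same small CNN is applied to every frame before the LSTM. The frame size, filter counts, hidden size, and the choice of 5 output positions with 10 classes each are assumptions made for illustration. A pretrained network such as ResNet could replace the small CNN.

```matlab
% Assumed sizes (illustrative only).
inputSize  = [64 64 3];   % height, width, channels of each frame
numClasses = 10;          % digits 0-9
outLen     = 5;           % length of the output sequence, e.g. '11153'

layers = [
    sequenceInputLayer(inputSize, Name="input")        % sequence of images
    sequenceFoldingLayer(Name="fold")                  % treat frames as a batch
    convolution2dLayer(3, 16, Padding="same", Name="conv")
    batchNormalizationLayer(Name="bn")
    reluLayer(Name="relu")
    maxPooling2dLayer(2, Stride=2, Name="pool")
    sequenceUnfoldingLayer(Name="unfold")              % restore the time dimension
    flattenLayer(Name="flatten")                       % one feature vector per frame
    lstmLayer(128, OutputMode="last", Name="lstm")     % summarize all frames
    fullyConnectedLayer(numClasses*outLen, Name="fc")]; % scores for every position

lgraph = layerGraph(layers);

% The unfolding layer needs the mini-batch size recorded by the folding layer.
lgraph = connectLayers(lgraph, "fold/miniBatchSize", "unfold/miniBatchSize");
```

The final layer returns `numClasses*outLen` scores, which would be reshaped to `numClasses`-by-`outLen` and passed through a softmax per position. Because that is several classification outputs rather than one, training would likely need a custom loss, for example in a custom training loop, instead of a single `classificationLayer`.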
I hope this helps!



