How can i make vision transformer model that recives input, multiple images

Question

수민 안 on 26 Dec 2023

1
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images

Edited: Debraj Maji on 27 Dec 2023

Is it possible to create or learn a deep learning model in Matlab that receives multiple images as input and has one sequence as output?

For example, I wonder how to receive 20 consecutive images as inputs and output a sequence such as '11153'.

Thanks for reading

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Debraj Maji on 27 Dec 2023

2
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images#answer_1378312

Edited: Debraj Maji on 27 Dec 2023

Hi @수민 안

I understand that you are trying to create a Vision Transformer(ViT) Model which takes multiple images as input and generates a sequence.

Creating a Vision Transformer (ViT) model that receives multiple images as input in MATLAB involves adapting the ViT architecture to handle sequences of images. The original ViT architecture is designed for single-image classification tasks. To modify it for sequential multi-image input, you would treat each image as a token in the sequence and process these tokens in a way similar to how transformers process sequential data in natural language processing (NLP).

Here is a conceptual outline of how you could approach this:

Step 1: Preprocessing:

Resize all images to a fixed size.
Flatten each image into a 1D vector or use patches as tokens, as done in ViT.
Optionally, add positional encoding to retain the order of the images.

Step 2: Transformer Encoder:

Use a series of transformer encoder layers to process the sequence of image tokens.
Each transformer encoder layer would include multi-head self-attention and feedforward neural networks.

Step 3: Sequence Decoder:

After processing the images through the transformer encoder, you need to decode the output into a sequence.
You can use an RNN, LSTM, or another transformer decoder to generate the output sequence.

Step 4: Output Layer:

The output layer would produce the final sequence, which could be a series of classification layers, one for each position in the output sequence.

In MATLAB, you can use Deep Learning Toolbox to create custom layers and models. Currently MATLAB does support a pre-defined ViT. However this scenario would require you to implement the transformer layers manually. You can follow this documentation for steps on how to define custom Deep Learning Layers:

https://www.mathworks.com/help/deeplearning/ug/define-custom-deep-learning-layer.html

For the Pretrained ViT available in MATLAB you can refer to the following documentation: https://www.mathworks.com/help/vision/ref/visiontransformer.html

For additional info on pre-defined Deep Learning Layers in MATLAB you can refer to the following link:

https://www.mathworks.com/help/deeplearning/ug/list-of-deep-learning-layers.html

I hope this resolves your query.

With regards,

Debraj.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Answer 2

Shubham on 27 Dec 2023

0
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/2064127-how-can-i-make-vision-transformer-model-that-recives-input-multiple-images#answer_1378252

Hi 수민 안,

Yes, it is possible to create a deep learning model in MATLAB that takes multiple images as input and outputs a sequence of numbers. This can be done by using a Convolutional Neural Network (CNN) for image feature extraction combined with a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network for sequence prediction.

Here's a high-level overview of how you might approach this:

Data Preparation:

Organize your images and corresponding sequence labels.
Preprocess the images (resizing, normalization, etc.).
Split the data into training, validation, and test sets.

2. Model Architecture:

Use a CNN as the feature extractor for the images. You can use pre-trained networks like VGG, ResNet, or create your own.
Flatten the output of the CNN or use global pooling to reduce the dimensionality.
Feed the output into an RNN or LSTM layer(s) to handle the sequence prediction.
The final output layer should have the number of units corresponding to the length of your output sequence with a softmax activation if you are treating each position as a classification problem.

3. Training:

Compile the model with an appropriate loss function (e.g., categorical cross-entropy if you're treating the sequence prediction as a classification problem).
Train the model using the training data with validation data to monitor performance.

4. Evaluation and Testing:

Evaluate the model's performance on the test set.
Adjust hyperparameters or model architecture as needed based on performance.

5. Prediction:

Use the trained model to predict sequences from new sets of images.

I hope this helps!

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

How can i make vision transformer model that recives input, multiple images

0 Comments
Show -2 older commentsHide -2 older comments

Answers (2)

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

How can i make vision transformer model that recives input, multiple images

0 Comments Show -2 older commentsHide -2 older comments

Answers (2)

0 Comments Show -2 older commentsHide -2 older comments

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments

0 Comments
Show -2 older commentsHide -2 older comments