| Algorithm ViT-Base |
| Input: an image, the number of epochs J, the batch size b, the number of layers L |
| Output: Predicted class |
| Initialize model parameters Θ |
| for j ← 1, …, J do |
| for each batch B do |
| Use token_embed to obtain a global representation of the entire image |
| for l ← 1, …, L do |
| Use transformer_embed to pass the input tensors through a single transformer encoder layer |
| end for |
| end for |
| end for |
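
The loop structure of the algorithm can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation behind the algorithm: only the names `token_embed` and `transformer_embed` come from the listing, while their internals (patch projection, a prepended class token carrying the global representation, single-head attention, ReLU instead of GELU), the helpers `init_params`, `layer_norm`, `predict`, and `run`, and all sizes are hypothetical. The listing does not state a loss or a parameter update, so the sketch performs the forward pass only.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(patch, channels, num_patches, dim=16, num_layers=2, num_classes=5):
    """Initialize Θ with small random values (all sizes are illustrative)."""
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)
    return {
        "proj": w(patch * patch * channels, dim),   # patch -> token projection
        "cls": w(1, 1, dim),                        # assumed class token
        "pos": w(1, num_patches + 1, dim),          # position embeddings
        "head": w(dim, num_classes),                # classification head
        "layers": [
            {"wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
             "wo": w(dim, dim), "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim)}
            for _ in range(num_layers)              # L encoder layers
        ],
    }

def layer_norm(x, eps=1e-6):
    """Normalize the last axis (no learned scale or shift, for brevity)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def token_embed(images, theta, patch):
    """Split (b, H, W, C) images into patches, project them, prepend the class token."""
    b, H, W, C = images.shape
    x = images.reshape(b, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(b, -1, patch * patch * C)
    tokens = x @ theta["proj"]
    cls = np.broadcast_to(theta["cls"], (b, 1, tokens.shape[-1]))
    return np.concatenate([cls, tokens], axis=1) + theta["pos"]

def transformer_embed(x, layer):
    """One pre-norm encoder layer: single-head self-attention, then an MLP."""
    h = layer_norm(x)
    q, k, v = h @ layer["wq"], h @ layer["wk"], h @ layer["wv"]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(-1, keepdims=True)
    x = x + (attn @ v) @ layer["wo"]                 # residual around attention
    h = layer_norm(x)
    return x + np.maximum(h @ layer["w1"], 0.0) @ layer["w2"]  # residual around MLP

def predict(images, theta, patch):
    """Forward pass: token_embed, L encoder layers, then classify the class token."""
    x = token_embed(images, theta, patch)
    for layer in theta["layers"]:                    # for l <- 1, ..., L
        x = transformer_embed(x, layer)
    logits = layer_norm(x[:, 0]) @ theta["head"]
    return logits.argmax(-1)

def run(images, J=1, b=2, patch=4):
    """Mirror the epoch/batch loops of the algorithm (forward pass only)."""
    n_patches = (images.shape[1] // patch) * (images.shape[2] // patch)
    theta = init_params(patch, images.shape[-1], n_patches)
    preds = None
    for _ in range(J):                               # for j <- 1, ..., J
        batch_preds = []
        for start in range(0, len(images), b):       # for each batch B
            batch_preds.append(predict(images[start:start + b], theta, patch))
            # A loss and parameter update would go here; the listing does not specify them.
        preds = np.concatenate(batch_preds)
    return preds
```

Calling `run(rng.random((5, 8, 8, 3)), J=2, b=2)` returns one predicted class index per image. With untrained random parameters the predictions are arbitrary; the sketch only shows how the three nested loops and the two named steps fit together.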