Algorithm ViT-Base

Input: an image, the number of epochs J, the batch size b, the number of layers L

Output: Predicted class

Initialize model parameters Θ

for j ← 1, …, J do

    for each batch B do

        Use token_embed to obtain a global representation of the entire image

        for l ← 1, …, L do

            Use transformer_embed to pass the input tensors through a single transformer encoder layer

        end for

    end for

end for
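
The loop above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only the names token_embed and transformer_embed come from the algorithm, and every other helper (matmul, layer_norm, init_params, predict, run), the dictionary layout of Θ, and the toy sizes are hypothetical. The sketch assumes square grayscale images, single-head attention, and ReLU in place of GELU, and it uses dimensions far smaller than ViT-Base. Because the algorithm lists no loss or parameter update, the sketch performs forward passes only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols, scale=0.02):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, patch_dim, num_patches, d, num_layers, num_classes):
    """Build the parameter set Theta; the key names are hypothetical."""
    names = ("q", "k", "v", "o", "mlp1", "mlp2")
    return {
        "patch": rand_matrix(rng, patch_dim, d),
        "cls": rand_matrix(rng, 1, d),
        "pos": rand_matrix(rng, num_patches + 1, d),
        "layers": [{n: rand_matrix(rng, d, d) for n in names}
                   for _ in range(num_layers)],
        "head": rand_matrix(rng, d, num_classes),
    }

def token_embed(image, p, theta):
    """Cut the image into p x p patches, project them, prepend the class
    token (the global representation), and add position embeddings."""
    n = len(image)
    patches = [[image[r + i][c + j] for i in range(p) for j in range(p)]
               for r in range(0, n, p) for c in range(0, n, p)]
    tokens = theta["cls"] + matmul(patches, theta["patch"])
    return add(tokens, theta["pos"])

def transformer_embed(x, layer):
    """One pre-norm encoder layer: self-attention, then an MLP,
    each wrapped in a residual connection."""
    h = layer_norm(x)
    q, k, v = (matmul(h, layer[n]) for n in ("q", "k", "v"))
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    x = add(x, matmul(matmul(attn, v), layer["o"]))
    h = matmul(layer_norm(x), layer["mlp1"])
    h = [[max(0.0, u) for u in row] for row in h]  # ReLU stands in for GELU
    return add(x, matmul(h, layer["mlp2"]))

def predict(image, p, theta):
    x = token_embed(image, p, theta)
    for layer in theta["layers"]:          # for l = 1, ..., L
        x = transformer_embed(x, layer)
    # Classify from the class token, which is the first row.
    logits = matmul(layer_norm(x)[:1], theta["head"])[0]
    return max(range(len(logits)), key=logits.__getitem__)

def run(images, J, b, L, p=4, d=8, num_classes=3, seed=0):
    rng = random.Random(seed)
    n = len(images[0])
    theta = init_params(rng, p * p, (n // p) ** 2, d, L, num_classes)
    predictions = []
    for _ in range(J):                          # for j = 1, ..., J
        predictions = []
        for start in range(0, len(images), b):  # for each batch B
            for image in images[start:start + b]:
                predictions.append(predict(image, p, theta))
        # No loss or update step appears in the algorithm, so none is done here.
    return predictions
```

Calling run(images, J=2, b=2, L=2) on a list of 8 x 8 images returns one predicted class index per image. Since Θ is never updated in this sketch, every epoch produces the same predictions; a real training loop would compute a loss and adjust Θ after each batch.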