VOLTRON

1). VOLTRON combines the strengths of a pretrained LLM (LLaMA-2) and a vision object-detection model (YOLOv8-n).

2). The two models are integrated into a unified pipeline: detected object probabilities are converted into sentences, passed through a single linear layer, and transformed into embeddings with the LLaMA architecture, reducing complexity by avoiding an additional transformer-based connector (as sketched below).
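
As an illustration, a minimal PyTorch sketch of this style of pipeline is given below. It is not VOLTRON's actual implementation: the helper names (detections_to_sentence, DetectionProjector), the use of (label, confidence) pairs and per-object class-probability vectors as detector output, and the assumed LLaMA-2 embedding size of 4096 are illustrative assumptions.

```python
# Minimal sketch of a VOLTRON-style pipeline (all names and sizes are assumptions):
# YOLOv8-n detections -> short sentence -> single linear projection -> LLM embedding space.
import torch
import torch.nn as nn

EMBED_DIM = 4096  # assumed LLaMA-2-7B hidden size


def detections_to_sentence(detections):
    """Turn (label, confidence) pairs from the detector into a plain sentence."""
    parts = [f"a {label} with confidence {conf:.2f}" for label, conf in detections]
    return "The image contains " + ", ".join(parts) + "."


class DetectionProjector(nn.Module):
    """A single linear layer mapping per-object class probabilities into the LLM embedding space."""

    def __init__(self, num_classes: int, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(num_classes, embed_dim)

    def forward(self, class_probs: torch.Tensor) -> torch.Tensor:
        # class_probs: (num_objects, num_classes) -> (num_objects, embed_dim)
        return self.proj(class_probs)


# Dummy detector output: 3 objects over the 80 COCO classes.
detections = [("dog", 0.91), ("frisbee", 0.78), ("person", 0.65)]
print(detections_to_sentence(detections))

projector = DetectionProjector(num_classes=80)
object_embeddings = projector(torch.rand(3, 80))  # (3, 4096), ready to feed alongside LLaMA-2 token embeddings
print(object_embeddings.shape)
```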

LLaVA

1). LLaVA’s architecture combines LLaMA for language tasks with the pre-trained CLIP visual encoder ViT-L/14 for visual understanding, enhancing multimodal interaction. LLaMA is fine-tuned on machine-generated instruction-following data, while a projection layer maps the CLIP visual features into the language embedding space, bridging the gap between text and images (see the sketch after these points).

2). Emphasis on generating instruction-following responses.
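
The connector idea can be sketched roughly in PyTorch with Hugging Face Transformers. This is not LLaVA's actual code: it assumes the openai/clip-vit-large-patch14 checkpoint, a plain linear projection, and a LLaMA hidden size of 4096.

```python
# Rough sketch of a LLaVA-style visual connector: frozen CLIP ViT-L/14 patch features are
# projected into the LLM token-embedding space (dimensions and checkpoint are assumptions).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision_tower.requires_grad_(False)       # the visual encoder stays frozen

LLM_HIDDEN = 4096                        # assumed LLaMA-7B hidden size
projector = nn.Linear(vision_tower.config.hidden_size, LLM_HIDDEN)

image = Image.new("RGB", (224, 224))     # placeholder image for the example
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    patch_features = vision_tower(pixel_values).last_hidden_state  # (1, num_patches + 1, 1024)

visual_tokens = projector(patch_features)                          # (1, num_patches + 1, 4096)
# These visual tokens are concatenated with the embedded instruction text and fed to LLaMA,
# which is fine-tuned on machine-generated instruction-following data.
print(visual_tokens.shape)
```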

BLIP-2

1). BLIP-2 combines frozen pre-trained image encoders and frozen language models to achieve strong performance on a range of vision-language tasks. To bridge the modality gap, it employs a Q-Former pre-trained in two stages: representation learning and generative learning. The Q-Former extracts a fixed number of output features from the image encoder, regardless of the input image resolution (see the sketch after these points).

2). Focus on describing the image, prioritizing general image understanding over following specific user instructions.
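
The fixed-query behaviour can be illustrated with a toy cross-attention block. The real Q-Former is a BERT-style transformer pre-trained in the two stages described above; the TinyQFormer below (32 learned queries, 768-dimensional features) is only a conceptual sketch of why the output size stays constant.

```python
# Toy illustration of the Q-Former idea in BLIP-2: a fixed set of learned query tokens
# cross-attends to frozen image-encoder features, so the number of output features is
# constant regardless of how many patch features the image produces.
import torch
import torch.nn as nn


class TinyQFormer(nn.Module):
    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))  # learned query tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, dim) from a frozen image encoder
        q = self.queries.expand(image_features.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_features, image_features)
        return self.ffn(attended)                                      # (batch, num_queries, dim)


qformer = TinyQFormer()
for num_patches in (197, 577):                  # e.g. ViT features at two different resolutions
    feats = torch.randn(2, num_patches, 768)
    print(num_patches, "->", tuple(qformer(feats).shape))   # always (2, 32, 768)
```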