| VOLTRON | LLaVA | BLIP-2 |
| --- | --- | --- |
| 1) VOLTRON combines the strengths of a pretrained LLM (LLaMA-2) and a vision object-detection model (YOLOv8-n). 2) The two models are integrated into a unified pipeline by converting object probabilities into sentences, passing them through a single, simple linear layer, and transforming them into embeddings with the LLaMA architecture, reducing complexity without the use of additional transformer modules (see the pipeline sketch after this table). | 1) LLaVA combines LLaMA for language tasks with the pre-trained CLIP visual encoder ViT-L/14 for visual understanding, enhancing multimodal interaction, and fine-tunes LLaMA on machine-generated instruction-following data. A trainable projection layer connects the visual features to the language embedding space, bridging the gap between text and images. 2) Emphasis on generating instruction-oriented responses. | 1) BLIP-2 combines frozen pre-trained image models and language models and achieves strong performance on various vision-language tasks. To bridge the modality gap, it employs a Q-Former pre-trained in two stages (representation learning and generative learning) that extracts a fixed number of output features from the image encoder regardless of the input image resolution (see the Q-Former sketch after this table). 2) Focus on describing the image, prioritizing image understanding over specific user-generated instructions. |
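A minimal sketch of the VOLTRON-style pipeline described in the table, written in PyTorch. The names `detections_to_sentence` and `ProbabilityProjector`, the 80-class probability vector, and the 4096-dimensional LLaMA-2 embedding size are illustrative assumptions, not the authors' implementation; the point is only that detector outputs are verbalized and a single linear layer maps them into the LLM embedding space.

```python
# Sketch only: assumed shapes and names, not the VOLTRON codebase.
import torch
import torch.nn as nn

LLM_EMBED_DIM = 4096      # LLaMA-2 7B hidden size (assumption)
DETECTOR_FEAT_DIM = 80    # e.g. per-class probabilities from YOLOv8-n (COCO classes)

def detections_to_sentence(detections):
    """Turn (label, confidence) pairs into a plain-text description."""
    parts = [f"{label} ({conf:.2f})" for label, conf in detections]
    return "The image contains: " + ", ".join(parts) + "."

class ProbabilityProjector(nn.Module):
    """Single linear layer mapping class-probability vectors to LLM-sized embeddings."""
    def __init__(self, in_dim=DETECTOR_FEAT_DIM, out_dim=LLM_EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, class_probs):          # (batch, num_boxes, in_dim)
        return self.proj(class_probs)        # (batch, num_boxes, out_dim)

# Usage with dummy detector output
detections = [("dog", 0.91), ("frisbee", 0.78)]
print(detections_to_sentence(detections))

probs = torch.rand(1, 2, DETECTOR_FEAT_DIM)    # fake YOLO class probabilities
vision_tokens = ProbabilityProjector()(probs)  # ready to combine with LLaMA token embeddings
print(vision_tokens.shape)                     # torch.Size([1, 2, 4096])
```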
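Likewise, a hedged sketch of the Q-Former property mentioned in the BLIP-2 column: a fixed set of learnable queries cross-attends to frozen image-encoder features, so the number of output vectors stays constant however many patch tokens the image produces. The `QueryExtractor` module, its single cross-attention layer, and the 32-query, 768-dimensional shapes are assumptions for illustration, not the actual BLIP-2 implementation.

```python
# Sketch only: a fixed number of learnable queries attending to image features.
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_heads=12):
        super().__init__()
        # Learnable query tokens; their count fixes the output length.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):                     # (batch, num_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out                                      # always (batch, num_queries, dim)

extractor = QueryExtractor()
small = torch.randn(1, 257, 768)    # e.g. low-resolution image -> 257 patch tokens
large = torch.randn(1, 1025, 768)   # higher-resolution image -> more patch tokens
print(extractor(small).shape, extractor(large).shape)   # both (1, 32, 768)
```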