LLaVA-Critic: An Open-Source Large Multimodal Model Designed to Assess Model Performance Across Diverse Multimodal Tasks
Learning to evaluate is playing an increasingly pivotal role in the development of modern large multimodal models (LMMs). As pre-training on existing web data reaches its limits, researchers are shifting towards post-training with AI-enhanced synthetic data, a transition that makes reliable evaluation even more important. Reliable AI evaluation can reduce the human labor required for complex task assessments, generate effective reward signals in reinforcement learning, and guide inference-time search. Yet despite progress in single-image, multi-image, and video scenarios, open LMMs capable of evaluating the performance of other multimodal models remain a gap in the field.

Existing attempts to address the challenge of AI evaluation have primarily focused on using proprietary LMMs like GPT-4V as generalist evaluators for vision-language tasks. These models have been used in evaluation benchmarks for complex scenarios such as visual chat and detailed captioning, while open-source alternatives like Prometheus-Vision have emerged as evaluators for specific user-designed scoring criteria. In preference learning for LMMs, techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align models with human intentions. Recent research has extended these ideas to the multimodal space, exploring various strategies to improve visual chat abilities and reduce hallucinations in vision-language models.

Researchers from ByteDance and the University of Maryland, College Park have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks. The approach centers on curating instruction-following data tailored for evaluation and addresses two primary scenarios: serving as an LMM-as-a-Judge and facilitating preference learning. In the first scenario, it aims to provide reliable evaluation scores comparable to proprietary models such as GPT-4V, offering a free alternative for a range of evaluation benchmarks. In the second, it offers a scalable way to generate effective reward signals, reducing dependence on costly human feedback collection. LLaVA-Critic shows a high correlation with commercial GPT models on evaluation tasks and superior performance in preference learning.
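To make the second use case concrete, the sketch below shows how a critic's pairwise verdicts could be turned into chosen/rejected pairs for DPO-style preference learning. This is a minimal illustration, not the authors' pipeline: the judge_pairwise callable is a placeholder for whatever inference wrapper serves the critic model.

```python
# Minimal sketch (not the authors' pipeline): convert a critic's pairwise
# verdict into a chosen/rejected record usable for DPO-style training.
# judge_pairwise is a hypothetical stand-in for the critic inference call.
from typing import Callable, Optional

def build_preference_pair(instruction: str, response_a: str, response_b: str,
                          judge_pairwise: Callable[[str, str, str], str]) -> Optional[dict]:
    """Return a DPO-style record {prompt, chosen, rejected}, or None on a tie."""
    verdict = judge_pairwise(instruction, response_a, response_b)  # expected: "A", "B", or "Tie"
    if verdict == "A":
        return {"prompt": instruction, "chosen": response_a, "rejected": response_b}
    if verdict == "B":
        return {"prompt": instruction, "chosen": response_b, "rejected": response_a}
    return None  # ties carry no preference signal, so they are discarded
```

Discarding ties keeps only pairs where the critic expresses a clear preference, which is what a DPO-style objective needs.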

LLaVA-Critic is developed by fine-tuning a pre-trained LMM that already follows diverse instructions, ensuring it can handle a broad range of vision tasks. Training uses an evaluation prompt that combines the multimodal instruction input, the model response(s), and an optional reference response. LLaVA-Critic is trained to predict quantitative pointwise scores or pairwise rankings according to the specified criteria and to provide detailed justifications for its judgments, with standard cross-entropy loss applied to both the judgments and the justifications. The researchers start from the LLaVA-OneVision (OV) 7B/72B pre-trained checkpoints and fine-tune them on the LLaVA-Critic-113k dataset for one epoch.
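The snippet below is a minimal sketch of what such evaluation prompts might look like for the pointwise and pairwise settings, based on the description above. The templates and function names are illustrative assumptions, not the prompt format released with the paper.

```python
# Illustrative sketch of assembling critic evaluation prompts: multimodal
# instruction, candidate response(s), optional reference, and scoring criteria.
# Templates and names are assumptions, not the paper's released prompts.
from typing import Optional

def build_pointwise_prompt(instruction: str, response: str,
                           reference: Optional[str] = None,
                           criteria: str = "helpfulness, accuracy, and level of detail") -> str:
    """Ask the critic to score a single response and justify the score."""
    prompt = (
        "You are an impartial judge of a multimodal assistant.\n"
        f"[Instruction]\n{instruction}\n\n"
        f"[Assistant Response]\n{response}\n\n"
    )
    if reference is not None:
        prompt += f"[Reference Response]\n{reference}\n\n"
    prompt += (f"Rate the response on {criteria} with an integer score from 1 to 10, "
               "then give a brief justification.")
    return prompt

def build_pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Ask the critic to rank two candidate responses and justify the ranking."""
    return (
        "You are an impartial judge of a multimodal assistant.\n"
        f"[Instruction]\n{instruction}\n\n"
        f"[Response A]\n{response_a}\n\n[Response B]\n{response_b}\n\n"
        "State which response is better (A, B, or Tie) and justify the choice."
    )
```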

The results demonstrate significant improvements in both the pointwise scoring and pairwise ranking capabilities of LLaVA-Critic compared to baseline models. LLaVA-Critic-72B achieves the highest average Pearson-r (0.754) and Kendall's Tau (0.933) in pointwise scoring, outperforming the baseline LLaVA-OV-72B. In pairwise ranking, LLaVA-Critic-72B outperforms GPT-4o and GPT-4V in comparisons without ties, achieving 73.6% accuracy. Compared against both commercial models and other open-source LMMs in the MLLM-as-a-Judge scenario, LLaVA-Critic-7B outperforms most baselines. These results highlight the effectiveness of LLaVA-Critic as an open-source alternative for multimodal model evaluation.
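For reference, agreement statistics like those above can be computed as the Pearson-r and Kendall's tau between the critic's pointwise scores and a reference judge's scores (e.g., GPT-4o). The sketch below uses scipy.stats with hypothetical score lists; it is an assumption about tooling, not the authors' evaluation script.

```python
# Hedged sketch: correlation between critic scores and a reference judge's
# scores, using scipy.stats. Score lists here are hypothetical examples.
from scipy.stats import pearsonr, kendalltau

critic_scores = [7, 4, 9, 6, 8]      # hypothetical scores from the critic model
reference_scores = [8, 3, 9, 5, 8]   # hypothetical scores from the reference judge

pearson_r, _ = pearsonr(critic_scores, reference_scores)
tau, _ = kendalltau(critic_scores, reference_scores)
print(f"Pearson-r: {pearson_r:.3f}, Kendall's tau: {tau:.3f}")
```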

In conclusion, researchers have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks, built with a high-quality, diverse instruction-following dataset. The model excels in two critical areas. First, as a generalist evaluator, LLaVA-Critic shows remarkable alignment with human and GPT-4o preferences across various evaluation tasks, offering a viable open-source alternative to commercial models. Second, in preference learning scenarios, LLaVA-Critic functions as a reliable reward model, outperforming human feedback-based approaches in enhancing the visual chat capabilities of LMMs. This research is a key step toward building self-critiquing capabilities in open-source LMMs, enabling future advances in scalable, superhuman AI alignment feedback.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.





