Google DeepMind Researchers Propose Human-Centric Alignment for Vision Models to Boost AI Generalization and Interpretation
Deep learning has made significant strides in artificial intelligence, particularly in natural language processing and computer vision. However, even the most advanced systems often fail in ways that humans would not, highlighting a critical gap between artificial and human intelligence. This discrepancy has reignited debates about whether neural networks possess the essential components of human cognition. The challenge lies in developing systems that exhibit more human-like behavior, particularly regarding robustness and generalization. Unlike humans, who can adapt to environmental changes and generalize across diverse visual settings, AI models often falter when the data distribution shifts between training and test sets. This lack of robustness in visual representations poses significant challenges for downstream applications that require strong generalization capabilities.

Researchers from Google DeepMind; the Machine Learning Group at Technische Universität Berlin; BIFOLD (Berlin Institute for the Foundations of Learning and Data); the Max Planck Institute for Human Development; Anthropic; the Department of Artificial Intelligence at Korea University, Seoul; and the Max Planck Institute for Informatics propose a framework called AligNet to address the misalignment between human and machine visual representations. This approach simulates large-scale, human-like similarity judgment datasets for aligning neural network models with human perception. The methodology begins by using an affine transformation to align model representations with human semantic judgments in triplet odd-one-out tasks. This process incorporates uncertainty measures from human responses to improve model calibration. The aligned version of a state-of-the-art vision foundation model (VFM) then serves as a surrogate for generating human-like similarity judgments. By grouping representations into meaningful superordinate categories, the researchers sample semantically significant triplets and obtain odd-one-out responses from the surrogate model, resulting in a comprehensive dataset of human-like triplet judgments called AligNet.
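The triplet odd-one-out task at the core of this pipeline can be sketched in a few lines: given embeddings for three images, the model's "choice" is the item left out of the most similar pair. The toy vectors and the cosine-similarity decision rule below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def odd_one_out(a, b, c):
    """Return the index (0, 1, or 2) of the odd item: the one excluded
    from the most similar pair under cosine similarity."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = {(0, 1): cos(a, b), (0, 2): cos(a, c), (1, 2): cos(b, c)}
    i, j = max(sims, key=sims.get)        # most similar pair
    return ({0, 1, 2} - {i, j}).pop()     # remaining item is the odd one out

# Toy embeddings: "wolf" is a small perturbation of "dog"; "chair" is unrelated.
rng = np.random.default_rng(0)
dog = rng.normal(size=8)
wolf = dog + 0.1 * rng.normal(size=8)
chair = rng.normal(size=8)
```

Here `odd_one_out(dog, wolf, chair)` picks the chair, since the dog and wolf embeddings form the most similar pair.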

The results demonstrate significant improvements in aligning machine representations with human judgments across multiple levels of abstraction. For global coarse-grained semantics, soft alignment substantially enhanced model performance, with accuracies increasing from 36.09-57.38% to 65.70-68.56%, surpassing the human-to-human reliability score of 61.92%. In local fine-grained semantics, alignment improved moderately, with accuracies rising from 46.04-57.72% to 58.93-62.92%. For class-boundary triplets, AligNet fine-tuning achieved remarkable alignment, with accuracies reaching 93.09-94.24%, exceeding the human noise ceiling of 89.21%. The effectiveness of alignment varied across abstraction levels, with different models showing strengths in different areas. Notably, AligNet fine-tuning generalized well to other human similarity judgment datasets, demonstrating substantial improvements in alignment across various object similarity tasks, including multi-arrangement and Likert-scale pairwise similarity ratings.

The AligNet methodology comprises several key steps to align machine representations with human visual perception. Initially, it uses the THINGS triplet odd-one-out dataset to learn an affine transformation into a global human object similarity space. This transformation is applied to a teacher model’s representations, creating a similarity matrix for object pairs. The process incorporates uncertainty measures about human responses using an approximate Bayesian inference method, replacing hard alignment with soft alignment.

The objective function for learning this uncertainty-aware transformation combines soft alignment with a regularization term that preserves the local similarity structure. The transformed representations are then clustered into superordinate categories using k-means clustering. These clusters guide the generation of triplets from distinct ImageNet images, with odd-one-out choices determined by the surrogate teacher model.
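A minimal sketch of the triplet-generation step, assuming cluster labels have already been produced by k-means over the aligned representations: drawing two images from one superordinate cluster and one from another yields semantically meaningful triplets. The sampling scheme and function name are simplifications for illustration:

```python
import numpy as np

def sample_triplets(labels, n, rng):
    """Draw n triplets of image indices: two images from one superordinate
    cluster and one from a different cluster, so the intended odd-one-out
    is the third item. `labels` would come from k-means; this sampling
    scheme is a simplified sketch of AligNet's procedure."""
    labels = np.asarray(labels)
    clusters = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    in_ids = [c for c, idx in clusters.items() if len(idx) >= 2]
    all_ids = list(clusters)
    triplets = []
    while len(triplets) < n:
        c_in = rng.choice(in_ids)                       # cluster for the similar pair
        c_out = rng.choice([c for c in all_ids if c != c_in])
        i, j = rng.choice(clusters[c_in], size=2, replace=False)
        k = rng.choice(clusters[c_out])
        triplets.append((int(i), int(j), int(k)))
    return triplets
```

Sampling across cluster boundaries is what makes the triplets informative: a triplet drawn entirely at random would rarely contain a clear-cut semantic odd one out.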

Finally, a robust Kullback-Leibler divergence-based objective function facilitates the distillation of the teacher’s pairwise similarity structure into a student network. This AligNet objective is combined with regularization to preserve the pre-trained representation space, resulting in a fine-tuned student model that better aligns with human visual representations across multiple levels of abstraction.
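The distillation objective can be sketched as a row-wise KL divergence between softmax-normalized pairwise similarity matrices of teacher and student. This is a simplified stand-in for the paper's robust KL-based objective; the regularizer that preserves the student's pre-trained representation space is omitted:

```python
import numpy as np

def kl_distill_loss(S_teacher, S_student, tau=1.0):
    """Mean row-wise KL(teacher || student) between softmax-normalized
    pairwise similarity matrices. Simplified sketch of the AligNet
    distillation objective (regularization term omitted)."""
    def row_softmax(S):
        Z = np.array(S, dtype=float) / tau   # copy, then temperature-scale
        Z -= Z.max(axis=1, keepdims=True)    # numerical stability
        P = np.exp(Z)
        return P / P.sum(axis=1, keepdims=True)
    P, Q = row_softmax(S_teacher), row_softmax(S_student)
    return float(np.mean(np.sum(P * (np.log(P) - np.log(Q)), axis=1)))
```

The loss is zero when the student reproduces the teacher's similarity structure exactly and grows as the two structures diverge, which is what drives the student toward the human-aligned teacher.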

This study addresses a critical deficiency in vision foundation models: their inability to adequately represent the multi-level conceptual structure of human semantic knowledge. By developing the AligNet framework, which aligns deep learning models with human similarity judgments, the research demonstrates significant improvements in model performance across various cognitive and machine learning tasks. The findings contribute to the ongoing debate about neural networks’ capacity to capture human-like intelligence, particularly in relational understanding and hierarchical knowledge organization. Ultimately, this work illustrates how representational alignment can enhance model generalization and robustness, bridging the gap between artificial and human visual perception.


Check out the Paper. All credit for this research goes to the researchers of this project.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.
