Enhancing Text Retrieval: Overcoming the Limitations with Contextual Document Embeddings

By admin | Updated 9 months ago | 3 min read | 57 views

Text retrieval in machine learning faces significant challenges in developing effective methods for indexing and retrieving documents. Traditional approaches relied on sparse lexical matching methods such as BM25, which score documents using n-gram frequencies. These statistical models, however, are limited in their ability to capture semantic relationships and context. The primary neural method, a dual encoder architecture, encodes documents and queries into a dense latent space for retrieval. In doing so, it loses the ability to easily use prior corpus statistics such as inverse document frequency (IDF). This limitation makes neural models less adaptable to specific retrieval domains, because they lack the context dependence of statistical models.
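
To make the role of corpus statistics concrete, here is a minimal sketch that computes a smoothed IDF (the variant commonly paired with BM25) over two tiny corpora. The corpora and the function name `inverse_document_frequency` are made up for illustration and do not come from the paper. The same term gets a very different weight depending on the corpus, which is the kind of signal a fixed dual encoder does not see.

```python
import math
from collections import Counter


def inverse_document_frequency(corpus):
    """Return {term: idf} for a corpus given as a list of token lists."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each term at most once per document.
        doc_freq.update(set(doc))
    # Smoothed IDF: log(1 + (N - df + 0.5) / (df + 0.5)).
    return {
        term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }


# Hypothetical toy corpora: a general one and a domain-specific one.
general_corpus = [
    ["stock", "market", "falls"],
    ["football", "match", "tonight"],
    ["protein", "rich", "diet"],
]
biology_corpus = [
    ["protein", "binding", "site"],
    ["protein", "folding", "error"],
    ["protein", "kinase", "inhibitor"],
]

idf_general = inverse_document_frequency(general_corpus)
idf_biology = inverse_document_frequency(biology_corpus)

# "protein" is rare (informative) in the general corpus
# but appears in every document of the biology corpus.
print(idf_general["protein"], idf_biology["protein"])
```

In the biology corpus the word "protein" carries almost no discriminative weight, while in the general corpus it is a strong signal. A statistical retriever picks this up automatically from the index; a context-independent embedding model does not.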

Researchers have made various attempts to address the challenges in text retrieval. Biencoder text embedding models like DPR, GTR, Contriever, LaPraDoR, Instructor, Nomic-Embed, E5, and GTE have been developed to improve retrieval performance. Some efforts have focused on adapting these models to new corpora at test time, proposing solutions such as unsupervised span-sampling, training on test corpora, and distillation from re-rankers. Other approaches include clustering queries before training and treating contrastive batch sampling as a global optimization problem. Test-time adaptation techniques like pseudo-relevance feedback have also been explored, where top-retrieved documents, assumed to be relevant, are used to enhance the query representation.

Researchers from Cornell University have proposed an approach to address the limitations of current text retrieval models. They argue that existing document embeddings lack context for targeted retrieval use cases and suggest that a document embedding should take into account both the document itself and its neighboring documents. They develop two complementary methods for creating such contextualized document embeddings. The first introduces an alternative contrastive learning objective that explicitly incorporates document neighbors into the intra-batch contextual loss. The second presents a new contextual architecture that directly encodes neighboring document information into the representation.
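
The effect of putting neighbors into the same batch can be illustrated with the standard in-batch softmax contrastive loss (often called InfoNCE). This is a minimal sketch under assumptions: the loss below is the generic formulation, used here as a stand-in since the article does not spell out the paper's exact objective, and the 2-dimensional vectors, the temperature value, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(queries, docs, temperature=0.1):
    """Mean in-batch contrastive loss.

    docs[i] is the positive for queries[i]; every other document
    in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, doc) / temperature for doc in docs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(queries)


# Batch of unrelated documents: negatives are far from each query.
easy_docs = [[1.0, 0.0], [0.0, 1.0]]
easy_loss = in_batch_contrastive_loss(easy_docs, easy_docs)

# Batch of neighboring documents: negatives are close to each query.
hard_docs = [[1.0, 0.0], [0.8, 0.6]]
hard_loss = in_batch_contrastive_loss(hard_docs, hard_docs)

print(easy_loss, hard_loss)
```

With unrelated documents the loss is already close to zero, so the batch gives the model little to learn from. When the batch consists of neighbors, the negatives compete with the positive and the loss is noticeably larger.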

The proposed method uses a two-phase training approach: a large weakly supervised pre-training phase followed by a short supervised phase. Initial experiments use a small setting with a six-layer transformer, a maximum sequence length of 64, and up to 64 additional contextual tokens. This setting is evaluated on a truncated version of the BEIR benchmark with various batch and cluster sizes. In the large setting, a single model is trained on sequences of length 512 with 512 contextual documents and evaluated on the full MTEB benchmark. The training data includes 200M weakly supervised data points from internet sources and 1.8M human-written query-document pairs from retrieval datasets. The model uses NomicBERT as its backbone, with 137M parameters.
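
For readers who want the two settings side by side, the numbers above can be collected into a small configuration sketch. The class and field names are hypothetical and chosen only for illustration; values the article does not state (such as the layer count of the large setting) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentSetting:
    """One experimental setting as described in the article (hypothetical schema)."""
    name: str
    max_sequence_length: int
    context_size: int
    context_unit: str                 # "tokens" or "documents", as reported
    benchmark: str
    num_layers: Optional[int] = None  # None where the article gives no value


SMALL = ExperimentSetting(
    name="small",
    max_sequence_length=64,
    context_size=64,
    context_unit="tokens",
    benchmark="BEIR (truncated)",
    num_layers=6,
)

LARGE = ExperimentSetting(
    name="large",
    max_sequence_length=512,
    context_size=512,
    context_unit="documents",
    benchmark="MTEB (full)",
)

# Training data and backbone sizes reported in the article.
TRAINING_DATA = {
    "weakly_supervised_pairs": 200_000_000,
    "supervised_pairs": 1_800_000,
}
BACKBONE = {"name": "NomicBERT", "parameters": 137_000_000}

for setting in (SMALL, LARGE):
    print(setting)
```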

The contextual batching approach showed a strong correlation between batch difficulty and downstream performance: harder batches in contrastive learning lead to better gradient approximation and more effective learning. The contextual architecture improved performance across all downstream datasets, with notable improvements on smaller, out-of-domain datasets like ArguAna and SciFact. When trained at full scale, the model reaches its best performance after four epochs on the BGE meta-datasets. The model “cde-small-v1” obtained state-of-the-art results on the MTEB benchmark compared to same-size models, showing enhanced embedding performance across multiple task categories such as clustering, classification, and semantic similarity.

In summary, the paper from the Cornell University researchers makes two significant improvements to traditional “biencoder” models for generating embeddings. The first is an algorithm for reordering training data points to create more challenging batches, which improves on vanilla training with minimal modifications. The second is a corpus-aware architecture for retrieval, which enables the training of a state-of-the-art text embedding model. This contextual architecture effectively incorporates neighboring document information, addressing the limitations of context-independent embeddings.
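
The idea of reordering training data into harder batches can be sketched with a simple greedy grouping over precomputed embeddings. This is a deliberately simplified stand-in, not the paper's algorithm: the function name `neighbor_batches`, the greedy strategy, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def neighbor_batches(embeddings, batch_size):
    """Greedily group item indices so each batch holds similar items.

    Picks the first unassigned item as a seed, then fills the batch
    with the unassigned items most similar to that seed.
    """
    remaining = list(range(len(embeddings)))
    batches = []
    while remaining:
        seed = remaining.pop(0)
        # Rank unassigned items by similarity to the seed, most similar first.
        remaining.sort(key=lambda j: dot(embeddings[seed], embeddings[j]), reverse=True)
        batches.append([seed] + remaining[: batch_size - 1])
        remaining = remaining[batch_size - 1:]
    return batches


# Hypothetical embeddings: items 0 and 2 are neighbors, as are items 1 and 3.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
batches = neighbor_batches(toy_embeddings, batch_size=2)
print(batches)  # [[0, 2], [3, 1]]
```

A sequential split of the same data would pair items 0 and 1, which are unrelated and therefore easy to tell apart. Grouping by similarity places near-duplicates in the same batch, so the in-batch negatives become harder.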

All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
