AmbiGraph-Eval: A Benchmark for Resolving Ambiguity in Graph Query Generation
Semantic parsing converts natural language into formal query languages such as SQL or Cypher, letting users interact with databases more intuitively. Yet natural language is inherently ambiguous, often supporting multiple valid interpretations, while query languages demand exactness. Ambiguity in tabular (SQL) queries has been studied, but graph databases pose a harder problem: their interconnected structure means that queries over nodes and relationships often admit several readings. For example, “best evaluated restaurant” may refer either to the restaurant with the single highest individual rating or to the one with the best aggregate score.
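The restaurant example can be made concrete with a toy sketch. The data and names below are invented for illustration (the benchmark itself evaluates Cypher generation, not Python), but they show how the two readings of the same question select different answers:

```python
# Invented toy data: restaurant -> list of individual review ratings.
reviews = {
    "Alba": [5, 1, 1],  # one enthusiastic rating, low average
    "Brio": [4, 4, 4],  # consistently good, high average
}

# Reading 1: the restaurant holding the single highest individual rating.
best_single = max(reviews, key=lambda r: max(reviews[r]))        # "Alba"

# Reading 2: the restaurant with the highest aggregate (mean) score.
best_mean = max(reviews, key=lambda r: sum(reviews[r]) / len(reviews[r]))  # "Brio"
```

The two interpretations disagree, which is exactly the kind of divergence a semantic parser must resolve before emitting a query.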

Ambiguity poses serious risks in interactive systems: when semantic parsing fails, the executed query diverges from user intent, triggering unnecessary data retrieval and computation that waste time and resources. In high-stakes settings such as real-time decision-making, these errors degrade performance, raise operational costs, and reduce effectiveness. LLM-based semantic parsing is promising for complex, ambiguous queries because it can draw on linguistic knowledge and interactive clarification. However, LLMs suffer from self-preference bias: trained on human feedback, they may internalize annotator preferences, leading to systematic misalignment with actual user intent.

Researchers from Hong Kong Baptist University, the National University of Singapore, BIFOLD & TU Berlin, and Ant Group present a method to address ambiguity in graph query generation. They formalize ambiguity in graph database queries, categorizing it into three types: Attribute, Relationship, and Attribute-Relationship ambiguity. They introduce AmbiGraph-Eval, a benchmark of 560 ambiguous queries with corresponding graph database samples, and use it to test nine LLMs, analyzing how well each resolves ambiguity and where each falls short. The study finds that reasoning capabilities provide only a limited advantage, underscoring the importance of understanding graph ambiguity and mastering query syntax.
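Relationship ambiguity, in particular, can be sketched with a toy edge list (all names invented; in a real graph database these would be Cypher relationship types): a request like “movies by Reeves” does not say which edge to traverse.

```python
# Invented toy graph as (source, relationship, target) triples.
edges = [
    ("Reeves", "DIRECTED", "Film A"),
    ("Reeves", "ACTED_IN", "Film B"),
    ("Reeves", "ACTED_IN", "Film C"),
]

def movies_by(person, relation):
    """One candidate interpretation per relationship type."""
    return sorted(dst for src, rel, dst in edges if src == person and rel == relation)

directed = movies_by("Reeves", "DIRECTED")  # ["Film A"]
acted_in = movies_by("Reeves", "ACTED_IN")  # ["Film B", "Film C"]
```

Attribute-Relationship ambiguity compounds the problem: both the attribute to rank by and the edge to traverse are underspecified at once, which is why the paper treats it as a separate, harder category.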

The AmbiGraph-Eval benchmark is designed to evaluate LLMs’ ability to generate syntactically correct and semantically appropriate graph queries, such as Cypher, from ambiguous natural-language inputs. The dataset is created in two phases: data collection and human review. Ambiguous prompts come from three sources: direct extraction from graph databases, synthesis from unambiguous data using LLMs, and full generation by prompting LLMs to create new cases. For the evaluation, the researchers tested four closed-source LLMs (e.g., GPT-4, Claude-3.5-Sonnet) and four open-source LLMs (e.g., Qwen-2.5, LLaMA-3.1), running models through API calls or on 4× NVIDIA A40 GPUs.
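The article does not reproduce the scoring code, but a minimal execution-match scorer, assuming both the generated and the reference query can be run against the sample database, might look like the following sketch (`run_query` and the scoring rule are our assumptions, not the authors’ implementation):

```python
# Hypothetical scorer: a generated query counts as correct when its result
# set matches the reference query's result set on the sample database.
def execution_match(run_query, generated, reference):
    try:
        got = run_query(generated)
    except Exception:
        return False  # unrunnable / syntactically invalid queries score zero
    return sorted(got) == sorted(run_query(reference))

def accuracy(run_query, pairs):
    """Fraction of (generated, reference) pairs whose results match."""
    hits = sum(execution_match(run_query, gen, ref) for gen, ref in pairs)
    return hits / len(pairs)
```

Execution-based matching is a common choice for text-to-query benchmarks because it tolerates syntactic variation between queries that return identical results.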

Zero-shot results on AmbiGraph-Eval reveal clear disparities in how models resolve graph data ambiguities. On attribute ambiguity, O1-mini excels in same-entity (SE) scenarios, with GPT-4o and LLaMA-3.1 also performing well, while GPT-4o leads on cross-entity (CE) tasks, showing stronger reasoning across entities. On relationship ambiguity, LLaMA-3.1 leads overall; GPT-4o lags on SE tasks but excels on CE tasks. Attribute-relationship ambiguity proves the most challenging category, with LLaMA-3.1 best on SE tasks and GPT-4o dominating CE tasks. Overall, models struggle more with such multi-dimensional ambiguities than with isolated attribute or relationship ambiguity.

In conclusion, researchers introduced AmbiGraph-Eval, a benchmark for evaluating the ability of LLMs to resolve ambiguity in graph database queries. Evaluations of nine models reveal significant challenges in generating accurate Cypher statements, with strong reasoning skills offering only limited benefits. Core challenges include recognizing ambiguous intent, generating valid syntax, interpreting graph structures, and performing numerical aggregations. Ambiguity detection and syntax generation emerged as major bottlenecks hindering performance. To address these issues, future research should enhance models’ ambiguity resolution and syntax handling using methods like syntax-aware prompting and explicit ambiguity signaling.
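As a rough illustration of the suggested directions (the prompt wording below is ours, not the paper’s), syntax-aware prompting and explicit ambiguity signaling could be combined in a single prompt builder:

```python
# Hypothetical prompt builder: pairs a Cypher syntax reminder with an
# explicit instruction to surface ambiguity before answering.
SYNTAX_HINT = (
    "Write Cypher. Use MATCH/WHERE/RETURN; aggregate with avg(), count(), max().\n"
)
AMBIGUITY_HINT = (
    "If the request admits more than one reading (e.g. attribute vs. "
    "relationship), state each reading, then answer the most likely one.\n"
)

def build_prompt(schema, question):
    return f"{SYNTAX_HINT}{AMBIGUITY_HINT}Schema: {schema}\nQuestion: {question}\nCypher:"

prompt = build_prompt(
    "(:Restaurant)-[:HAS_REVIEW]->(:Review {rating})",
    "best evaluated restaurant",
)
```

The syntax hint targets the syntax-generation bottleneck, while the ambiguity hint forces the model to acknowledge competing readings rather than silently pick one.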


Check out the Technical Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.


