A Coding Implementation of Extracting Structured Data Using LangSmith, Pydantic, LangChain, and Claude 3.7 Sonnet

Unlock the power of structured data extraction with LangChain and Claude 3.7 Sonnet, transforming raw text into actionable insights. This tutorial focuses on tracing LLM tool calls with LangSmith, enabling real-time debugging and performance monitoring of your extraction system. We use Pydantic schemas for precise data formatting and LangChain's flexible prompting to guide Claude, refining the results through examples rather than a complex training pipeline. The result is a glimpse into LangSmith's capabilities, showing how to build robust extraction pipelines for applications ranging from document processing to automated data entry.

First, we need to install the necessary packages. We'll use langchain-core and langchain-anthropic to interface with the Claude model.

!pip install --upgrade langchain-core
!pip install langchain-anthropic

If you’re using LangSmith for tracing and debugging, set the following environment variables (in a notebook, this can be done via os.environ):

import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_API_KEY"] = "Your API KEY"
os.environ["LANGSMITH_PROJECT"] = "extraction_api"
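
With these variables set, every LangChain invocation in this tutorial is automatically traced to the extraction_api project in LangSmith, where you can inspect the rendered prompts, Claude's tool calls, and latencies.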

Next, we must define the schema for the information we want to extract. We’ll use Pydantic models to create a structured representation of a person.

from typing import Optional
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""


    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )
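
Two details matter here: every field is Optional, so the model can return null when information is missing, and the docstring plus Field descriptions are passed to the model as part of the schema, so clear wording directly improves extraction quality. We can sanity-check the schema locally before involving the model (a quick illustrative test with a made-up name, not part of the pipeline):

person = Person(name="Ada Lovelace", hair_color="black")
print(person.model_dump())
# {'name': 'Ada Lovelace', 'hair_color': 'black', 'height_in_meters': None}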

Now, we’ll define a prompt template that instructs Claude on how to perform the extraction task:

from langchain_core.prompts import ChatPromptTemplate


prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

This template provides clear instructions to the model about its task and how to handle missing information.
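
To see exactly what Claude will receive, we can render the template with a sample input; the comment shows representative output:

prompt_value = prompt_template.invoke({"text": "some sample text"})
print(prompt_value.to_messages())
# [SystemMessage(content='You are an expert extraction algorithm. ...'),
#  HumanMessage(content='some sample text')]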

Next, we’ll initialize the Claude model that will perform our information extraction:

import getpass
import os

if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-7-sonnet-20250219", model_provider="anthropic")

Now, we’ll configure our LLM to return structured output according to our schema:

structured_llm = llm.with_structured_output(schema=Person)

This key step tells the model to format its responses according to our Person schema. Under the hood, LangChain converts the Pydantic model into a tool definition that Claude can call, then parses the resulting tool call back into a validated Person instance, so we get Python objects instead of free-form text.

Let’s test our extraction system with a simple example:

text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)
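
If everything is wired correctly, the printed result is a validated Person instance rather than raw text, typically something like name='Alan Smith' hair_color='blond' height_in_meters='1.83' (exact values can vary between runs). Note that the model converts "6 feet" into meters because the field description asks for meters.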

Now, let's try a more complex example involving multiple people:

from typing import List


class Data(BaseModel):
    """Container for extracted information about people."""

    people: List[Person] = Field(
        default_factory=list, description="List of people mentioned in the text"
    )


structured_llm = llm.with_structured_output(schema=Data)


text = "My name is Jeff, my hair is black and I am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)
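
With the container schema, the model can return several entities from one pass. A typical result is Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)]); notice that the model resolves "the same color hair as me" and assigns black hair to Anna. Outputs may vary slightly between runs.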

Finally, let's verify that the extractor does not invent people when none are mentioned:

text = "The solar system is large, (it was discovered by Nicolaus Copernicus), but earth has only 1 moon."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)
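
Because this text mentions no people at all, a well-behaved run returns an empty container such as Data(people=[]); the optional fields and the "return null" instruction keep the model from inventing entities.

The introduction promised example-driven refinement. One lightweight way to steer the extractor, sketched below under our own assumptions (the example texts, the "examples" placeholder name, and the plain-text AIMessage outputs are illustrative and not part of the original tutorial), is to inject few-shot input/output pairs into the prompt with MessagesPlaceholder:

from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Same system instructions as before, plus a slot for few-shot example messages.
few_shot_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),
        ("human", "{text}"),
    ]
)

# Hypothetical examples: one negative case and one partial extraction.
examples = [
    HumanMessage("The ocean is vast and blue."),
    AIMessage('{"people": []}'),
    HumanMessage("Fiona traveled far from France to Spain."),
    AIMessage('{"people": [{"name": "Fiona", "hair_color": null, "height_in_meters": null}]}'),
]

prompt = few_shot_template.invoke({"text": text, "examples": examples})
result = structured_llm.invoke(prompt)
print(result)

A more faithful variant encodes the examples as actual tool-call messages, but plain-text pairs are often enough to nudge simple extractions, and the LangSmith traces configured earlier make it easy to compare prompt variants side by side.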

In conclusion, this tutorial demonstrates building a structured information extraction system with LangChain and Claude that transforms unstructured text into organized data about people. The approach uses Pydantic schemas, custom prompts, and example-driven improvement without requiring specialized training pipelines. The system’s power comes from its flexibility, domain adaptability, and utilization of advanced LLM reasoning capabilities.


Here is the Colab Notebook.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.


