Structured Extraction from a PDF

This notebook demonstrates structured_output() applied to a real-world task: extracting structured metadata from a research paper that has been converted to markdown.

We fetch the ReAct: Synergizing Reasoning and Acting in Language Models paper (Yao et al., 2022), parse its first three pages to markdown using pymupdf4llm, then ask the LLM to fill in a Pydantic model with the paper's key fields.

Chapter 3 concept: structured_output() accepts any text prompt and a Pydantic model class, and returns a validated instance of that model. Here we use it to turn unstructured PDF text into a typed Python object in a single call.
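
As a rough mental model of what a structured-output call involves, the sketch below uses plain Pydantic (v2 assumed) without any LLM. The model class yields a JSON schema describing the expected fields, and JSON text is validated into a typed instance. The Person model and the hard-coded raw_json reply are hypothetical stand-ins; how structured_output() is implemented inside llm-agents-from-scratch is not shown here and may differ.

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    """Hypothetical target model, used only for this illustration."""

    name: str = Field(description="Full name.")
    age: int = Field(description="Age in years.")


# 1. The schema derived from the model describes which fields are expected.
schema = Person.model_json_schema()
print(sorted(schema["properties"]))

# 2. Stand-in for the JSON text an LLM might return (hard-coded, no model call).
raw_json = '{"name": "Ada Lovelace", "age": 36}'

# 3. Validation turns the JSON text into a typed Python object.
person = Person.model_validate_json(raw_json)
print(person.age + 1)

# A reply that does not match the schema raises instead of passing through.
try:
    Person.model_validate_json('{"name": "Ada Lovelace", "age": "unknown"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The rest of the notebook applies the same idea to a larger model, with the LLM producing the JSON.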

In [1]:

# Uncomment the line below to install `llm-agents-from-scratch` from PyPI
# !pip install llm-agents-from-scratch

Running an Ollama service

To run the code in this notebook, you need Ollama installed on your local machine with its LLM hosting service running. To download Ollama, follow the instructions at https://ollama.com/download. Once it is installed, you can start the service by opening a terminal and running the command ollama serve. The cell below does this automatically if the server is not already up.

In [2]:

import os
import shutil
import subprocess
import time
import urllib.error
import urllib.request


def ensure_ollama(host="http://localhost:11434", timeout=15):
    """Start Ollama if not already running and wait until responsive."""

    def _up():
        try:
            urllib.request.urlopen(f"{host}/api/tags", timeout=1)
            return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            return False

    if _up():
        return print(f"✓ Ollama already running at {host}")

    # Look on PATH first, then the Lightning persistent path, then standard locations
    ollama_path = shutil.which("ollama")
    if ollama_path is None:
        for candidate in [
            "/teamspace/studios/this_studio/.local/bin/ollama",
            "/usr/local/bin/ollama",
            "/usr/bin/ollama",
        ]:
            if os.path.exists(candidate):
                ollama_path = candidate
                break
    if ollama_path is None:
        raise RuntimeError(
            "Could not find the ollama binary. Install with: "
            "curl -fsSL https://ollama.com/install.sh | sh",
        )

    print(f"Starting Ollama server ({ollama_path})...")
    subprocess.Popen(
        [ollama_path, "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    deadline = time.time() + timeout
    while time.time() < deadline:
        if _up():
            return print(f"✓ Ollama up and running at {host}")
        time.sleep(0.5)

    raise RuntimeError(f"Ollama did not start within {timeout}s")


ensure_ollama()

✓ Ollama already running at http://localhost:11434
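
A running server is only half of the setup: the model used later in this notebook (qwen3:14b) also has to be available locally. The sketch below reuses the /api/tags endpoint that ensure_ollama() already polls and reads the model names from its JSON response. The helper names are hypothetical, and the sample payload is hand-written rather than real server output.

```python
import json
import urllib.request


def model_names_from_tags(payload: bytes) -> list[str]:
    """Pull model names out of the JSON body returned by Ollama's /api/tags."""
    data = json.loads(payload)
    return [m["name"] for m in data.get("models", [])]


def list_local_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has available locally."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return model_names_from_tags(resp.read())


# Offline demonstration with a hand-written payload (not real server output).
sample = b'{"models": [{"name": "qwen3:14b"}, {"name": "llama3.2:latest"}]}'
names = model_names_from_tags(sample)
print("qwen3:14b" in names)
```

With the server up, list_local_models() returns the actual list. If qwen3:14b is missing, it can be fetched from a terminal with ollama pull qwen3:14b.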

Installing the PDF Parser

We use pymupdf4llm to convert PDF pages to markdown. It is not part of the core llm-agents-from-scratch dependencies, so we install it here.

In [3]:

!uv pip install pymupdf4llm

Audited 1 package in 0.66ms
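
If you prefer to check whether an optional dependency is already present before installing it, one possible approach uses only the standard library. The helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pymupdf4llm", "pydantic"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```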

Fetching the Paper

We download the full ReAct paper PDF directly from arXiv. Only its first three pages (PDF_PAGES) are parsed in the next step; they cover the title, authors, abstract, and opening sections, which is enough context for the LLM to fill in all extraction fields.

In [4]:

import tempfile
from pathlib import Path

import pymupdf4llm

PDF_URL = "https://arxiv.org/pdf/2210.03629"
PDF_PAGES = [0, 1, 2]  # title, abstract, intro

req = urllib.request.Request(
    PDF_URL,
    headers={"User-Agent": "llm-agents-from-scratch/1.0"},
)
with urllib.request.urlopen(req) as resp:
    pdf_bytes = resp.read()

print(f"Downloaded {len(pdf_bytes):,} bytes")

Downloaded 633,805 bytes
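
A byte count alone does not confirm that the response is a PDF; a server can also answer with an HTML error page. A minimal sanity check, assuming the file starts with the usual %PDF- marker, might look like this. The helper name looks_like_pdf and the sample byte strings are made up for illustration.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the marker b'%PDF-'."""
    return data.startswith(b"%PDF-")


# Hand-made byte strings for illustration only.
print(looks_like_pdf(b"%PDF-1.5\n..."))  # True
print(looks_like_pdf(b"<html><body>Not found</body></html>"))  # False
```

In the notebook this could be applied to pdf_bytes before parsing, raising an error if the check fails.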

Parsing PDF to Markdown

In [5]:

with tempfile.NamedTemporaryFile(
    suffix=".pdf",
    delete=False,
) as tmp:
    tmp.write(pdf_bytes)
    tmp_path = Path(tmp.name)

try:
    md_text = pymupdf4llm.to_markdown(str(tmp_path), pages=PDF_PAGES)
finally:
    tmp_path.unlink(missing_ok=True)

print(f"Extracted {len(md_text):,} characters of markdown")
print("--- preview (first 500 chars) ---")
print(md_text[:500])

Extracted 14,957 characters of markdown
--- preview (first 500 chars) ---
Published as a conference paper at ICLR 2023 

# REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS 

Shunyu Yao _[∗]_[*,1] , Jeffrey Zhao[2] , Dian Yu[2] , Nan Du[2] , Izhak Shafran[2] , Karthik Narasimhan[1] , Yuan Cao[2] 

1Department of Computer Science, Princeton University 

2Google Research, Brain team 

1{shunyuy,karthikn}@princeton.edu 

2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com 

## ABSTRACT 

While large language models (LLMs) have demonstrated impressive performanc
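
This notebook keeps the prompt small by parsing only three pages. An alternative, sketched below under the assumption that a simple character budget is good enough, is to trim the markdown at paragraph boundaries. The function truncate_markdown is a hypothetical helper, not part of pymupdf4llm or llm-agents-from-scratch.

```python
def truncate_markdown(text: str, max_chars: int) -> str:
    """Keep whole paragraphs from the start of `text` within `max_chars`."""
    kept: list[str] = []
    used = 0
    for para in text.split("\n\n"):
        # Account for the blank-line separator re-added between paragraphs.
        cost = len(para) + (2 if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(para)
        used += cost
    return "\n\n".join(kept)


sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph that is longer."
print(truncate_markdown(sample, max_chars=30))
```

Cutting at paragraph boundaries avoids handing the LLM a sentence that stops halfway, at the cost of sometimes using less than the full budget.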

Defining the Extraction Model

We define a Pydantic model that captures the key metadata fields we want to pull from the paper. The LLM will populate every field from the markdown text in a single structured_output() call.

In [6]:

from pydantic import BaseModel, Field


class PaperSummary(BaseModel):
    """Structured metadata extracted from a research paper."""

    title: str = Field(description="Full title of the paper.")
    authors: list[str] = Field(
        description="List of author names as they appear on the paper.",
    )
    year: int = Field(
        description="Year the paper was published or submitted.",
    )
    abstract: str = Field(
        description="The paper's abstract, faithfully reproduced.",
    )
    key_contributions: list[str] = Field(
        description=(
            "Three to five concise bullet points summarising "
            "the paper's main contributions."
        ),
    )
    primary_topic: str = Field(
        description=(
            "One short phrase describing the paper's primary research topic "
            "(e.g. 'LLM reasoning', 'tool use', 'multi-agent systems')."
        ),
    )
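
In PaperSummary, the "three to five" requirement lives only in the field description, so it is a hint to the LLM rather than an enforced rule. If you wanted validation to enforce it, one option (assuming Pydantic v2) is a length constraint on the list. The Contributions model below is a hypothetical cut-down example, not a replacement for PaperSummary.

```python
from pydantic import BaseModel, Field, ValidationError


class Contributions(BaseModel):
    """Hypothetical cut-down model showing a length-constrained list."""

    key_contributions: list[str] = Field(
        min_length=3,
        max_length=5,
        description="Three to five concise contribution statements.",
    )


ok = Contributions(key_contributions=["a", "b", "c"])
print(len(ok.key_contributions))

# Too few items fails validation instead of being accepted silently.
try:
    Contributions(key_contributions=["only one"])
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Whether a given LLM backend respects such constraints while generating is not guaranteed. With the constraint in place, an out-of-range answer would surface as a validation error rather than as a quietly short list.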

Extracting Structured Data

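We pass the markdown text as the prompt and PaperSummary as the target model. structured_output() returns a PaperSummary instance that has already been validated against the model's schema, so no manual parsing or post-processing is needed. Validation covers field types and structure; it does not check that the extracted values are faithful to the paper.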

In [7]:

from llm_agents_from_scratch.llms.ollama import OllamaLLM

llm = OllamaLLM(model="qwen3:14b", think=False)

prompt = (
    "Extract structured metadata from the following research paper.\n\n"
    f"{md_text}"
)

summary = await llm.structured_output(prompt=prompt, mdl=PaperSummary)
print(type(summary), "\n")
print(summary.model_dump())

<class '__main__.PaperSummary'> 

{'title': 'REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS', 'authors': ['Shunyu Yao', 'Jeffrey Zhao', 'Dian Yu', 'Nan Du', 'Izhak Shafran', 'Karthik Narasimhan', 'Yuan Cao'], 'year': 2023, 'abstract': 'While large language models (LLMs) have demonstrated impressive performance across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting while also interacting with external environments to incorporate additional information into reasoning. We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: HotPotQA, Fever, ALFWorld, and WebShop. ReAct outperforms existing methods in few-shot learning setups and demonstrates benefits in interpretability, trustworthiness, and diagnosability.', 'key_contributions': ['Introduce ReAct, a novel prompt-based paradigm to synergize reasoning and acting in language models for general task solving.', 'Perform extensive experiments across diverse benchmarks to showcase the advantage of ReAct in a few-shot learning setup over prior approaches that perform either reasoning or action generation in isolation.', 'Present systematic ablations and analysis to understand the importance of acting in reasoning tasks, and reasoning in interactive tasks.', 'Analyze the limitations of ReAct under the prompting setup and perform initial finetuning experiments showing the potential of ReAct to improve with additional training data.'], 'primary_topic': 'Synergizing reasoning and acting in language models for general task solving using the ReAct paradigm.'}
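
The abstract field asks for a faithful reproduction, which schema validation cannot verify. A simple follow-up check, sketched below with hypothetical helper names, tests whether an extracted string occurs in the source text once case and whitespace layout are ignored. The sample strings are hand-written for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def appears_verbatim(extracted: str, source: str) -> bool:
    """True if `extracted` occurs in `source`, ignoring case and whitespace."""
    return normalize(extracted) in normalize(source)


# Tiny hand-written strings for illustration only.
source = "## ABSTRACT\n\nWe study   reasoning\nand acting in language models."
print(appears_verbatim("We study reasoning and acting", source))  # True
print(appears_verbatim("We explore reasoning and acting", source))  # False
```

In the notebook, appears_verbatim(summary.abstract, md_text) would apply the same check to the extracted abstract. Because PDF-to-markdown conversion can insert formatting marks, a False result is a prompt to inspect the text by hand rather than proof of an error.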

Result

In [8]:

print(f"Title:          {summary.title}")
print(f"Year:           {summary.year}")
print(f"Authors:        {', '.join(summary.authors)}")
print(f"Primary topic:  {summary.primary_topic}")
print()
print("Abstract:")
print(summary.abstract)
print()
print("Key contributions:")
for point in summary.key_contributions:
    print(f"  • {point}")

Title:          REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS
Year:           2023
Authors:        Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Primary topic:  Synergizing reasoning and acting in language models for general task solving using the ReAct paradigm.

Abstract:
While large language models (LLMs) have demonstrated impressive performance across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting while also interacting with external environments to incorporate additional information into reasoning. We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: HotPotQA, Fever, ALFWorld, and WebShop. ReAct outperforms existing methods in few-shot learning setups and demonstrates benefits in interpretability, trustworthiness, and diagnosability.

Key contributions:
  • Introduce ReAct, a novel prompt-based paradigm to synergize reasoning and acting in language models for general task solving.
  • Perform extensive experiments across diverse benchmarks to showcase the advantage of ReAct in a few-shot learning setup over prior approaches that perform either reasoning or action generation in isolation.
  • Present systematic ablations and analysis to understand the importance of acting in reasoning tasks, and reasoning in interactive tasks.
  • Analyze the limitations of ReAct under the prompting setup and perform initial finetuning experiments showing the potential of ReAct to improve with additional training data.
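
Once the data is a typed object, it is straightforward to persist. The sketch below writes a plain dict to a JSON file and reads it back using only the standard library. The record dict is a shortened, hand-written stand-in for summary.model_dump(), and the output file name is an arbitrary choice.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for summary.model_dump(); values shortened for illustration.
record = {
    "title": "REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS",
    "year": 2023,
    "authors": ["Shunyu Yao", "Jeffrey Zhao"],
}

# Write to the system temp directory so the example leaves the project untouched.
out_path = Path(tempfile.gettempdir()) / "paper_summary.json"
out_path.write_text(
    json.dumps(record, indent=2, ensure_ascii=False),
    encoding="utf-8",
)

# Reading the file back should reproduce the original dict.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(loaded == record)
```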