Capstone 1 — Monte Carlo Estimation of Pi¶

Prerequisites: This capstone assumes you've read Chapters 2–4.

Setup Instructions¶

To ensure you have the required dependencies to run this notebook, you'll need to have our llm-agents-from-scratch framework installed on the running Jupyter kernel. To do this, you can launch this notebook with the following command while within the project's root directory:

# we need the notebook-utils and openai extras for this notebook
uv sync --extra notebook-utils --extra openai

# to launch the notebook
uv run --with jupyter jupyter lab

Alternatively, if you just want to use the published version of llm-agents-from-scratch without local development, you can install it from PyPI by uncommenting the cell below.

In [ ]:

Copied!

# Uncomment the line below to install `llm-agents-from-scratch` from PyPI
# !pip install 'llm-agents-from-scratch[notebook-utils,openai]'
# Uncomment the line below to install `llm-agents-from-scratch` from PyPI
# !pip install 'llm-agents-from-scratch[notebook-utils,openai]'

Running an Ollama service¶

To execute the code provided in this notebook, you’ll need to have Ollama installed on your local machine and have its LLM hosting service running. To download Ollama, follow the instructions found on this page: https://ollama.com/download. After downloading and installing Ollama, you can start a service by opening a terminal and running the command ollama serve.

If running on Runpod using the Runpod templates for this Capstone project, then an Ollama service will already be running for you.

Setup¶

In [1]:

Copied!

import logging
import os

from llm_agents_from_scratch.logger import enable_console_logging
import logging
import os

from llm_agents_from_scratch.logger import enable_console_logging

Constants¶

In [2]:

Copied!





IS_ON_RUNPOD = "RUNPOD_POD_ID" in os.environ
LOGGING_ENABLED = True
LOGGING_LEVEL = logging.INFO

# for task execution
MAX_STEPS = 20
NUM_REPLICATIONS = 10
IS_ON_RUNPOD = "RUNPOD_POD_ID" in os.environ
LOGGING_ENABLED = True
LOGGING_LEVEL = logging.INFO

# for task execution
MAX_STEPS = 20
NUM_REPLICATIONS = 10

In [3]:

Copied!





# Install additional dependencies for notebook
if IS_ON_RUNPOD:
    !uv pip install numpy pandas --system
else:
    !uv pip install numpy pandas
# Install additional dependencies for notebook
if IS_ON_RUNPOD:
    !uv pip install numpy pandas --system
else:
    !uv pip install numpy pandas

Audited 2 packages in 0.85ms

In [4]:

Copied!

# maybe enable logging
if LOGGING_ENABLED:
    enable_console_logging(LOGGING_LEVEL)
# maybe enable logging
if LOGGING_ENABLED:
    enable_console_logging(LOGGING_LEVEL)

LLMs¶

In [5]:

Copied!





if IS_ON_RUNPOD:
    backbone_llm = os.getenv("OLLAMA_MODEL")
    judge_llm = "gpt-5" if os.getenv("OPENAI_API_KEY") else backbone_llm
else:
    backbone_llm = "qwen3:8b"
    judge_llm = "gpt-5" if os.getenv("OPENAI_API_KEY") else backbone_llm
if IS_ON_RUNPOD:
    backbone_llm = os.getenv("OLLAMA_MODEL")
    judge_llm = "gpt-5" if os.getenv("OPENAI_API_KEY") else backbone_llm
else:
    backbone_llm = "qwen3:8b"
    judge_llm = "gpt-5" if os.getenv("OPENAI_API_KEY") else backbone_llm

In [6]:

Copied!

print(f"Backbone LLM: {backbone_llm}")
print(f"Judge LLM: {judge_llm}")
print(f"Backbone LLM: {backbone_llm}")
print(f"Judge LLM: {judge_llm}")

Backbone LLM: qwen3:8b
Judge LLM: gpt-5

Build Tools¶

(Listing 5.1) Tool: `generate_random_sample()`¶

In [7]:

Copied!





import uuid

import numpy as np
from pydantic import BaseModel, ConfigDict, Field, computed_field

from llm_agents_from_scratch.tools import PydanticFunctionTool

# Global registry to store samples
SAMPLE_REGISTRY: dict[str, list[tuple[float, float]]] = {}


class RandomSampleParams(BaseModel):
    """Params for generate_random_sample."""

    model_config = ConfigDict(extra="forbid")
    n: int = Field(description="The number of random points to generate")


class RandomSample(BaseModel):
    """Result from generate_random_sample."""

    sample_id: str = Field(
        description="Pass this sample_id to monte_carlo_estimate",
    )

    @computed_field
    @property
    def sample_size(
        self,
    ) -> int:
        """Determine n from SAMPLE_REGISTRY."""
        return len(SAMPLE_REGISTRY[self.sample_id])

    def __str__(self) -> str:
        """String representation of RandomSample."""
        return self.model_dump_json()


def generate_random_sample(params: RandomSampleParams) -> RandomSample:
    """Generate n random points in [0, 1] × [0, 1].

    Returns a sample_id. Pass this sample_id directly to monte_carlo_estimate.
    """
    pts = np.random.uniform(size=(params.n, 2))

    sample_id = str(uuid.uuid4())
    SAMPLE_REGISTRY[sample_id] = [tuple(pt) for pt in pts.tolist()]

    return RandomSample(sample_id=sample_id)


# generate random sample tool
random_sample_tool = PydanticFunctionTool(generate_random_sample)
import uuid

import numpy as np
from pydantic import BaseModel, ConfigDict, Field, computed_field

from llm_agents_from_scratch.tools import PydanticFunctionTool

# Global registry to store samples
SAMPLE_REGISTRY: dict[str, list[tuple[float, float]]] = {}


class RandomSampleParams(BaseModel):
    """Params for generate_random_sample."""

    model_config = ConfigDict(extra="forbid")
    n: int = Field(description="The number of random points to generate")


class RandomSample(BaseModel):
    """Result from generate_random_sample."""

    sample_id: str = Field(
        description="Pass this sample_id to monte_carlo_estimate",
    )

    @computed_field
    @property
    def sample_size(
        self,
    ) -> int:
        """Determine n from SAMPLE_REGISTRY."""
        return len(SAMPLE_REGISTRY[self.sample_id])

    def __str__(self) -> str:
        """String representation of RandomSample."""
        return self.model_dump_json()


def generate_random_sample(params: RandomSampleParams) -> RandomSample:
    """Generate n random points in [0, 1] × [0, 1].

    Returns a sample_id. Pass this sample_id directly to monte_carlo_estimate.
    """
    pts = np.random.uniform(size=(params.n, 2))

    sample_id = str(uuid.uuid4())
    SAMPLE_REGISTRY[sample_id] = [tuple(pt) for pt in pts.tolist()]

    return RandomSample(sample_id=sample_id)


# generate random sample tool
random_sample_tool = PydanticFunctionTool(generate_random_sample)

Demonstration¶

In [8]:

Copied!





from llm_agents_from_scratch.data_structures import ToolCall

rs_tool_call = ToolCall(
    tool_name=random_sample_tool.name,
    arguments={"n": 5000},
)
rs_tool_call_result = random_sample_tool(rs_tool_call)
rs_tool_call_result
from llm_agents_from_scratch.data_structures import ToolCall

rs_tool_call = ToolCall(
    tool_name=random_sample_tool.name,
    arguments={"n": 5000},
)
rs_tool_call_result = random_sample_tool(rs_tool_call)
rs_tool_call_result

Out[8]:

ToolCallResult(tool_call_id='92b63caf-caa4-4787-ba5a-f6c3ce49e966', content='{"sample_id":"40c260ba-cc15-4538-9657-c2b88bce0aa0","sample_size":5000}', error=False)

(Listing 5.2) Tool: `add_more_points()`¶

In [9]:

Copied!





class AddPointsParams(BaseModel):
    """Params for add_more_points_to_sample."""

    model_config = ConfigDict(extra="forbid")
    sample_id: str = Field(
        description="The sample_id of the sample to augment",
    )
    n: int = Field(description="The number of random points to generate")


def add_more_points_to_sample(params: AddPointsParams) -> RandomSample:
    """Add n more random points to an existing random sample.

    Returns a sample_id and the total number of points.
    """
    pts = np.random.uniform(size=(params.n, 2))

    # augment sample
    SAMPLE_REGISTRY[params.sample_id] += [tuple(pt) for pt in pts.tolist()]

    return RandomSample(sample_id=params.sample_id)


# create tool
add_more_points_tool = PydanticFunctionTool(add_more_points_to_sample)
class AddPointsParams(BaseModel):
    """Params for add_more_points_to_sample."""

    model_config = ConfigDict(extra="forbid")
    sample_id: str = Field(
        description="The sample_id of the sample to augment",
    )
    n: int = Field(description="The number of random points to generate")


def add_more_points_to_sample(params: AddPointsParams) -> RandomSample:
    """Add n more random points to an existing random sample.

    Returns a sample_id and the total number of points.
    """
    pts = np.random.uniform(size=(params.n, 2))

    # augment sample
    SAMPLE_REGISTRY[params.sample_id] += [tuple(pt) for pt in pts.tolist()]

    return RandomSample(sample_id=params.sample_id)


# create tool
add_more_points_tool = PydanticFunctionTool(add_more_points_to_sample)

Demonstration¶

In [10]:

Copied!





# get the sample ID of the previous random_sample_tool() invocation
random_sample = RandomSample.model_validate_json(rs_tool_call_result.content)

# build tool call for add more points
add_pts_tool_call = ToolCall(
    tool_name=add_more_points_tool.name,
    arguments={
        "sample_id": random_sample.sample_id,
        "n": 500,
    },
)
add_pts_tool_call_result = add_more_points_tool(add_pts_tool_call)
add_pts_tool_call_result
# get the sample ID of the previous random_sample_tool() invocation
random_sample = RandomSample.model_validate_json(rs_tool_call_result.content)

# build tool call for add more points
add_pts_tool_call = ToolCall(
    tool_name=add_more_points_tool.name,
    arguments={
        "sample_id": random_sample.sample_id,
        "n": 500,
    },
)
add_pts_tool_call_result = add_more_points_tool(add_pts_tool_call)
add_pts_tool_call_result

Out[10]:

ToolCallResult(tool_call_id='f9760b27-754e-420a-a9c6-f47e73226f9a', content='{"sample_id":"40c260ba-cc15-4538-9657-c2b88bce0aa0","sample_size":5500}', error=False)

(Listing 5.3) Tool: `monte_carlo_estimate()`¶

In [11]:

Copied!





class MonteCarloEstimateParams(BaseModel):
    """Params for monte_carlo_estimate."""

    model_config = ConfigDict(extra="forbid")
    sample_id: str = Field(
        description="The sample_id returned by generate_random_sample",
    )


class MonteCarloEstimateResult(BaseModel):
    """Results for monte_carlo_estimate."""

    sample_id: str
    sample_size: int
    estimate: float

    def __str__(self) -> str:
        """String representation of MonteCarloEstimateResult."""
        return self.model_dump_json()


def monte_carlo_estimate(
    params: MonteCarloEstimateParams,
) -> MonteCarloEstimateResult:
    """Estimate pi using Monte Carlo method.

    Args:
        params: Contains sample_id from generate_random_sample.

    Returns:
        Estimate of pi (float).
    """
    points = SAMPLE_REGISTRY[params.sample_id]
    n = len(points)
    inside = sum((x**2 + y**2) < 1 for x, y in points)
    return MonteCarloEstimateResult(
        estimate=(inside / n) * 4,
        sample_id=params.sample_id,
        sample_size=n,
    )


# create tool
monte_carlo_estimate_tool = PydanticFunctionTool(monte_carlo_estimate)
class MonteCarloEstimateParams(BaseModel):
    """Params for monte_carlo_estimate."""

    model_config = ConfigDict(extra="forbid")
    sample_id: str = Field(
        description="The sample_id returned by generate_random_sample",
    )


class MonteCarloEstimateResult(BaseModel):
    """Results for monte_carlo_estimate."""

    sample_id: str
    sample_size: int
    estimate: float

    def __str__(self) -> str:
        """String representation of MonteCarloEstimateResult."""
        return self.model_dump_json()


def monte_carlo_estimate(
    params: MonteCarloEstimateParams,
) -> MonteCarloEstimateResult:
    """Estimate pi using Monte Carlo method.

    Args:
        params: Contains sample_id from generate_random_sample.

    Returns:
        Estimate of pi (float).
    """
    points = SAMPLE_REGISTRY[params.sample_id]
    n = len(points)
    inside = sum((x**2 + y**2) < 1 for x, y in points)
    return MonteCarloEstimateResult(
        estimate=(inside / n) * 4,
        sample_id=params.sample_id,
        sample_size=n,
    )


# create tool
monte_carlo_estimate_tool = PydanticFunctionTool(monte_carlo_estimate)

Demonstration¶

In [12]:

Copied!





# build tool call for estimating Pi
mc_estimate_tool_call = ToolCall(
    tool_name=monte_carlo_estimate_tool.name,
    arguments={
        "sample_id": random_sample.sample_id,
    },
)
mc_estimate_tool_call_result = monte_carlo_estimate_tool(mc_estimate_tool_call)
mc_estimate_tool_call_result
# build tool call for estimating Pi
mc_estimate_tool_call = ToolCall(
    tool_name=monte_carlo_estimate_tool.name,
    arguments={
        "sample_id": random_sample.sample_id,
    },
)
mc_estimate_tool_call_result = monte_carlo_estimate_tool(mc_estimate_tool_call)
mc_estimate_tool_call_result

Out[12]:

ToolCallResult(tool_call_id='efbafba8-de8c-4b82-9bc1-0428cf247a04', content='{"sample_id":"40c260ba-cc15-4538-9657-c2b88bce0aa0","sample_size":5500,"estimate":3.1512727272727274}', error=False)

Define the Task¶

(Listing 5.4) Writing the task instruction¶

In [13]:

Copied!





instruction = """
You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the estimate falls in the range [3.1415, 3.1425).
Any value from 3.1415 up to (but not including) 3.1425 is a success.

Examples:
- 3.14159 ✓ (within range)
- 3.14200 ✓ (within range)
- 3.14149 ✗ (too low)
- 3.14250 ✗ (too high)

<algorithm>
1. Call generate_random_sample(1000000) to start with 1M points
2. Call monte_carlo_estimate(sample_id) to get estimate
3. Check: is the estimate between 3.1415 and 3.1425?
   - YES → Report success and STOP
   - NO → Continue to step 4
4. Call add_more_points_to_sample, doubling the points each time:
   - First add: 1 million
   - Second add: 2 million
   - Third add: 4 million
   - And so on, doubling each iteration
5. After adding points, go back to step 2

Exponential growth ensures faster convergence while demonstrating adaptive
sampling.
</algorithm>

<critical_rules>
- If the task is not complete, your response MUST contain a tool call
- Do not just describe what you plan to do—actually call the tool
- Do not stop until the estimate falls within the target range
- Keep track of your iteration to calculate the correct doubling amount
- NEVER fabricate tool results-only use actual tool responses
- NEVER invent a sample_id
</critical_rules>

<final_output>
When the estimate reaches the target precision, respond with this exact JSON
structure and nothing else:

{"sample_id": "<the-actual-sample-id-from-tool-response>"}

No explanation, no markdown formatting, no code blocks—just the raw JSON.
</final_output>

Begin by calling generate_random_sample(1000000).
""".strip()
instruction = """
You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the estimate falls in the range [3.1415, 3.1425).
Any value from 3.1415 up to (but not including) 3.1425 is a success.

Examples:
- 3.14159 ✓ (within range)
- 3.14200 ✓ (within range)
- 3.14149 ✗ (too low)
- 3.14250 ✗ (too high)


1. Call generate_random_sample(1000000) to start with 1M points
2. Call monte_carlo_estimate(sample_id) to get estimate
3. Check: is the estimate between 3.1415 and 3.1425?
   - YES → Report success and STOP
   - NO → Continue to step 4
4. Call add_more_points_to_sample, doubling the points each time:
   - First add: 1 million
   - Second add: 2 million
   - Third add: 4 million
   - And so on, doubling each iteration
5. After adding points, go back to step 2

Exponential growth ensures faster convergence while demonstrating adaptive
sampling.


- If the task is not complete, your response MUST contain a tool call
- Do not just describe what you plan to do—actually call the tool
- Do not stop until the estimate falls within the target range
- Keep track of your iteration to calculate the correct doubling amount
- NEVER fabricate tool results-only use actual tool responses
- NEVER invent a sample_id


When the estimate reaches the target precision, respond with this exact JSON
structure and nothing else:

{"sample_id": ""}

No explanation, no markdown formatting, no code blocks—just the raw JSON.


Begin by calling generate_random_sample(1000000).
""".strip()

(Listing 5.5) The Task¶

In [14]:

Copied!

from llm_agents_from_scratch.data_structures import Task

task = Task(
    instruction=instruction,
)
from llm_agents_from_scratch.data_structures import Task

task = Task(
    instruction=instruction,
)

(Listing 5.6) Creating our LLMAgent¶

In [15]:

Copied!





from llm_agents_from_scratch import LLMAgent
from llm_agents_from_scratch.llms import OllamaLLM

llm = OllamaLLM(backbone_llm)
llm_agent = LLMAgent(
    llm=llm,
    tools=[
        random_sample_tool,
        add_more_points_tool,
        monte_carlo_estimate_tool,
    ],
)
from llm_agents_from_scratch import LLMAgent
from llm_agents_from_scratch.llms import OllamaLLM

llm = OllamaLLM(backbone_llm)
llm_agent = LLMAgent(
    llm=llm,
    tools=[
        random_sample_tool,
        add_more_points_tool,
        monte_carlo_estimate_tool,
    ],
)

Perform the Task¶

In [16]:

Copied!

handler = llm_agent.run(task, max_steps=MAX_STEPS)
handler = llm_agent.run(task, max_steps=MAX_STEPS)

INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9"}}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9","sample_size":1000000,"estimate":3.141968}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate from the first sample is 3.141968, which falls within the target range [3.1415, 3.1425). Therefore, the task is complete. ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      No new step required.
INFO (llm_agents_fs.LLMAgent) :      🏁 Task completed: {"sample_id": "60059507-d97f-414e-9e12-3ad0f5aa22b9"}

In [17]:

Copied!

# if need to cancel uncomment code below
# handler.cancel()  # noqa: ERA001
# if need to cancel uncomment code below
# handler.cancel()  # noqa: ERA001

In [18]:

Copied!

handler.done()
handler.done()

Out[18]:

True

In [19]:

Copied!

if handler.done():
    # check if there was an error
    handler.exception()
if handler.done():
    # check if there was an error
    handler.exception()

In [20]:

Copied!

print(handler.rollout)
print(handler.rollout)

=== Task Step Start ===

💬 assistant: My current instruction is 'You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the estimate falls in the range [3.1415, 3.1425).
Any value from 3.1415 up to (but not including) 3.1425 is a success.

Examples:
- 3.14159 ✓ (within range)
- 3.14200 ✓ (within range)
- 3.14149 ✗ (too low)
- 3.14250 ✗ (too high)

<algorithm>
1. Call generate_random_sample(1000000) to start with 1M points
2. Call monte_carlo_estimate(sample_id) to get estimate
3. Check: is the estimate between 3.1415 and 3.1425?
   - YES → Report success and STOP
   - NO → Continue to step 4
4. Call add_more_points_to_sample, doubling the points each time:
   - First add: 1 million
   - Second add: 2 million
   - Third add: 4 million
   - And so on, doubling each iteration
5. After adding points, go back to step 2

Exponential growth ensures faster convergence while demonstrating adaptive
sampling.
</algorithm>

<critical_rules>
- If the task is not complete, your response MUST contain a tool call
- Do not just describe what you plan to do—actually call the tool
- Do not stop until the estimate falls within the target range
- Keep track of your iteration to calculate the correct doubling amount
- NEVER fabricate tool results-only use actual tool responses
- NEVER invent a sample_id
</critical_rules>

<final_output>
When the estimate reaches the target precision, respond with this exact JSON
structure and nothing else:

{"sample_id": "<the-actual-sample-id-from-tool-response>"}

No explanation, no markdown formatting, no code blocks—just the raw JSON.
</final_output>

Begin by calling generate_random_sample(1000000).'

💬 assistant: I need to make the following tool call(s):

{
    "id_": "d20e8901-62ee-4c9d-b644-542fa77b5b5e",
    "tool_name": "generate_random_sample",
    "arguments": {
        "n": 1000000
    }
}.

🔧 tool: {
    "tool_call_id": "d20e8901-62ee-4c9d-b644-542fa77b5b5e",
    "content": "{\"sample_id\":\"60059507-d97f-414e-9e12-3ad0f5aa22b9\",\"sample_size\":1000000}",
    "error": false
}

💬 assistant: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9"}}
</tool_call>

=== Task Step End ===

=== Task Step Start ===

💬 assistant: My current instruction is '{"name": "monte_carlo_estimate", "arguments": {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9"}}'

💬 assistant: I need to make the following tool call(s):

{
    "id_": "c0c3a08c-6b61-40fe-9539-9dd53a548157",
    "tool_name": "monte_carlo_estimate",
    "arguments": {
        "sample_id": "60059507-d97f-414e-9e12-3ad0f5aa22b9"
    }
}.

🔧 tool: {
    "tool_call_id": "c0c3a08c-6b61-40fe-9539-9dd53a548157",
    "content": "{\"sample_id\":\"60059507-d97f-414e-9e12-3ad0f5aa22b9\",\"sample_size\":1000000,\"estimate\":3.141968}",
    "error": false
}

💬 assistant: The estimate from the first sample is 3.141968, which falls within the target range [3.1415, 3.1425). Therefore, the task is complete. 

I will now provide the final output with the sample_id.

=== Task Step End ===

In [21]:

Copied!

result = handler.exception() or handler.result()
result
result = handler.exception() or handler.result()
result

Out[21]:

TaskResult(task_id='d0c1ddd1-1f9b-45ec-937a-618febaab4a2', content='{"sample_id": "60059507-d97f-414e-9e12-3ad0f5aa22b9"}')

Evaluation¶

(Listing 5.7) Evaluating Task Success¶

In [22]:

Copied!





import json
from json import JSONDecodeError

from pydantic import ValidationError


def estimate_has_target_precision(estimate: MonteCarloEstimateResult) -> bool:
    """Checks if the estimate achieved the desired precision.

    Target precision is 3 decimal places (3.142), meaning the estimate
    should be between 3.1415 and 3.1425.
    """
    upper_bound = 3.1425
    lower_bound = 3.1415
    return lower_bound <= estimate.estimate < upper_bound


def is_task_success(
    handler: LLMAgent.TaskHandler,
    verbose: bool = False,
) -> bool:
    """Determines task success.

    Args:
        handler (LLMAgent.TaskHandler): The handler containing the
            result or exception of the task execution
        verbose (bool): Whether to print out details of the
            determination. Defaults to False.

    Returns:
        bool: True if task was successful. False, otherwise.
    """
    if handler.exception():
        if verbose:
            print(handler.exception())
        return False

    result = handler.result()
    try:
        output_data = json.loads(result.content)
        sample_id = output_data["sample_id"]
        params = MonteCarloEstimateParams(
            sample_id=sample_id,
        )
        estimate = monte_carlo_estimate(params)
        if verbose:
            print(
                f"Estimate: {estimate}",
            )
        return estimate_has_target_precision(estimate)
    except (ValidationError, KeyError, JSONDecodeError) as e:
        # invalid sample_id provided by LLM agent—unsuccessful task
        if verbose:
            print(f"The LLM agent returned an invalid output: {str(e)}.")
        return False
import json
from json import JSONDecodeError

from pydantic import ValidationError


def estimate_has_target_precision(estimate: MonteCarloEstimateResult) -> bool:
    """Checks if the estimate achieved the desired precision.

    Target precision is 3 decimal places (3.142), meaning the estimate
    should be between 3.1415 and 3.1425.
    """
    upper_bound = 3.1425
    lower_bound = 3.1415
    return lower_bound <= estimate.estimate < upper_bound


def is_task_success(
    handler: LLMAgent.TaskHandler,
    verbose: bool = False,
) -> bool:
    """Determines task success.

    Args:
        handler (LLMAgent.TaskHandler): The handler containing the
            result or exception of the task execution
        verbose (bool): Whether to print out details of the
            determination. Defaults to False.

    Returns:
        bool: True if task was successful. False, otherwise.
    """
    if handler.exception():
        if verbose:
            print(handler.exception())
        return False

    result = handler.result()
    try:
        output_data = json.loads(result.content)
        sample_id = output_data["sample_id"]
        params = MonteCarloEstimateParams(
            sample_id=sample_id,
        )
        estimate = monte_carlo_estimate(params)
        if verbose:
            print(
                f"Estimate: {estimate}",
            )
        return estimate_has_target_precision(estimate)
    except (ValidationError, KeyError, JSONDecodeError) as e:
        # invalid sample_id provided by LLM agent—unsuccessful task
        if verbose:
            print(f"The LLM agent returned an invalid output: {str(e)}.")
        return False

In [23]:

Copied!

is_task_success(handler, verbose=True)
is_task_success(handler, verbose=True)

Estimate: {"sample_id":"60059507-d97f-414e-9e12-3ad0f5aa22b9","sample_size":1000000,"estimate":3.141968}

Out[23]:

True

Trajectory Analysis¶

In [24]:

Copied!





if judge_llm.startswith("gpt-"):
    from llm_agents_from_scratch.llms.openai import OpenAILLM

    trajectory_judge = OpenAILLM(model=judge_llm)
else:
    # fallback to Ollama model
    trajectory_judge = OllamaLLM(model=judge_llm)
if judge_llm.startswith("gpt-"):
    from llm_agents_from_scratch.llms.openai import OpenAILLM

    trajectory_judge = OpenAILLM(model=judge_llm)
else:
    # fallback to Ollama model
    trajectory_judge = OllamaLLM(model=judge_llm)

(Listing 5.8) Rubric for LLM judge¶

In [25]:

Copied!





class TrajectoryEvalRubric(BaseModel):
    """Rubric for evaluating an execution trajectory."""

    reached_target_precision: bool = Field(
        description="True if agent achieved estimate that rounds to 3.142",
    )

    completed_without_max_steps: bool = Field(
        description=(
            "True if agent completed task without hitting max steps limit"
        ),
    )

    always_added_points_before_reestimating: bool = Field(
        description=(
            "False if agent called monte_carlo_estimate consecutively more "
            "than once before adding points"
        ),
    )

    reused_sample: bool = Field(
        description=(
            "True if agent used add_more_points_to_sample to grow the sample "
            "instead of creating new samples"
        ),
    )

    no_false_completion: bool = Field(
        description=(
            "True if agent only claimed success when the actual tool result "
            "showed 3.142. False if agent claimed convergence based on a "
            "fabricated or misread estimate."
        ),
    )

    no_missed_completion: bool = Field(
        description=(
            "True if agent stopped when estimate reached 3.142. False if "
            "agent continued adding points after already achieving target."
        ),
    )

    followed_output_format: bool = Field(
        description=(
            "True if agent's final response contained only the sample_id "
            "as instructed, with no additional text or explanation."
        ),
    )

    largest_sample_size: int | None = Field(
        description=(
            "The largest sample size achieved during the trajectory, "
            "or None if not determinable from tool outputs"
        ),
    )

    summary: str = Field(
        description="One sentence summary of trajectory quality",
    )
class TrajectoryEvalRubric(BaseModel):
    """Rubric for evaluating an execution trajectory."""

    reached_target_precision: bool = Field(
        description="True if agent achieved estimate that rounds to 3.142",
    )

    completed_without_max_steps: bool = Field(
        description=(
            "True if agent completed task without hitting max steps limit"
        ),
    )

    always_added_points_before_reestimating: bool = Field(
        description=(
            "False if agent called monte_carlo_estimate consecutively more "
            "than once before adding points"
        ),
    )

    reused_sample: bool = Field(
        description=(
            "True if agent used add_more_points_to_sample to grow the sample "
            "instead of creating new samples"
        ),
    )

    no_false_completion: bool = Field(
        description=(
            "True if agent only claimed success when the actual tool result "
            "showed 3.142. False if agent claimed convergence based on a "
            "fabricated or misread estimate."
        ),
    )

    no_missed_completion: bool = Field(
        description=(
            "True if agent stopped when estimate reached 3.142. False if "
            "agent continued adding points after already achieving target."
        ),
    )

    followed_output_format: bool = Field(
        description=(
            "True if agent's final response contained only the sample_id "
            "as instructed, with no additional text or explanation."
        ),
    )

    largest_sample_size: int | None = Field(
        description=(
            "The largest sample size achieved during the trajectory, "
            "or None if not determinable from tool outputs"
        ),
    )

    summary: str = Field(
        description="One sentence summary of trajectory quality",
    )

(Listing 5.9) LLM judge instruction prompt¶

In [26]:

Copied!





judge_prompt_template = """Evaluate this Monte Carlo pi estimation trajectory.

The agent had three tools:
- `generate_random_sample(n)` - Creates NEW sample
- `add_more_points_to_sample(sample_id, n)` - Adds points to EXISTING sample
- `monte_carlo_estimate(sample_id)` - Returns pi estimate

Correct behavior:
1. Create sample once
2. Estimate → if not between 3.1415 and 3.1425,
   add points → re-estimate → repeat
3. When target reached, respond with ONLY the sample_id (no other text)

Note: If final_response is "Max steps error", the agent failed to complete
the task within the allowed number of steps.

HALLUCINATION MARKER: If you see "💬 assistant: 🔧 tool:" in the trajectory,
the agent fabricated a tool response instead of waiting for the actual result.
This is a critical failure—set no_false_completion to False.

<final_response>
{result}
</final_response>

<trajectory>
{trajectory}
</trajectory>

Evaluate and submit your judgment.""".strip()
judge_prompt_template = """Evaluate this Monte Carlo pi estimation trajectory.

The agent had three tools:
- `generate_random_sample(n)` - Creates NEW sample
- `add_more_points_to_sample(sample_id, n)` - Adds points to EXISTING sample
- `monte_carlo_estimate(sample_id)` - Returns pi estimate

Correct behavior:
1. Create sample once
2. Estimate → if not between 3.1415 and 3.1425,
   add points → re-estimate → repeat
3. When target reached, respond with ONLY the sample_id (no other text)

Note: If final_response is "Max steps error", the agent failed to complete
the task within the allowed number of steps.

HALLUCINATION MARKER: If you see "💬 assistant: 🔧 tool:" in the trajectory,
the agent fabricated a tool response instead of waiting for the actual result.
This is a critical failure—set no_false_completion to False.


{result}


{trajectory}


Evaluate and submit your judgment.""".strip()

In [27]:

Copied!





trajectory_eval = await trajectory_judge.structured_output(
    prompt=judge_prompt_template.format(
        result=str(result),
        trajectory=handler.rollout,
    ),
    mdl=TrajectoryEvalRubric,
)
trajectory_eval = await trajectory_judge.structured_output(
    prompt=judge_prompt_template.format(
        result=str(result),
        trajectory=handler.rollout,
    ),
    mdl=TrajectoryEvalRubric,
)

In [28]:

Copied!

print(trajectory_eval.model_dump_json(indent=4))
print(trajectory_eval.model_dump_json(indent=4))

{
    "reached_target_precision": true,
    "completed_without_max_steps": true,
    "always_added_points_before_reestimating": true,
    "reused_sample": false,
    "no_false_completion": true,
    "no_missed_completion": true,
    "followed_output_format": true,
    "largest_sample_size": 1000000,
    "summary": "Agent created one sample, achieved an in-range estimate on the first evaluation, and returned only the sample_id in the correct format without hallucinations."
}

Replications for a more reliable evaluation¶

In this section, we'll repeat the task multiple times to get a more robust evaluation of our LLM agent's performance.

(Listing 5.10) Repeated task executions with our LLM agent¶

In [29]:

Copied!





handlers = []
for _ in range(NUM_REPLICATIONS):
    h = llm_agent.run(task, max_steps=MAX_STEPS)
    handlers.append(h)
handlers = []
for _ in range(NUM_REPLICATIONS):
    h = llm_agent.run(task, max_steps=MAX_STEPS)
    handlers.append(h)

INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.LLMAgent) :      🚀 Starting task: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means the...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: You are tasked with estimating pi using Monte Carlo methods.

TARGET: Get an estimate accurate to 3 decimal places.
Success means ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"4db769e3-f379-4dee-9d7f-85c3c68102b3","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"f9951e3b-103e-41ad-9ebc-b643e54cf375","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"984c9a5c-594b-4054-a06c-d590e1561d90","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: generate_random_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"0407bfff-3918-40db-9cf1-f44777272b00","sample_size":1000000}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"f9951e3b-103e-41ad-9ebc-b643e54cf375"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id": "984c9a5c-594b-4054-a06c-d590e1561d90"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"0407bfff-3918-40db-9cf1-f44777272b00"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"f9951e3b-103e-41ad-9ebc-b643e54cf375"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"f9951e3b-103e-41ad-9ebc-b643e54cf375"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id": "984c9a5c-594b-4054-a06c-d590e1561d90"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id": "984c9a5c-594b-4054-a06c-d590e1561d90"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"0407bfff-3918-40db-9cf1-f44777272b00"}}
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"name": "monte_carlo_estimate", "arguments": {"sample_id":"0407bfff-3918-40db-9cf1-f44777272b00"}}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e","sample_size":1000000,"estimate":3.143148}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"4db769e3-f379-4dee-9d7f-85c3c68102b3","sample_size":1000000,"estimate":3.145144}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4","sample_size":1000000,"estimate":3.142488}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959","sample_size":1000000,"estimate":3.14308}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea","sample_size":1000000,"estimate":3.14312}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"f9951e3b-103e-41ad-9ebc-b643e54cf375","sample_size":1000000,"estimate":3.1418}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613","sample_size":1000000,"estimate":3.144024}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"984c9a5c-594b-4054-a06c-d590e1561d90","sample_size":1000000,"estimate":3.143396}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91","sample_size":1000000,"estimate":3.140616}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"0407bfff-3918-40db-9cf1-f44777272b00","sample_size":1000000,"estimate":3.141684}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1 million points is 3.143148, which is outside the target range [3.1415, 3.1425). I need to add more poin...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate from the first sample is 3.145144, which is outside the target range [3.1415, 3.1425). I need to add more points to improv...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1,000,000 points is **3.142488**, which is slightly below the target range of [3.1415, 3.1425). I need to...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate from the first sample is 3.14308, which is outside the target range of [3.1415, 3.1425). I need to add more points to the ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1 million points is 3.14312, which is outside the target range [3.1415, 3.1425). I need to add more point...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1 million points is 3.1418, which falls within the target range [3.1415, 3.1425). Therefore, the task is ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1 million points is 3.144024, which is outside the target range [3.1415, 3.1425). I need to add more poin...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1 million points is 3.143396, which is outside the target range [3.1415, 3.1425). I need to add more poin...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate from the first sample is 3.140616, which is below the target range of [3.1415, 3.1425). I need to add more points to impro...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 1,000,000 points is **3.141684**, which is within the target range of [3.1415, 3.1425). 

✅ **Success!** ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "243e8d3c-6c6f-4d8e-8d6e-3d9e8d3c6c6f",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "243e8d3c-6c6f-4d8e-8d6e-3d9e8d3c6c6f",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "8f5a5e8c-3e8d-4a8b-9d7e-0a7b6c5d4e3f",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "8f5a5e8c-3e8d-4a8b-9d7e-0a7b6c5d4e3f",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "362d3f8e-4e6c-4f5a-8d8e-5c8d3f8e4e6c",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "362d3f8e-4e6c-4f5a-8d8e-5c8d3f8e4e6c",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "6f1d7c3e-2b9a-4d8e-8c4a-9f0e8d1a2b3c",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "6f1d7c3e-2b9a-4d8e-8c4a-9f0e8d1a2b3c",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "4e65936e-9225-4943-998b-87d522b15d6f",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "4e65936e-9225-4943-998b-87d522b15d6f",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      No new step required.
INFO (llm_agents_fs.LLMAgent) :      🏁 Task completed: {"sample_id": "f9951e3b-103e-41ad-9ebc-b643e54cf375"}
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "6a7f5c8e-3d6e-4d8e-8f9a-0b1c2d3e4f56",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "6a7f5c8e-3d6e-4d8e-8f9a-0b1c2d3e4f56",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "1a2b3c4d-5678-90ef-ghij-klmnopqrstuv",
    "tool_name": "add_more_points_to_samp...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "1a2b3c4d-5678-90ef-ghij-klmnopqrstuv",
    "tool_name": "add_more_points...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "3747d811-4a83-4bc5-a7d8-d718cb59cf9d",
    "tool_name": "monte_carlo_estimate",
...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "3747d811-4a83-4bc5-a7d8-d718cb59cf9d",
    "tool_name": "monte_carlo_est...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      No new step required.
INFO (llm_agents_fs.LLMAgent) :      🏁 Task completed: {"sample_id": "0407bfff-3918-40db-9cf1-f44777272b00"}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"4db769e3-f379-4dee-9d7f-85c3c68102b3","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4","sample_size":3000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"984c9a5c-594b-4054-a06c-d590e1561d90","sample_size":3000000}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: add_more_points_to_sample
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91","sample_size":2000000}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: I need to make the following tool call(s):

{
    "id_": "f1a3d8e9-4c8e-4d8e-8d6e-3d9e8d3c6c6f",
    "tool_name": "monte_carlo_estimate...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: I need to make the following tool call(s):

{
    "id_": "f3a1d6e3-4f5a-4b7c-8d8e-5c8d3f8e4e6c",
    "tool_name": "monte_carlo_estimate...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: I need to make the following tool call(s):

{
    "id_": "d5a8c6f3-3e4b-4f2a-8d9e-0f1a2b3c4d5e",
    "tool_name": "monte_carlo_estimate...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: I need to make the following tool call(s):

{
    "id_": "a11972aa-efe5-4580-84b4-5f2e0fabbc49",
    "tool_name": "monte_carlo_estimate...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: I need to make the following tool call(s):

{
    "id_": "d6e7f8g9-0123-4567-8hij-klmnopqrstuv",
    "tool_name": "monte_carlo_estimate...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "f1a3d8e9-4c8e-4d8e-8d6e-3d9e8d3c6c6f",
    "tool_name": "monte_carlo_estimate",
...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "f1a3d8e9-4c8e-4d8e-8d6e-3d9e8d3c6c6f",
    "tool_name": "monte_carlo_est...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id": "4db769e3-f379-4dee-9d7f-85c3c68102b3"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "f3a1d6e3-4f5a-4b7c-8d8e-5c8d3f8e4e6c",
    "tool_name": "monte_carlo_estimate",
...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "f3a1d6e3-4f5a-4b7c-8d8e-5c8d3f8e4e6c",
    "tool_name": "monte_carlo_est...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"id_": "d5a8c6f3-3e4b-4f2a-8d9e-0f1a2b3c4d5e", "tool_name": "monte_carlo_estimate", "arguments": {"sample_id": "50ae7440-1f7a-4a83-98f6-4...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"id_": "d5a8c6f3-3e4b-4f2a-8d9e-0f1a2b3c4d5e", "tool_name": "monte_carlo_estimate", "arguments": {"sample_id": "50ae7440-1f7a-4a8...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: I need to make the following tool call(s):

{
    "id_": "a11972aa-efe5-4580-84b4-5f2e0fabbc49",
    "tool_name": "monte_carlo_estimate",
...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: I need to make the following tool call(s):

{
    "id_": "a11972aa-efe5-4580-84b4-5f2e0fabbc49",
    "tool_name": "monte_carlo_est...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: {"id_": "d6e7f8g9-0123-4567-8hij-klmnopqrstuv", "tool_name": "monte_carlo_estimate", "arguments": {"sample_id": "984c9a5c-594b-4054-a06c-d...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: {"id_": "d6e7f8g9-0123-4567-8hij-klmnopqrstuv", "tool_name": "monte_carlo_estimate", "arguments": {"sample_id": "984c9a5c-594b-405...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"e5287619-f281-440e-b7de-a4a95babd65e","sample_size":2000000,"estimate":3.14273}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"4db769e3-f379-4dee-9d7f-85c3c68102b3","sample_size":2000000,"estimate":3.142738}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"5059f2c7-c59d-4421-94e5-ae8be5c4e0f4","sample_size":3000000,"estimate":3.141924}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"50ae7440-1f7a-4a83-98f6-4e0ef99bb959","sample_size":2000000,"estimate":3.143482}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"42b00643-7eeb-4f15-b8da-92cf4a8688ea","sample_size":2000000,"estimate":3.142816}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: 🔧 tool: {
    "tool_call_id": "1907106d-d239-46a1-ab94-0cfca4c490fb",
    "content": "{\"sample_id\":\"240e5a0d-9558-467e-91f3-a02fbc6c...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"984c9a5c-594b-4054-a06c-d590e1561d90","sample_size":3000000,"estimate":3.1424106666666667}
INFO (llm_agents_fs.TaskHandler) :      🛠️ Executing Tool Call: monte_carlo_estimate
INFO (llm_agents_fs.TaskHandler) :      ✅ Successful Tool Call: {"sample_id":"091a0481-7dc5-4c3f-abf9-342868ef2b91","sample_size":2000000,"estimate":3.14127}
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 2 million points is 3.14273, which is still outside the target range [3.1415, 3.1425). I need to add more...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate with 2 million points is 3.142738, which is still outside the target range [3.1415, 3.1425). I need to double the sample s...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 3,000,000 points is **3.141924**, which falls within the target range of [3.1415, 3.1425). The estimate i...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The estimate with 2 million points is 3.143482, which is still outside the target range of [3.1415, 3.1425). I need to add more points ...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      ✅ Step Result: The Monte Carlo estimate with 2 million points is 3.142816, which is still outside the target range [3.1415, 3.1425). I need to add mor...[TRUNCATED]
INFO (llm_agents_fs.TaskHandler) :      🧠 New Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>
INFO (llm_agents_fs.TaskHandler) :      ⚙️ Processing Step: <tool_call>
{"name": "monte_carlo_estimate", "arguments": {"sample_id":"240e5a0d-9558-467e-91f3-a02fbc6c1613"}}
</tool_call>

In [30]:

Copied!

# can execute this repeatedly until all handlers are done
[str(h.exception() or h.result()) if h.done() else "Not Done" for h in handlers]
# can execute this repeatedly until all handlers are done
[str(h.exception() or h.result()) if h.done() else "Not Done" for h in handlers]

Out[30]:

['Max steps reached.',
 '{"sample_id": "50ae7440-1f7a-4a83-98f6-4e0ef99bb959"}',
 '{"sample_id": "42b00643-7eeb-4f15-b8da-92cf4a8688ea"}',
 '{"sample_id": "0407bfff-3918-40db-9cf1-f44777272b00"}',
 '{"sample_id": "5059f2c7-c59d-4421-94e5-ae8be5c4e0f4"}',
 '{"sample_id": "984c9a5c-594b-4054-a06c-d590e1561d90"}',
 'Max steps reached.',
 '{"sample_id": "240e5a0d-9558-467e-91f3-a02fbc6c1613"}',
 '{"sample_id": "e5287619-f281-440e-b7de-a4a95babd65e"}',
 '{"sample_id": "f9951e3b-103e-41ad-9ebc-b643e54cf375"}']

(Listing 5.11) Task Success and Trajectory Evaluations of Individual Runs¶

In [31]:

Copied!





import asyncio

task_success_evals = []
trajectory_eval_coros = []
for handler in handlers:
    # task success evaluation
    task_success_evals.append(is_task_success(handler))

    # trajectory evaluation coroutines
    coro = trajectory_judge.structured_output(
        prompt=judge_prompt_template.format(
            result=str(handler.exception() or handler.result()),
            trajectory=handler.rollout,
        ),
        mdl=TrajectoryEvalRubric,
    )
    trajectory_eval_coros.append(coro)

trajectory_evals = await asyncio.gather(*trajectory_eval_coros)
import asyncio

task_success_evals = []
trajectory_eval_coros = []
for handler in handlers:
    # task success evaluation
    task_success_evals.append(is_task_success(handler))

    # trajectory evaluation coroutines
    coro = trajectory_judge.structured_output(
        prompt=judge_prompt_template.format(
            result=str(handler.exception() or handler.result()),
            trajectory=handler.rollout,
        ),
        mdl=TrajectoryEvalRubric,
    )
    trajectory_eval_coros.append(coro)

trajectory_evals = await asyncio.gather(*trajectory_eval_coros)

In [32]:

Copied!

print(task_success_evals)
print(trajectory_evals)
print(task_success_evals)
print(trajectory_evals)

[False, True, True, True, True, True, False, False, True, True]
[TrajectoryEvalRubric(reached_target_precision=False, completed_without_max_steps=False, always_added_points_before_reestimating=False, reused_sample=False, no_false_completion=True, no_missed_completion=True, followed_output_format=False, largest_sample_size=4000000, summary='Failed to reach the target before max steps, repeatedly re-estimated without adding points, created a second sample and used an invalid sample_id, and did not follow the required final output format.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=4000000, summary='Agent correctly created one sample, iteratively doubled points until the estimate 3.142415 was within range, and responded with only the sample_id.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=8000000, summary='Agent correctly created one sample, incrementally added points, reached an estimate rounding to 3.142, and responded with only the sample_id.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=False, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=1000000, summary='Agent created one sample, hit target on first estimate, and returned the correct final JSON format without errors.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=False, followed_output_format=True, largest_sample_size=3000000, summary='Correct tools used and final format followed with sample reuse; however, the agent misread an earlier in-range estimate and continued unnecessarily before ultimately reaching the target.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=3000000, summary='Agent created one sample, iteratively added points and re-estimated, reached target precision, and returned only the sample_id.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=False, always_added_points_before_reestimating=False, reused_sample=False, no_false_completion=False, no_missed_completion=False, followed_output_format=False, largest_sample_size=14000000, summary='Agent reached target precision at 6M points but misread it, repeatedly hallucinated results, created new samples unnecessarily, violated step logic and output format, and ultimately hit max steps.'), TrajectoryEvalRubric(reached_target_precision=False, completed_without_max_steps=True, always_added_points_before_reestimating=False, reused_sample=True, no_false_completion=False, no_missed_completion=True, followed_output_format=True, largest_sample_size=2000000, summary='Agent fabricated tool outputs (hallucination marker), re-estimated without adding points, and falsely claimed success; final JSON format was correct.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=8000000, summary='Agent followed the correct loop using a single sample, progressively added points, stopped upon reaching the target, and returned only the sample_id.'), TrajectoryEvalRubric(reached_target_precision=True, completed_without_max_steps=True, always_added_points_before_reestimating=True, reused_sample=True, no_false_completion=True, no_missed_completion=True, followed_output_format=True, largest_sample_size=1000000, summary='Agent created one sample, obtained 3.1418 (rounds to 3.142), and correctly stopped with only the sample_id.')]

Evaluation Summary¶

In [33]:

Copied!

import pandas as pd

from llm_agents_from_scratch.notebook_utils import set_dataframe_display_options

# sets display options for pd.DataFrame in notebooks
set_dataframe_display_options()
import pandas as pd

from llm_agents_from_scratch.notebook_utils import set_dataframe_display_options

# sets display options for pd.DataFrame in notebooks
set_dataframe_display_options()

In [34]:

Copied!





# shape eval results into a pd.DataFrame
evals_df = pd.DataFrame(
    data=[e.model_dump() for e in trajectory_evals],
)

# add task_success column
evals_df.insert(0, "task_success", task_success_evals)

# separate summary column
summary_df = evals_df[["summary"]].copy()
evals_df = evals_df.drop(columns=["summary"])

# compute aggregations: TOTAL and AVG rows
total_row = {}
avg_row = {}

for col, dtype in evals_df.dtypes.items():
    if dtype == "bool" or pd.api.types.is_numeric_dtype(dtype):
        total_row[col] = evals_df[col].sum()
        avg_row[col] = evals_df[col].mean()
    else:
        total_row[col] = "TOTAL"
        avg_row[col] = "AVG"

# merge evaluations and aggregations dataframes
evals_df = pd.concat(
    [
        evals_df,
        pd.DataFrame([total_row, avg_row], index=["TOTAL", "AVG"]),
    ],
)

# style
evals_df.style.apply(
    lambda r: ["border-top: 2px solid #444"] * len(r)
    if r.name == "TOTAL"
    else [""] * len(r),
    axis=1,
)
# shape eval results into a pd.DataFrame
evals_df = pd.DataFrame(
    data=[e.model_dump() for e in trajectory_evals],
)

# add task_success column
evals_df.insert(0, "task_success", task_success_evals)

# separate summary column
summary_df = evals_df[["summary"]].copy()
evals_df = evals_df.drop(columns=["summary"])

# compute aggregations: TOTAL and AVG rows
total_row = {}
avg_row = {}

for col, dtype in evals_df.dtypes.items():
    if dtype == "bool" or pd.api.types.is_numeric_dtype(dtype):
        total_row[col] = evals_df[col].sum()
        avg_row[col] = evals_df[col].mean()
    else:
        total_row[col] = "TOTAL"
        avg_row[col] = "AVG"

# merge evaluations and aggregations dataframes
evals_df = pd.concat(
    [
        evals_df,
        pd.DataFrame([total_row, avg_row], index=["TOTAL", "AVG"]),
    ],
)

# style
evals_df.style.apply(
    lambda r: ["border-top: 2px solid #444"] * len(r)
    if r.name == "TOTAL"
    else [""] * len(r),
    axis=1,
)

Out[34]:

	task_success	reached_target_precision	completed_without_max_steps	always_added_points_before_reestimating	reused_sample	no_false_completion	no_missed_completion	followed_output_format	largest_sample_size
0	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.000000	4000000.000000
1	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	4000000.000000
2	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	8000000.000000
3	1.000000	1.000000	1.000000	1.000000	0.000000	1.000000	1.000000	1.000000	1000000.000000
4	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	1.000000	3000000.000000
5	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	3000000.000000
6	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	14000000.000000
7	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000	2000000.000000
8	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	8000000.000000
9	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1000000.000000
TOTAL	7.000000	8.000000	8.000000	7.000000	7.000000	8.000000	8.000000	8.000000	48000000.000000
AVG	0.700000	0.800000	0.800000	0.700000	0.700000	0.800000	0.800000	0.800000	4800000.000000

In [35]:

Copied!

summary_df
summary_df

Out[35]:

	summary
0	Failed to reach the target before max steps, repeatedly re-estimated without adding points, created a second sample and used an invalid sample_id, and did not follow the required final output format.
1	Agent correctly created one sample, iteratively doubled points until the estimate 3.142415 was within range, and responded with only the sample_id.
2	Agent correctly created one sample, incrementally added points, reached an estimate rounding to 3.142, and responded with only the sample_id.
3	Agent created one sample, hit target on first estimate, and returned the correct final JSON format without errors.
4	Correct tools used and final format followed with sample reuse; however, the agent misread an earlier in-range estimate and continued unnecessarily before ultimately reaching the target.
5	Agent created one sample, iteratively added points and re-estimated, reached target precision, and returned only the sample_id.
6	Agent reached target precision at 6M points but misread it, repeatedly hallucinated results, created new samples unnecessarily, violated step logic and output format, and ultimately hit max steps.
7	Agent fabricated tool outputs (hallucination marker), re-estimated without adding points, and falsely claimed success; final JSON format was correct.
8	Agent followed the correct loop using a single sample, progressively added points, stopped upon reaching the target, and returned only the sample_id.
9	Agent created one sample, obtained 3.1418 (rounds to 3.142), and correctly stopped with only the sample_id.

In [36]:

Copied!

# write results to json
evals_df.to_json("evals_df.json")
summary_df.to_json("summary_df.json")
# write results to json
evals_df.to_json("evals_df.json")
summary_df.to_json("summary_df.json")

Capstone 1 — Monte Carlo Estimation of Pi¶

Setup Instructions¶

Running an Ollama service¶

Setup¶

Constants¶

LLMs¶

Build Tools¶

(Listing 5.1) Tool: generate_random_sample()¶

Demonstration¶

(Listing 5.2) Tool: add_more_points()¶

Demonstration¶

(Listing 5.3) Tool: monte_carlo_estimate()¶

Demonstration¶

Define the Task¶

(Listing 5.4) Writing the task instruction¶

(Listing 5.5) The Task¶

(Listing 5.6) Creating our LLMAgent¶

Perform the Task¶

Evaluation¶

(Listing 5.7) Evaluating Task Success¶

Trajectory Analysis¶

(Listing 5.8) Rubric for LLM judge¶

(Listing 5.9) LLM judge instruction prompt¶

Replications for a more reliable evaluation¶

(Listing 5.10) Repeated task executions with our LLM agent¶

(Listing 5.11) Task Success and Trajectory Evaluations of Individual Runs¶

Evaluation Summary¶

(Listing 5.1) Tool: `generate_random_sample()`¶

(Listing 5.2) Tool: `add_more_points()`¶

(Listing 5.3) Tool: `monte_carlo_estimate()`¶