Test your agent
This tutorial will guide you through testing your agent's performance on T-Bench.
Introduction
T-Bench tasks are complex environments with pre-configured lifecycles, including custom startup and teardown instructions. The T-Bench harness handles this complexity for you, so you can focus on writing your agent.
Integration Approaches
There are two approaches to integrating your own agent into T-Bench:
- Install and run in the task container
- Implement the `BaseAgent` interface
Install and run in the task container
The most straightforward approach is to install your agent as a package in the task container.
This will affect your agent's performance
While this approach is the simplest, T-Bench task containers were not designed with agent installation in mind, and installation will fail on some of them (e.g. those with broken networking or volume constraints).
We've provided an `AbstractInstalledAgent` class that you can implement to install and run your agent in the task container. Implementing the interface involves defining an installation script, the execution command, and environment variables.
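For orientation, a subclass might look roughly like the sketch below. This is a minimal sketch, not the canonical implementation: the import path and the member names (`_env`, `_install_agent_script_path`, `_run_agent_command`) are assumptions, so check the `AbstractInstalledAgent` definition in the harness for the actual interface.

```python
from pathlib import Path

# The import path and member names below are assumptions for illustration;
# consult the AbstractInstalledAgent class in the harness for the real interface.
from t_bench.agents import AbstractInstalledAgent


class MyInstalledAgent(AbstractInstalledAgent):
    @property
    def _env(self) -> dict[str, str]:
        # Environment variables your agent needs inside the task container.
        return {"MY_AGENT_API_KEY": "..."}

    @property
    def _install_agent_script_path(self) -> Path:
        # Shell script the harness runs in the container to install your agent.
        return Path(__file__).parent / "install-my-agent.sh"

    def _run_agent_command(self, task_description: str) -> list[str]:
        # Command the harness executes in the container to start your agent.
        return ["my-agent", "run", "--task", task_description]
```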
We recommend this approach only as a quick-and-dirty way to evaluate your agent, not for production or public use.
Implement the `BaseAgent` interface
Implementing the `BaseAgent` interface is more robust than installing your agent in the task container, but, depending on your agent's implementation, it can require a bit of surgery.
The `BaseAgent` interface only requires you to implement one method.
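A skeleton is sketched below for orientation. The method name and signature (a `perform_task` method that receives the task description and the task container) are assumptions based on the description above; the authoritative signature is in the `BaseAgent` class itself.

```python
# Skeleton sketch only: the import path, method name, and signature are
# assumptions; override whatever abstract method BaseAgent actually declares.
from t_bench.agents import BaseAgent


class MyAgent(BaseAgent):
    def perform_task(self, task_description, container):
        # Run your agent's planning/execution loop here, issuing shell
        # commands inside the task container (see the examples below).
        ...
```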
Most agents use custom tools to interact with their environment. These tools can typically be implemented as bash commands that can be executed using `container.exec_run` in Python or `docker exec <container-id> <command>` in the terminal.
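For example, a simple file-listing tool reduces to a single bash command run inside the task container. The snippet below is a minimal sketch using the Docker SDK for Python; the container ID placeholder is something you wire in yourself, as described below.

```python
import docker

# Look up the task container by ID via the local Docker daemon.
client = docker.from_env()
container = client.containers.get("<container-id>")

# A "tool" implemented as a single bash command executed in the container.
exit_code, output = container.exec_run(["bash", "-c", "ls -la /"])
print(exit_code, output.decode())
```

The terminal equivalent is `docker exec <container-id> bash -c "ls -la /"`.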
To integrate your agent by implementing the `BaseAgent` interface, you'll need to:
- Replace any `<command>` your agent executes with `container.exec_run("<command>")` in Python or `docker exec <container-id> <command>` in the terminal.
- Pass the container ID to your agent. We recommend setting it as an environment variable (a sketch of both steps follows this list).
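As a concrete sketch of both steps: the environment variable name (`T_BENCH_CONTAINER_ID`) and the `run_command` helper below are illustrative, not part of the harness.

```python
import os

import docker

# Step 2: read the task container's ID from an environment variable set by
# whatever launches your agent. The variable name here is illustrative.
container_id = os.environ["T_BENCH_CONTAINER_ID"]
container = docker.from_env().containers.get(container_id)


def run_command(command: str) -> tuple[int, str]:
    # Step 1: execute the command in the task container instead of locally,
    # i.e. in place of something like subprocess.run(command, shell=True).
    exit_code, output = container.exec_run(["bash", "-c", command])
    return exit_code, output.decode()


print(run_command("echo hello from the task container"))
```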
We are working on drop-in replacements for common libraries like `subprocess` to make this process as easy as possible.
Want help?
We are also willing to pair-program with you to integrate your agent. Email mikeam@cs.stanford.edu or alex@laude.org (or both!) if you'd like help.