Test your agent
This tutorial will guide you through testing your agent's performance on T-Bench.
Introduction
T-Bench tasks are complex environments with pre-configured lifecycles, including custom startup and teardown instructions. The T-Bench harness handles this complexity for you, so you can focus on writing your agent.
Integration Approaches
There are two approaches to integrating your own agent into T-Bench:
- Install and run in the task container
- Implement the `BaseAgent` interface
Install and run in the task container
The most straightforward approach is to install your agent as a package in the task container.
This will affect your agent's performance
While this approach is the simplest, T-Bench task containers were not designed with agent installation in mind, and installation will fail on some of them (e.g. those with broken networking or volume constraints).
We've provided an `AbstractInstalledAgent` class that you can implement to install and run your agent in the task container. Implementing the interface involves defining an installation script, the execution command, and environment variables.
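For orientation, a subclass might look roughly like the sketch below. This is a minimal sketch, not the canonical implementation: the import path and the member names (`_env`, `_install_agent_script_path`, `_run_agent_command`) are assumptions, so check the `AbstractInstalledAgent` definition in the harness for the actual interface.

```python
from pathlib import Path

# The import path and member names below are assumptions for illustration;
# consult the AbstractInstalledAgent class in the harness for the real interface.
from t_bench.agents import AbstractInstalledAgent


class MyInstalledAgent(AbstractInstalledAgent):
    @property
    def _env(self) -> dict[str, str]:
        # Environment variables your agent needs inside the task container.
        return {"MY_AGENT_API_KEY": "..."}

    @property
    def _install_agent_script_path(self) -> Path:
        # Shell script the harness runs in the container to install your agent.
        return Path(__file__).parent / "install-my-agent.sh"

    def _run_agent_command(self, task_description: str) -> list[str]:
        # Command the harness executes in the container to start your agent.
        return ["my-agent", "run", "--task", task_description]
```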
We recommend this approach only as a quick-and-dirty way to evaluate your agent, not for production or public use.
Implement the `BaseAgent` interface
Implementing the `BaseAgent` interface is more robust than installing your agent in the task container, but, depending on your agent's implementation, it can require a bit of surgery.
The `BaseAgent` interface only requires you to implement one method.
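A skeleton is sketched below for orientation. The method name and signature (a `perform_task` method that receives the task description and the task container) are assumptions based on the description above; the authoritative signature is in the `BaseAgent` class itself.

```python
# Skeleton sketch only: the import path, method name, and signature are
# assumptions; override whatever abstract method BaseAgent actually declares.
from t_bench.agents import BaseAgent


class MyAgent(BaseAgent):
    def perform_task(self, task_description, container):
        # Run your agent's planning/execution loop here, issuing shell
        # commands inside the task container (see the examples below).
        ...
```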
Most agents use custom tools to interact with their environment. These tools can typically be implemented as bash commands that can be executed using `container.exec_run` in Python or `docker exec <container-id> <command>` in the terminal.
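For example, a simple file-listing tool reduces to a single bash command run inside the task container. The snippet below is a minimal sketch using the Docker SDK for Python; the container ID placeholder is something you wire in yourself, as described below.

```python
import docker

# Look up the task container by ID via the local Docker daemon.
client = docker.from_env()
container = client.containers.get("<container-id>")

# A "tool" implemented as a single bash command executed in the container.
exit_code, output = container.exec_run(["bash", "-c", "ls -la /"])
print(exit_code, output.decode())
```

The terminal equivalent is `docker exec <container-id> bash -c "ls -la /"`.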
To integrate your agent by implementing the `BaseAgent` interface, you'll need to:
- Replace any `<command>` your agent executes with `container.exec_run("<command>")` in Python or `docker exec <container-id> <command>` in the terminal.
- Pass the container ID to your agent. We recommend setting it as an environment variable (a sketch of both steps follows this list).
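As a concrete sketch of both steps: the environment variable name (`T_BENCH_CONTAINER_ID`) and the `run_command` helper below are illustrative, not part of the harness.

```python
import os

import docker

# Step 2: read the task container's ID from an environment variable set by
# whatever launches your agent. The variable name here is illustrative.
container_id = os.environ["T_BENCH_CONTAINER_ID"]
container = docker.from_env().containers.get(container_id)


def run_command(command: str) -> tuple[int, str]:
    # Step 1: execute the command in the task container instead of locally,
    # i.e. in place of something like subprocess.run(command, shell=True).
    exit_code, output = container.exec_run(["bash", "-c", command])
    return exit_code, output.decode()


print(run_command("echo hello from the task container"))
```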
We are working on drop-in replacements for common libraries like `subprocess` to make this process as easy as possible.
Want help?
We are also willing to pair-program with you to integrate your agent. Email mikeam@cs.stanford.edu or alex@laude.org (or both!) if you'd like help.