
Contributing

Consider contributing to T-Bench to become a co-author on our paper.

T-Bench is looking for contributors to help build the first benchmark for AI agents that operate in the command line. Contributors will have the opportunity to become co-authors on a paper which we hope to submit to a top-tier ML venue like NeurIPS. Contributors will be considered eligible for co-authorship after earning 10 points.

There are three primary ways to contribute:

| Contribution | Points Awarded | Description |
| --- | --- | --- |
| Author a Test Instance | 2 points | A complete test instance consists of the task description, environment, solution, and unit tests. Points will be awarded once the task has been approved and merged into main. |
| Build a Complex Environment | 2 points + 1 point per task using it | An example might be a containerized, realistic deployment environment, or a set of complicated APIs for an agent to interact with. Before starting one of these environments, it's best to meet with a project lead to determine its exact scope and whether it will be reusable for enough test instances. |
| Resolve Bountied Issues | 1-5 points | From time to time an issue on GitHub will be tagged "[Bountied - n Points]", where "n" is the number of points awarded for closing the issue. A single-point issue might take an hour or two, like improving documentation; a more advanced issue could require several days of engineering, like setting up an online dashboard for test results or implementing an advanced agent. If you think there's an important feature we're missing, feel free to open an issue and ask the project leads to assign a bounty. |

Points may also be awarded on an ad-hoc basis by project leads as an incentive.

Final authorship and author order will be determined at the discretion of the project leads by taking into account each contributor's point total and other intangible contributions.

Contributors with fewer than 10 points may be thanked in the acknowledgements.

Using AI coding tools

Our goal is to evaluate the performance of language models on these tasks, and overzealous use of coding assistants (e.g. Cursor, Claude Code) may produce tasks that are artificially easy for those models. That being said, these are powerful tools and can help accelerate task creation. Our official policy is that they should be used responsibly and never to generate a task in its entirety. For a good example of how to use Claude Code to create a task, see using-claude-code.

Task Requirements

Note

We might not merge tasks that don't meet our requirements, and we won't be able to give you credit for them!

Looking for inspiration to get you started? Check out our living list of task ideas.

Your task shouldn't be too similar to an existing task. It's your responsibility to check this before you begin writing one. You can browse the existing tasks here. If you're in doubt, please reach out to a project lead.

Tasks should be hard and realistic. The best tasks are real problems that you've personally struggled with before. Think back to a time when you spent an hour or two (or even days!) trying to accomplish some task on your computer.

However, tasks don't necessarily have to be something that you would personally do on the command line. For example, a human might not use the CLI to remove spam from their email (preferring to use the GUI), but an AI agent equipped with an API key might find it most convenient to write a simple script to iterate over all messages.
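
As a hedged illustration, such a script might look like the following; the `mailclient` module and every call on it are invented for this example and don't refer to a real library:

```python
# clean_spam.py -- illustrative sketch only; `mailclient` and its API are
# hypothetical, standing in for whatever API the mail provider exposes.
import os

import mailclient  # hypothetical client library

client = mailclient.connect(api_key=os.environ["MAIL_API_KEY"])

# Iterate over every message and delete the ones flagged as spam --
# a few lines of scripting instead of hundreds of clicks in a GUI.
for message in client.messages():
    if message.is_spam:  # hypothetical spam flag on each message
        client.delete(message)
```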

We're looking for tasks that...

  • Involve compiling complicated packages from source with exotic configurations
  • Require running old code in a new environment
  • Perform operations under constraints (e.g. optimize this code so that it passes all test cases but runs in under X milliseconds, or train a model that gets X% on this benchmark; see the sketch after this list)
  • Identify and fix or exploit security vulnerabilities
  • String together multiple API calls
  • Perform complicated networking tasks
  • Resolve package dependencies
  • Deploy complicated (simulated) cloud environments
  • Manipulate interactive environments like vim
  • Have a potentially dangerous solution that the agent must avoid to pass
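
For the "operations under constraints" idea above, the constraint can be enforced directly in the unit tests. A minimal sketch, in which the `solution` module, the workload, and the concrete time budget are all hypothetical:

```python
# test_constraints.py -- hypothetical sketch for an "optimize this code" task.
import time

from solution import process  # hypothetical module the agent must optimize


def test_correct_and_fast():
    data = list(range(100_000))
    start = time.perf_counter()
    result = process(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert result == sum(data)  # correctness: the optimized code still gives the right answer
    assert elapsed_ms < 50      # constraint: the "under X milliseconds" budget
```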

What we don't need

Through our initial testing we've found that some tasks are not a good fit for the benchmark. While we're excited about all contributions, we will not accept tasks that fall into the following categories:

Tasks that are too easy

We've found that agents can already solve many common CLI tasks, including:

  • Manipulating file directories or changing permissions
  • Answering general knowledge questions (e.g. "echo the name of the 13th President")
  • Resolving common git states

Tasks that require dynamic external dependencies

It's important that a task passes its test cases both on the day you submit it and at any future point in time. For example, "What is the headline of the New York Times today?" and "Update this Pandas script to use the latest version" are weak tasks, since both have different correct solutions depending on when the task is run.
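
One way to keep a task reproducible is to pin exact versions in the environment and test against that fixed state. A minimal sketch, assuming the task's setup installed a pinned dependency (the version number is illustrative):

```python
# test_environment.py -- illustrative only; assumes the task environment ran
# something like `pip install pandas==1.5.3` at build time.
import pandas


def test_pandas_version_is_pinned():
    # The correct final state is fixed when the task is authored, so this
    # test passes identically today and at any future point in time.
    assert pandas.__version__ == "1.5.3"
```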

Tasks that can't be sufficiently tested

Currently we support programmatic PyTest tests. This means that each instance has to have a well-defined final state that passes all tests. For example, a task like "Write the first two levels of the Sierpinski triangle in vim" sounds promising, but would be challenging to evaluate. Which characters should the model use? Does this instruction include the zeroth level? Which scale should the triangles be at?
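
By contrast, a task with a well-defined final state is easy to verify. A minimal sketch of what such a PyTest check might look like (the path and expected content are invented for illustration):

```python
# test_final_state.py -- illustrative sketch, not an actual T-Bench test.
from pathlib import Path


def test_output_file_has_expected_content():
    # There is exactly one correct final state: a specific file with
    # specific content, which can be checked deterministically.
    output = Path("/app/results.txt")
    assert output.exists()
    assert output.read_text().strip() == "42"
```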

In the future we may support LLM-as-a-judge tests which would enable open-ended tasks.
