## Overview
**Reinforcement learning (RL)** has catalyzed the evolution of Large Language Models (LLMs) from simple **Chatbots (Level 1)** to powerful **Reasoners (Level 2)** capable of superhuman performance on complex tasks like mathematics and coding. Models trained with reinforcement learning have demonstrated remarkable abilities to develop complex reasoning strategies through exploration and exploitation, as seen in breakthrough models like DeepSeek's R1, which naturally learns to construct long reasoning chains to solve challenging problems.
The advancement of foundation models has fueled aspirations for true **Agents (Level 3)**. Unlike chatbots and reasoners, agents not only utilize their internal knowledge but also actively explore external environments through autonomous action. Traditional agent approaches primarily rely on human-designed workflows, where models passively interact with environments according to predefined rules. While reasoners have freed us from the burden of prompt engineering through their ability to independently analyze and break down problems, a new question emerges: Can we enable models to independently take actions and explore environments on their own? The intersection of **RL & Agent** reveals this promising frontier—where models learn not just to reason but to act autonomously in complex, dynamic environments.
**Agent-R1** is an open-source framework designed to accelerate research and development at this critical intersection. Our framework employs **End-to-End** reinforcement learning to train agents in specific environments. Developers need only define domain-specific tools and reward functions to extend Agent-R1 to their unique use cases, eliminating the need for complex workflow engineering. We hope our modest contribution can benefit the open-source community, making it easier for researchers and developers to create and explore agents in their own domains, collectively advancing the development of autonomous agents.
## Algorithm
Reinforcement learning for Agents differs significantly from reinforcement learning for standard LLMs (Chatbots, Reasoners). The key distinction is that **a complete Agent trajectory typically involves multi-turn interaction**, requiring **multiple tool calls** to solve a user query. Below, we formalize this distinction using the Markov Decision Process (MDP) framework.
In the context of a standard LLM:

- **State**: Simply the sequence of the input prompt and all generated text so far
- **Action**: Selecting the next token from the vocabulary to add to the sequence
- **Transitions**: Straightforward addition of the selected token to the existing sequence
- **Rewards**: Typically only provided at the end of the sequence generation
For an Agent, the components are more complex:

- **State**: Includes not only the input and generated text, but also all tool responses from previous interactions
- **Action**: Still involves selecting the next token, but some tokens can trigger tool calls
- **Transitions**:
  - Regular tokens: Simply added to the sequence, as in standard LLM generation
  - Tool-triggering tokens: Cause external tool execution that produces responses, **introducing significant stochasticity** into state transitions, unlike the deterministic nature of standard LLM generation
- **Rewards**: Can be provided at multiple points:
  - After each tool call, which **naturally creates process-level rewards** based on the quality and effectiveness of tool usage
  - At the end of the complete interaction, based on overall task completion
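To make this interaction loop concrete, here is a toy rollout in Python. All names here (`parse_tool_call`, the `<tool>`/`<result>` tags, the reward rule) are illustrative assumptions for the sketch, not Agent-R1's actual interfaces:

```python
# Toy sketch of a multi-turn agent rollout (illustrative only, not Agent-R1's
# actual API). The policy generates text until it emits a tool call; the tool
# result is appended to the state, and each call yields a process-level reward.

def parse_tool_call(text):
    """Return (tool_name, query) if the text contains <tool>name:query</tool>."""
    if "<tool>" not in text:
        return None
    body = text.split("<tool>")[1].split("</tool>")[0]
    name, _, query = body.partition(":")
    return name, query

def rollout(policy, tools, prompt, max_turns=4):
    state = prompt        # running sequence: prompt + generations + tool results
    rewards = []          # one process-level reward per tool call
    for _ in range(max_turns):
        action = policy(state)
        state += action
        call = parse_tool_call(action)
        if call is None:              # final answer: episode ends
            break
        name, query = call
        result = tools[name](query)   # stochastic environment transition
        rewards.append(1.0 if result else 0.0)  # toy process-level reward
        state += f"<result>{result}</result>"
    return state, rewards

# Toy policy: call the search tool once, then answer from the tool result.
def policy(state):
    if "<result>" not in state:
        return "<tool>search:capital of France</tool>"
    return "Answer: Paris"

tools = {"search": lambda q: "Paris is the capital of France."}
final_state, rewards = rollout(policy, tools, "Q: What is the capital of France?\n")
```

Note how tool responses become part of the next state, and how each call produces a reward before the episode-level reward would be computed.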
To better understand the difference between the **Agent** and standard **LLM** settings, we provide formal definitions for both:

where:

- $X$ is the sequence of the current prompt
- $C_j$ is the result of the $j$-th tool call, and $m$ is the number of tool calls
- $a_t$ is the token selected from the vocabulary
- $t_j$ is the number of tokens generated between the $(j-1)$-th and $j$-th tool calls, with $0 < t_1 + t_2 + \ldots + t_m < t$
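Under this notation, the agent state at generation step $t$ interleaves generated tokens with tool results. As a sketch consistent with the definitions above (our reconstruction, in case the equation image does not render; the image remains authoritative):

$$s_t = \left(X,\ a_1, \ldots, a_{t_1},\ C_1,\ a_{t_1+1}, \ldots, a_{t_1+t_2},\ C_2,\ \ldots,\ C_m,\ \ldots,\ a_t\right)$$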
This richer reinforcement learning framework allows Agent-R1 to train LLMs that learn effective strategies for when and how to use tools across multi-turn interactions. By optimizing over entire trajectories rather than single responses, we can apply algorithms like PPO, REINFORCE++, and GRPO to develop agents that reason effectively before taking actions.
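One practical consequence of optimizing over whole trajectories, sketched below with hypothetical names rather than Agent-R1's actual code: tool-response tokens were not sampled from the policy, so a common implementation choice is to mask them out of the policy loss.

```python
# Sketch (hypothetical, not Agent-R1's actual code): tokens that came from tool
# responses were not sampled from the policy, so they receive zero weight in the
# policy-gradient loss computed over the whole multi-turn trajectory.

def loss_mask(token_sources):
    """1.0 for policy-generated tokens, 0.0 for prompt and tool-response tokens."""
    return [1.0 if src == "model" else 0.0 for src in token_sources]

# prompt tokens, two generated tokens, a two-token tool response, one more generated token
sources = ["prompt", "model", "model", "tool", "tool", "model"]
mask = loss_mask(sources)
```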
## Key Features
- **Additional Use Cases**: More example implementations across diverse scenarios and domains

This step will:

- Generate embeddings using the BAAI/bge-large-en-v1.5 model
- Build a FAISS index for efficient similarity search
- Save the embeddings and index files in the `data/corpus/hotpotqa` directory
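The embedding and indexing logic can be sketched as follows. This is a simplified stand-in: the real script uses the BAAI/bge-large-en-v1.5 model (e.g., via sentence-transformers) and a FAISS index, while here random vectors and a NumPy inner-product search keep the example self-contained and runnable anywhere:

```python
# Simplified stand-in for the corpus-indexing step: random vectors replace
# bge-large-en-v1.5 embeddings, and a NumPy inner-product search replaces FAISS.

import numpy as np

def build_index(embeddings):
    # Normalize so that inner product equals cosine similarity (the same
    # convention used when pairing normalized embeddings with an
    # inner-product index such as FAISS IndexFlatIP).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def search(index, query_vec, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                 # cosine similarity to every document
    return np.argsort(-scores)[:k]     # indices of the top-k documents

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8)).astype("float32")  # 100 fake "documents"
index = build_index(corpus)
top = search(index, corpus[42], k=1)   # querying with doc 42 retrieves doc 42
```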
#### 4. Run PPO/REINFORCE++/GRPO training with Qwen2.5-1.5B-Instruct
```bash
# Run the PPO training script
bash run_ppo.sh

# Run the REINFORCE++ training script
bash run_rpp.sh

# Run the GRPO training script
bash run_grpo.sh
```
### Results on HotpotQA
Notably, our experiments reveal a striking correlation: EM scores, number of tool calls…
## Extending Agent-R1 with Your Own Tools and Environments
Agent-R1 is designed to be easily extensible, allowing you to create custom tools and environments for your specific use cases. This section outlines the key files and components you need to modify or create.
### Key Components to Extend
1. **Custom Data Processing**
   - Create a new script in `examples/data_preprocess/` following `hotpotqa.py`
   - Implement data download functions (optional, see `download_file()` in `hotpotqa.py`)
   - Create data processing functions to transform raw data into the required format:
     - Create a mapping function (`process_fn()`) to standardize each example
     - Format data with appropriate instruction templates
     - Save processed data as parquet files for training and validation
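A minimal sketch of such a preprocessing script is shown below. The instruction template and field names are illustrative assumptions, not the exact schema used by `hotpotqa.py`:

```python
# Hypothetical preprocessing sketch: map raw examples into a standard format
# and save them as parquet. Field names and the template are illustrative.

import pandas as pd

INSTRUCTION = "Answer the question. You may call the search tool.\n\nQuestion: {q}"

def process_fn(example):
    """Standardize one raw example into the training format."""
    return {
        "prompt": INSTRUCTION.format(q=example["question"]),
        "ground_truth": example["answer"],
    }

raw = [{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}]
df = pd.DataFrame([process_fn(ex) for ex in raw])
# df.to_parquet("data/train.parquet")  # parquet output, as in the step above
```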
2. **Custom Tools**
   - Create a new Python file in `agent_r1/tool/tools/` (e.g., `my_custom_tool.py`)
   - Extend the `Tool` base class from `agent_r1.tool.tool_base`
   - Implement the required methods:
     - `__init__()`: Define tool name, description, and parameter schema
     - `execute()`: Implement the core functionality of your tool
     - `batch_execute()`: Implement batch processing capability if needed
   - Register your tool in `agent_r1/tool/tools/__init__.py` by adding it to the `_default_tools()` function
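A custom tool might look like the sketch below. The real base class lives in `agent_r1.tool.tool_base`; a stand-in with the method names listed above is defined here so the example is self-contained, and its exact attributes are an assumption:

```python
# Hypothetical custom-tool sketch. "Tool" here is a stand-in for
# agent_r1.tool.tool_base.Tool so the example runs on its own.

class Tool:  # stand-in base class (assumed interface)
    name = ""
    description = ""
    parameters = {}

    def execute(self, args):
        raise NotImplementedError

    def batch_execute(self, args_list):
        return [self.execute(a) for a in args_list]

class WordCountTool(Tool):
    name = "word_count"
    description = "Count the words in a piece of text."
    parameters = {  # JSON-schema-style parameter description
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    }

    def execute(self, args):
        # Core functionality: return the word count as a string response.
        return str(len(args["text"].split()))

tool = WordCountTool()
result = tool.execute({"text": "hello agent world"})
```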
3. **Custom Reward Functions**
   - Create a new Python file in `verl/utils/reward_score/` following `qa_em_and_format.py`
   - Create specific scoring functions:
     - Format validation (see `compute_score_format()`, which checks for proper output structure)
     - Answer evaluation (see `compute_score_answer()`, which compares against ground truth)
     - Combined scoring functions (see `compute_score_format_answer()`)
   - Register your reward function in `verl/utils/reward_score/__init__.py`
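A reward module in this spirit could be sketched as below. The tag format, regexes, and weights are illustrative assumptions, not the repository's actual `qa_em_and_format.py`:

```python
# Hypothetical reward-function sketch: format check, exact-match answer check,
# and a weighted combination. Tags and weights are illustrative only.

import re

def compute_score_format(response):
    """1.0 if the output wraps its answer in <answer>...</answer>, else 0.0."""
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.S) else 0.0

def compute_score_answer(response, ground_truth):
    """Exact-match score on the extracted answer (case-insensitive)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def compute_score_format_answer(response, ground_truth, w_format=0.2):
    # Weighted sum: a small format bonus plus the answer score.
    return (w_format * compute_score_format(response)
            + (1 - w_format) * compute_score_answer(response, ground_truth))

score = compute_score_format_answer("<answer>Paris</answer>", "paris")
```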
### Example Workflow
To create a custom application with Agent-R1:
1. Identify the tools your agent will need to accomplish its tasks
2. Implement each tool by extending the `Tool` base class
3. Create appropriate data preprocessing for your specific use case:
   - Download and format your dataset
   - Define appropriate instruction templates
   - Structure data with necessary fields
4. Implement custom reward functions if needed:
   - Define how to extract answers from model outputs
   - Create scoring functions for format validation
   - Implement task-specific evaluation metrics
5. Configure a training script with appropriate parameters
6. Run the training script to train your agent
For detailed implementation guidance, examine the existing code and see the [extending doc](https://github.com/0russwest0/Agent-R1/tree/tmp_readme/docs/extend/extending.md).
## Feedback
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.