Commit de00bb1 ("update readme")

1 parent 0ad0c8a

File tree: 5 files changed, +161 −150 lines

README.md

Lines changed: 6 additions & 150 deletions
@@ -6,42 +6,7 @@
 
 ## Overview
 
-**Reinforcement learning (RL)** has catalyzed the evolution of Large Language Models (LLMs) from simple **Chatbots (Level 1)** to powerful **Reasoners (Level 2)** capable of superhuman performance on complex tasks like mathematics and coding. Models trained with reinforcement learning have demonstrated remarkable abilities to develop complex reasoning strategies through exploration and exploitation, as seen in breakthrough models like DeepSeek's R1, which naturally learns to construct long reasoning chains to solve challenging problems.
-
-The advancement of foundation models has fueled aspirations for true **Agents (Level 3)**. Unlike chatbots and reasoners, agents not only utilize their internal knowledge but also actively explore external environments through autonomous action. Traditional agent approaches primarily rely on human-designed workflows, where models passively interact with environments according to predefined rules. While reasoners have freed us from the burden of prompt engineering through their ability to independently analyze and break down problems, a new question emerges: can we enable models to independently take actions and explore environments on their own? The intersection of **RL & Agent** reveals this promising frontier, where models learn not just to reason but to act autonomously in complex, dynamic environments.
-
-**Agent-R1** is an open-source framework designed to accelerate research and development at this critical intersection. Our framework employs **End-to-End** reinforcement learning to train agents in specific environments. Developers need only define domain-specific tools and reward functions to extend Agent-R1 to their unique use cases, eliminating the need for complex workflow engineering. We hope our modest contribution can benefit the open-source community, making it easier for researchers and developers to create and explore agents in their own domains, collectively advancing the development of autonomous agents.
-
-## Algorithm
-
-Reinforcement learning for Agents differs significantly from reinforcement learning for standard LLMs (Chatbots, Reasoners). The key distinction is that **a complete Agent trajectory typically involves multi-turn interaction**, requiring **multiple tool calls** to solve user queries. Below, we formalize this distinction using the Markov Decision Process (MDP) framework.
-
-In the context of LLMs:
-- **State**: Simply the sequence of the input prompt and all generated text so far
-- **Action**: Selecting the next token from the vocabulary to add to the sequence
-- **Transitions**: Straightforward addition of the selected token to the existing sequence
-- **Rewards**: Typically only provided at the end of the sequence generation
-
-For Agents, the components are more complex:
-- **State**: Includes not only the input and generated text, but also all tool responses from previous interactions
-- **Action**: Still involves selecting the next token, but some tokens can trigger tool calls
-- **Transitions**:
-  - Regular tokens: Simply add to the sequence, as in standard LLM generation
-  - Tool-triggering tokens: Cause external tool execution that produces responses, **introducing significant stochasticity** into state transitions, unlike the deterministic nature of standard LLM generation
-- **Rewards**: Can be provided at multiple points:
-  - After each tool call, which **naturally creates process-level rewards** based on the quality and effectiveness of tool usage
-  - At the end of the complete interaction, based on overall task completion
-
-To make the difference between the **Agent** setting and the standard **LLM** setting precise, we provide formal definitions for both:
-
-![equation](./image/Equation.png)
-
-where:
-
-- $X$ is the sequence of the current prompt
-- $C_j$ is the result of the $j$-th tool call and $m$ is the number of tool calls
-- $a_t$ is the token selected from the vocabulary
-- $t_j$ is the number of tokens generated between the $(j-1)$-th and the $j$-th tool call, with $0 < t_1 + t_2 + \dots + t_m < t$
-
-This richer reinforcement learning framework allows Agent-R1 to train LLMs that learn effective strategies for when and how to use tools across multi-turn interactions. By optimizing over entire trajectories rather than single responses, we can apply algorithms like PPO, REINFORCE++, and GRPO to develop agents that reason effectively before taking actions.
+**Agent-R1** is an open-source framework designed to accelerate research and development at this critical intersection. Our framework employs **End-to-End** reinforcement learning to train agents in specific environments. Developers need only define domain-specific tools and reward functions to extend Agent-R1 to their unique use cases, eliminating the need for complex workflow engineering. We hope our modest contribution can benefit the open-source community, making it easier for researchers and developers to create and explore agents in their own domains, collectively advancing the development of autonomous agents. For more details on the algorithm, see the [algorithm doc](https://github.com/0russwest0/Agent-R1/tree/tmp_readme/docs/algorithm/algorithm.md).
 
 ## Key Features
 
@@ -57,78 +22,11 @@ This richer reinforcement learning framework allows Agent-R1 to train LLMs that
 - **Additional Use Cases**: More example implementations across diverse scenarios and domains
 
 ## Get Started
+- [Environment Setup](https://github.com/0russwest0/Agent-R1/tree/tmp_readme/docs/getting_started/installation.md)
+- [Quick Start: Try Default Search Tool on HotpotQA](https://github.com/0russwest0/Agent-R1/tree/tmp_readme/docs/getting_started/quickstart.md)
 
-### Environment Setup
-
-**Clone the repository**
-```bash
-git clone https://github.com/0russwest0/Agent-R1.git
-cd Agent-R1
-```
-
-**Install `verl`**
-```bash
-mkdir -p envs
-cd envs
-conda create -n verl python==3.9
-conda activate verl
-# install verl together with some lightweight dependencies in setup.py
-pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
-pip3 install flash-attn --no-build-isolation
-git clone https://github.com/volcengine/verl.git
-cd verl
-pip3 install -e .
-```
-
-### Quick Start: Try Default Search Tool on HotpotQA
-
-#### 1. Install `FlagEmbedding` and `faiss`
-```bash
-pip3 install FlagEmbedding
-pip3 install faiss-cpu
-```
-
-#### 2. Download and preprocess the HotpotQA dataset
-```bash
-# Create data directory
-mkdir -p data/hotpotqa
-
-# Run the preprocessing script
-python examples/data_preprocess/hotpotqa.py --local_dir ./data/hotpotqa
-```
-
-This script will:
-- Download the HotpotQA dataset directly from the source
-- Process the data into the format required by Agent-R1
-- Save the processed data as `train.parquet` and `validation.parquet` in the specified directory
-
-#### 3. Build the HotpotQA search index
-```bash
-# Download the corpus file (gzipped)
-mkdir -p data/corpus/hotpotqa
-wget https://huggingface.co/datasets/BeIR/hotpotqa/resolve/main/corpus.jsonl.gz -O data/corpus/hotpotqa/corpus.jsonl.gz
-
-# Extract the gzipped file
-gunzip -c data/corpus/hotpotqa/corpus.jsonl.gz > data/corpus/hotpotqa/hpqa_corpus.jsonl
-
-# Process the corpus and build the search index
-python scripts/hotpotqa_search/process_hotpotqa.py
-```
-
-This script will:
-- Load the corpus data
-- Generate embeddings using the BAAI/bge-large-en-v1.5 model
-- Build a FAISS index for efficient similarity search
-- Save the embeddings and index files in the `data/corpus/hotpotqa` directory
-
-#### 4. Run PPO/REINFORCE++/GRPO training with Qwen2.5-1.5B-Instruct
-```bash
-# Run the PPO training script
-bash run_ppo.sh
-# Run the REINFORCE++ training script
-bash run_rpp.sh
-# Run the GRPO training script
-bash run_grpo.sh
-```
 
 ### Results on HotpotQA
 
@@ -150,57 +48,15 @@ Notably, our experiments reveal a striking correlation: EM scores, number of too
 
 ## Extending Agent-R1 with Your Own Tools and Environments
 
-Agent-R1 is designed to be easily extensible, allowing you to create custom tools and environments for your specific use cases. This section outlines the key files and components you need to modify or create.
-
-### Key Components to Extend
-
-1. **Custom Data Processing**
-   - Create a new script in `examples/data_preprocess/` following `hotpotqa.py`
-   - Implement data download functions (optional; see `download_file()` in `hotpotqa.py`)
-   - Create data processing functions to transform raw data into the required format:
-     - Create a mapping function (`process_fn()`) to standardize each example
-     - Format data with appropriate instruction templates
-     - Save processed data as parquet files for training and validation
-
-2. **Custom Tools**
-   - Create a new Python file in `agent_r1/tool/tools/` (e.g., `my_custom_tool.py`)
-   - Extend the `Tool` base class from `agent_r1.tool.tool_base`
-   - Implement the required methods:
-     - `__init__()`: Define tool name, description, and parameter schema
-     - `execute()`: Implement the core functionality of your tool
-     - `batch_execute()`: Implement batch processing capability if needed
-   - Register your tool in `agent_r1/tool/tools/__init__.py` by adding it to the `_default_tools()` function
-
-3. **Custom Reward Functions**
-   - Create a new Python file in `verl/utils/reward_score/` following `qa_em_and_format.py`
-   - Create specific scoring functions:
-     - Format validation (see `compute_score_format()`, which checks for proper output structure)
-     - Answer evaluation (see `compute_score_answer()`, which compares against ground truth)
-     - Combined scoring functions (see `compute_score_format_answer()`)
-   - Register your reward function in `verl/utils/reward_score/__init__.py`
-
-### Example Workflow
-
-To create a custom application with Agent-R1:
-
-1. Identify the tools your agent will need to accomplish its tasks
-2. Implement each tool by extending the `Tool` base class
-3. Create appropriate data preprocessing for your specific use case:
-   - Download and format your dataset
-   - Define appropriate instruction templates
-   - Structure data with necessary fields
-4. Implement custom reward functions if needed:
-   - Define how to extract answers from model outputs
-   - Create scoring functions for format validation
-   - Implement task-specific evaluation metrics
-5. Configure a training script with appropriate parameters
-6. Run the training script to train your agent
+**Extending Agent-R1** is straightforward: create **custom tools** by extending the `Tool` base class, implement **data preprocessing** scripts to format your dataset, and define **reward functions** for task-specific evaluation. Register these components in their respective directories, and configure a training script to adapt Agent-R1 to your use case.
 
 For detailed implementation guidance, examine the existing code:
 - Tools: `agent_r1/tool/tools/calculator_tool.py`, `search_tool.py`
 - Data processing: `examples/data_preprocess/hotpotqa.py`
 - Reward functions: `verl/utils/reward_score/qa_em_and_format.py`
 
+See the [extending doc](https://github.com/0russwest0/Agent-R1/tree/tmp_readme/docs/extend/extending.md) for details.
+
 ## Feedback
 We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.

docs/algorithm/algorithm.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
## Algorithm

Reinforcement learning for Agents differs significantly from reinforcement learning for standard LLMs (Chatbots, Reasoners). The key distinction is that **a complete Agent trajectory typically involves multi-turn interaction**, requiring **multiple tool calls** to solve user queries. Below, we formalize this distinction using the Markov Decision Process (MDP) framework.

In the context of LLMs:
- **State**: Simply the sequence of the input prompt and all generated text so far
- **Action**: Selecting the next token from the vocabulary to add to the sequence
- **Transitions**: Straightforward addition of the selected token to the existing sequence
- **Rewards**: Typically only provided at the end of the sequence generation

For Agents, the components are more complex:
- **State**: Includes not only the input and generated text, but also all tool responses from previous interactions
- **Action**: Still involves selecting the next token, but some tokens can trigger tool calls
- **Transitions**:
  - Regular tokens: Simply add to the sequence, as in standard LLM generation
  - Tool-triggering tokens: Cause external tool execution that produces responses, **introducing significant stochasticity** into state transitions, unlike the deterministic nature of standard LLM generation
- **Rewards**: Can be provided at multiple points:
  - After each tool call, which **naturally creates process-level rewards** based on the quality and effectiveness of tool usage
  - At the end of the complete interaction, based on overall task completion

To make the difference between the **Agent** setting and the standard **LLM** setting precise, we provide formal definitions for both:

![equation](../../image/Equation.png)

where:

- $X$ is the sequence of the current prompt
- $C_j$ is the result of the $j$-th tool call and $m$ is the number of tool calls
- $a_t$ is the token selected from the vocabulary
- $t_j$ is the number of tokens generated between the $(j-1)$-th and the $j$-th tool call, with $0 < t_1 + t_2 + \dots + t_m < t$

This richer reinforcement learning framework allows Agent-R1 to train LLMs that learn effective strategies for when and how to use tools across multi-turn interactions. By optimizing over entire trajectories rather than single responses, we can apply algorithms like PPO, REINFORCE++, and GRPO to develop agents that reason effectively before taking actions.
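
The dynamics above can be sketched as a toy rollout loop. Everything below is invented for illustration (the `<tool_call>` token, the stand-in policy and tool, and the reward values are not Agent-R1's actual interface); it only shows how tool responses enter the state and how process-level rewards arise.

```python
import random

# Toy sketch of the agent MDP: the "policy" emits actions; a special token
# triggers a tool call whose (possibly stochastic) response is appended to
# the state, and a small process-level reward is granted per call.

TOOL_TOKEN = "<tool_call>"

def fake_policy(state: str) -> str:
    # Stand-in for pi_theta: call the tool once, then give a final answer.
    return "final_answer" if TOOL_TOKEN in state else TOOL_TOKEN

def fake_tool(state: str) -> str:
    # Stochastic environment transition (here: a random lookup result).
    return random.choice(["result_a", "result_b"])

def rollout(prompt: str, max_turns: int = 8):
    state, rewards = [prompt], []
    for _ in range(max_turns):
        action = fake_policy(" ".join(state))
        state.append(action)
        if action == TOOL_TOKEN:
            state.append(fake_tool(" ".join(state)))  # C_j enters the state
            rewards.append(0.1)                       # process-level reward
        else:
            rewards.append(1.0)                       # terminal outcome reward
            break
    return state, rewards

trajectory, rewards = rollout("question?")
print(rewards)  # → [0.1, 1.0]
```

RL algorithms such as PPO or GRPO would then optimize over the whole `trajectory`, not just the final answer.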

docs/extend/extending.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
## Extending Agent-R1 with Your Own Tools and Environments

Agent-R1 is designed to be easily extensible, allowing you to create custom tools and environments for your specific use cases. This section outlines the key files and components you need to modify or create.

### Key Components to Extend

1. **Custom Data Processing**
   - Create a new script in `examples/data_preprocess/` following `hotpotqa.py`
   - Implement data download functions (optional; see `download_file()` in `hotpotqa.py`)
   - Create data processing functions to transform raw data into the required format:
     - Create a mapping function (`process_fn()`) to standardize each example
     - Format data with appropriate instruction templates
     - Save processed data as parquet files for training and validation

2. **Custom Tools**
   - Create a new Python file in `agent_r1/tool/tools/` (e.g., `my_custom_tool.py`)
   - Extend the `Tool` base class from `agent_r1.tool.tool_base`
   - Implement the required methods:
     - `__init__()`: Define tool name, description, and parameter schema
     - `execute()`: Implement the core functionality of your tool
     - `batch_execute()`: Implement batch processing capability if needed
   - Register your tool in `agent_r1/tool/tools/__init__.py` by adding it to the `_default_tools()` function

3. **Custom Reward Functions**
   - Create a new Python file in `verl/utils/reward_score/` following `qa_em_and_format.py`
   - Create specific scoring functions:
     - Format validation (see `compute_score_format()`, which checks for proper output structure)
     - Answer evaluation (see `compute_score_answer()`, which compares against ground truth)
     - Combined scoring functions (see `compute_score_format_answer()`)
   - Register your reward function in `verl/utils/reward_score/__init__.py`
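
To make item 2 concrete, here is a sketch of a custom tool. The `Tool` base class below is a self-contained stand-in (the real one lives in `agent_r1.tool.tool_base` and its exact interface may differ), and the word-count tool itself is hypothetical.

```python
# Stand-in for agent_r1.tool.tool_base.Tool; the real base class's
# interface may differ.
class Tool:
    name: str = ""
    description: str = ""
    parameters: dict = {}

    def execute(self, args: dict) -> str:
        raise NotImplementedError

    def batch_execute(self, args_list: list) -> list:
        # Default batch behavior: execute each call independently.
        return [self.execute(args) for args in args_list]

class WordCountTool(Tool):
    """Hypothetical custom tool that counts words in a piece of text."""

    def __init__(self):
        self.name = "word_count"
        self.description = "Count the words in the given text."
        self.parameters = {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        }

    def execute(self, args: dict) -> str:
        return str(len(args["text"].split()))

tool = WordCountTool()
print(tool.execute({"text": "multi turn tool use"}))  # → 4
```

In the real framework, the tool would then be added to `_default_tools()` in `agent_r1/tool/tools/__init__.py` so the agent can call it during rollouts.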

### Example Workflow

To create a custom application with Agent-R1:

1. Identify the tools your agent will need to accomplish its tasks
2. Implement each tool by extending the `Tool` base class
3. Create appropriate data preprocessing for your specific use case:
   - Download and format your dataset
   - Define appropriate instruction templates
   - Structure data with necessary fields
4. Implement custom reward functions if needed:
   - Define how to extract answers from model outputs
   - Create scoring functions for format validation
   - Implement task-specific evaluation metrics
5. Configure a training script with appropriate parameters
6. Run the training script to train your agent
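
Step 4 might look like the following minimal sketch, in the spirit of `qa_em_and_format.py`. The `<answer>` tag convention and the 0.2/0.8 weights are assumptions for illustration, not the repo's actual scheme.

```python
import re

# Sketch of a reward function: check output format, then exact-match
# answer correctness. Tag convention and score weights are illustrative.

def compute_score_format(response: str) -> float:
    """1.0 if the response wraps its answer in <answer>...</answer>."""
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.DOTALL) else 0.0

def compute_score_answer(response: str, ground_truth: str) -> float:
    """Exact match between the extracted answer and the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def compute_score_format_answer(response: str, ground_truth: str) -> float:
    """Weighted combination of format and answer scores."""
    return (0.2 * compute_score_format(response)
            + 0.8 * compute_score_answer(response, ground_truth))

print(compute_score_format_answer("<answer>Paris</answer>", "paris"))  # → 1.0
```

A well-formed but wrong answer would score 0.2 here, giving the agent a gradient toward correct formatting even before it answers correctly.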

For detailed implementation guidance, examine the existing code:
- Tools: `agent_r1/tool/tools/calculator_tool.py`, `search_tool.py`
- Data processing: `examples/data_preprocess/hotpotqa.py`
- Reward functions: `verl/utils/reward_score/qa_em_and_format.py`
docs/getting_started/installation.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
### Environment Setup

**Clone the repository**
```bash
git clone https://github.com/0russwest0/Agent-R1.git
cd Agent-R1
```

**Install `verl`**
```bash
mkdir -p envs
cd envs
conda create -n verl python==3.9
conda activate verl
# install verl together with some lightweight dependencies in setup.py
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
git clone https://github.com/volcengine/verl.git
cd verl
pip3 install -e .
```

docs/getting_started/quickstart.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
### Quick Start: Try Default Search Tool on HotpotQA

#### 1. Install `FlagEmbedding` and `faiss`
```bash
pip3 install FlagEmbedding
pip3 install faiss-cpu
```

#### 2. Download and preprocess the HotpotQA dataset
```bash
# Create data directory
mkdir -p data/hotpotqa

# Run the preprocessing script
python examples/data_preprocess/hotpotqa.py --local_dir ./data/hotpotqa
```

This script will:
- Download the HotpotQA dataset directly from the source
- Process the data into the format required by Agent-R1
- Save the processed data as `train.parquet` and `validation.parquet` in the specified directory

#### 3. Build the HotpotQA search index
```bash
# Download the corpus file (gzipped)
mkdir -p data/corpus/hotpotqa
wget https://huggingface.co/datasets/BeIR/hotpotqa/resolve/main/corpus.jsonl.gz -O data/corpus/hotpotqa/corpus.jsonl.gz

# Extract the gzipped file
gunzip -c data/corpus/hotpotqa/corpus.jsonl.gz > data/corpus/hotpotqa/hpqa_corpus.jsonl

# Process the corpus and build the search index
python scripts/hotpotqa_search/process_hotpotqa.py
```

This script will:
- Load the corpus data
- Generate embeddings using the BAAI/bge-large-en-v1.5 model
- Build a FAISS index for efficient similarity search
- Save the embeddings and index files in the `data/corpus/hotpotqa` directory
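
Conceptually, the index built above answers nearest-neighbor queries over embedding vectors. The stdlib-only sketch below illustrates that retrieval operation with made-up 4-dimensional vectors; the real script uses BAAI/bge-large-en-v1.5 embeddings and FAISS for efficiency, not this code.

```python
# Toy illustration of what the search index computes: rank corpus
# embeddings by inner-product similarity to a query embedding and
# return the indices of the top-k matches. Vectors here are invented.

def top_k(query, corpus, k=2):
    """Return indices of the k corpus vectors most similar to the query."""
    scores = [(sum(q * c for q, c in zip(query, vec)), i)
              for i, vec in enumerate(corpus)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

corpus = [
    [1.0, 0.0, 0.0, 0.0],  # doc 0
    [0.9, 0.1, 0.0, 0.0],  # doc 1: similar to doc 0
    [0.0, 0.0, 1.0, 0.0],  # doc 2: unrelated
]
print(top_k([1.0, 0.0, 0.0, 0.0], corpus))  # → [0, 1]
```

FAISS performs the same ranking with approximate-search data structures so it scales to the full HotpotQA corpus.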

#### 4. Run PPO/REINFORCE++/GRPO training with Qwen2.5-1.5B-Instruct
```bash
# Run the PPO training script
bash run_ppo.sh
# Run the REINFORCE++ training script
bash run_rpp.sh
# Run the GRPO training script
bash run_grpo.sh
```
