Paper Link: https://arxiv.org/pdf/1911.00357
In Decentralized Distributed Proximal Policy Optimization (DD-PPO), each worker alternates between collecting experience in a resource-intensive, GPU-accelerated simulated environment and optimizing the model.
To avoid delays due to stragglers, a preemption threshold is used: once a pre-specified percentage of the other workers have finished collecting experience, the stragglers are forced to end their experience collection early, and all workers then begin optimizing the model.
Two workload cases are considered:
- Sim time is almost equivalent for all environments
- Sim time can vary due to large differences in environment complexity
In both cases DD-PPO scales near-linearly
Preliminaries: Reinforcement Learning and PPO
Decentralized Distributed Proximal Policy Optimization
Core problem: How do you distribute RL training across many machines?
Usual way in RL - asynchronous: A central “parameter server” holds the master copy of the network, and many “rollout workers” each go play the game and send experience or gradients back whenever they’re ready
This approach is fragile: even small bugs can cause confusing crashes
Read about allreduce in DDP
The paper talks about adapting allreduce to Reinforcement Learning
At each step, each worker:
- Has its own copy of the policy parameters
- Collects a rollout - goes and acts in its environment using its current policy to gather experience
- Computes a PPO gradient from that experience
- AllReduces the gradient with all the other workers (averaging)
- Updates parameters
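The loop above can be sketched as a single-process simulation. This is a minimal sketch and not the paper's implementation: `fake_ppo_gradient` is a hypothetical stand-in for "collect a rollout and compute the PPO gradient", `allreduce_mean` imitates an averaging allreduce with a plain Python function, and the update is plain gradient descent with an assumed learning rate.

```python
import random


def allreduce_mean(grads):
    """Average the per-worker gradient vectors element-wise.

    In a real allreduce every worker ends up holding this same result;
    here it is simulated with a single list.
    """
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]


def fake_ppo_gradient(params, rng):
    """Hypothetical placeholder: each worker sees different experience,
    so each worker gets a different (noisy) gradient."""
    return [p + rng.uniform(-1.0, 1.0) for p in params]


def train(num_workers=4, num_steps=3, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Every worker starts with its own identical copy of the parameters.
    workers = [[1.0, -2.0, 0.5] for _ in range(num_workers)]
    for _ in range(num_steps):
        # Each worker computes a gradient from its own rollout.
        grads = [fake_ppo_gradient(params, rng) for params in workers]
        # Synchronization point: gradients are averaged across workers.
        avg = allreduce_mean(grads)
        # Each worker applies the same averaged gradient to its own copy.
        workers = [[p - lr * g for p, g in zip(params, avg)]
                   for params in workers]
    return workers


final_params = train()
```

Because every worker starts from the same parameters and applies the same averaged gradient, the copies stay identical without any parameter server.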
In Supervised Learning every gradient computation takes around the same time so synchronizing is easy
In RL this breaks because different environments can take wildly different amounts of time to simulate, so a synchronous update has to wait for the slowest worker
Preemption threshold: The paper implements a fix where, instead of waiting for all workers, they wait until some p% of workers have finished collecting experience (~60% worked well for them); at that point the stragglers are forced to stop their rollouts early
Then allreduce proceeds
2 things to keep in mind:
- All workers’ contributions are weighted the same (even those whose rollouts were cut short)
- A worker must collect at least 1/4 of the maximum number of steps before it can be preempted
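To make the timing effect concrete, here is a small simulation of one rollout phase with a preemption threshold and the 1/4 minimum. It is a sketch under assumptions: the function name, the integer `time_per_step` cost model, and the example numbers are all hypothetical, and it only models how many steps each worker collects, not the learning update.

```python
def rollout_with_preemption(time_per_step, max_steps, threshold_percent=60):
    """Simulate one experience-collection phase.

    time_per_step: hypothetical integer cost of one env step per worker.
    Returns (steps collected per worker, wall-clock time of the phase).
    """
    n = len(time_per_step)
    # Number of workers that must finish before stragglers are preempted
    # (integer ceiling of threshold_percent% of n, at least one worker).
    needed = max(1, (threshold_percent * n + 99) // 100)
    # Time at which the `needed`-th fastest worker completes its full rollout.
    finish_times = sorted(max_steps * t for t in time_per_step)
    cutoff = finish_times[needed - 1]
    # A worker cannot be preempted before collecting 1/4 of max_steps.
    min_steps = max_steps // 4
    steps = []
    for t in time_per_step:
        done_at_cutoff = min(max_steps, cutoff // t)
        steps.append(max(done_at_cutoff, min_steps))
    # The phase ends when the last worker stops collecting.
    phase_time = max(s * t for s, t in zip(steps, time_per_step))
    return steps, phase_time


# Three fast workers, one slower worker, one very slow straggler.
costs = [1, 1, 1, 2, 10]
steps, phase_time = rollout_with_preemption(costs, max_steps=128)
full_steps, full_time = rollout_with_preemption(costs, max_steps=128,
                                                threshold_percent=100)
```

With these made-up costs, the 60% threshold gives `steps == [128, 128, 128, 64, 32]` and `phase_time == 320`, while waiting for every worker (`threshold_percent=100`) takes 1280 time units. The slowest worker still runs past the cutoff here because of the 1/4 minimum.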