DD-PPO: Learning Near-Perfect PointGoal Navigators From 2.5 Billion Frames

Paper Link: https://arxiv.org/pdf/1911.00357

In Decentralized Distributed Proximal Policy Optimization (DD-PPO), each worker alternates between collecting experience in a resource-intensive, GPU-accelerated simulated environment and optimizing the model

To avoid delays due to stragglers, a preemption threshold is used: once a pre-specified percentage of the other workers have finished collecting experience, the stragglers are forced to end their experience collection early, and all workers then begin optimizing the model

Homogeneous workload: sim time is almost equivalent for all environments
Heterogeneous workload: sim time can vary due to large differences in environment complexity

In both cases DD-PPO scales near-linearly

Preliminaries: Reinforcement Learning and PPO

Decentralized Distributed Proximal Policy Optimization

Core problem: How do you distribute RL training across many machines?

Usual way in RL is asynchronous: a central “parameter server” holds the master copy of the network, and many “rollout workers” each play the game and send experience or gradients back whenever they’re ready

This approach is fragile: even small bugs can cause confusing crashes

Read about allreduce in DDP

The paper talks about adapting allreduce to Reinforcement Learning

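At step $k$, worker $n$ has a copy of the parameters, $\theta^k_n$, calculates the gradient, $\partial\theta^k_n$, and updates $\theta$ via

$$\theta^{k+1}_n = \mathrm{ParamUpdate}\left(\theta^k_n,\ \mathrm{AllReduce}\left(\partial\theta^k_1, \ldots, \partial\theta^k_N\right)\right) = \mathrm{ParamUpdate}\left(\theta^k_n,\ \frac{1}{N}\sum_{i=1}^{N} \partial\theta^k_i\right)$$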

Each worker:

Has its own copy of the policy parameters
Collects a rollout - goes and acts in its environment using its current policy to gather experience
Computes a PPO gradient from that experience
AllReduces the gradient with all the other workers (averaging)
Updates parameters
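
The per-worker loop above can be sketched in plain Python. This is only a toy illustration, not the paper's implementation: `allreduce_mean` simulates AllReduce inside a single process, `sgd_update` stands in for the parameter update (plain SGD is an assumption made for brevity), and `make_grad_fn` is a hypothetical placeholder for "collect a rollout and compute a PPO gradient"

```python
from typing import Callable, List

Vector = List[float]


def allreduce_mean(grads: List[Vector]) -> Vector:
    """Simulated AllReduce: return the element-wise mean of all workers' gradients."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def sgd_update(params: Vector, grad: Vector, lr: float) -> Vector:
    """Stand-in for the parameter update (assumption: plain SGD)."""
    return [p - lr * g for p, g in zip(params, grad)]


def make_grad_fn(target: Vector) -> Callable[[Vector], Vector]:
    """Hypothetical per-worker gradient: d/dtheta of 0.5 * (theta - target)^2.

    In DD-PPO this would be the PPO gradient computed from the worker's rollout.
    """
    return lambda params: [p - t for p, t in zip(params, target)]


def dd_step(
    worker_params: List[Vector],
    grad_fns: List[Callable[[Vector], Vector]],
    lr: float = 0.1,
) -> List[Vector]:
    """One synchronous step: local gradients, averaged AllReduce, identical update."""
    grads = [fn(params) for fn, params in zip(grad_fns, worker_params)]
    averaged = allreduce_mean(grads)  # every worker receives the same average
    return [sgd_update(params, averaged, lr) for params in worker_params]


if __name__ == "__main__":
    workers = [[0.0], [0.0]]  # every worker starts from the same parameters
    fns = [make_grad_fn([1.0]), make_grad_fn([3.0])]
    for _ in range(3):
        workers = dd_step(workers, fns)
    print(workers)
```

Because every worker applies the same averaged gradient to the same starting parameters, the copies stay identical without any parameter server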

In supervised learning, every gradient computation takes around the same time, so synchronizing is easy

In RL this breaks because different environments take wildly different amounts of time to simulate

Preemption threshold: The paper implements a fix where, instead of waiting for all workers, they wait until some p% of workers have finished collecting experience (~60% worked well for them); at that point the stragglers are forced to stop their rollouts early

Then allreduce proceeds

2 things to keep in mind:

All workers’ contributions to the loss are weighted equally (even those of workers that were cut short)
A worker must collect at least 1/4 of the maximum number of steps before it can be preempted
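
The preemption rule and its minimum-steps floor can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's code: `collect_with_preemption`, `steps_per_tick`, `sync_fraction` and `min_fraction` are hypothetical names, and time is modelled as discrete ticks in which worker i collects `steps_per_tick[i]` steps

```python
from typing import List


def collect_with_preemption(
    steps_per_tick: List[int],
    max_steps: int,
    sync_fraction: float = 0.6,
    min_fraction: float = 0.25,
) -> List[int]:
    """Return the rollout length each simulated worker ends up with.

    steps_per_tick[i] is a made-up speed for worker i and must be positive.
    """
    n = len(steps_per_tick)
    collected = [0] * n
    min_steps = int(max_steps * min_fraction)  # floor before preemption is allowed
    while True:
        num_done = sum(c >= max_steps for c in collected)
        threshold_hit = num_done >= sync_fraction * n
        any_active = False
        for i, speed in enumerate(steps_per_tick):
            if collected[i] >= max_steps:
                continue  # finished a full rollout
            if threshold_hit and collected[i] >= min_steps:
                continue  # straggler is preempted: rollout ends early
            collected[i] = min(max_steps, collected[i] + speed)
            any_active = True
        if not any_active:
            return collected  # all workers stopped, allreduce can proceed


if __name__ == "__main__":
    print(collect_with_preemption([8, 8, 8, 4, 1], max_steps=128))
```

In this toy run the three fast workers finish full rollouts, the medium worker is cut off as soon as 60% are done, and the slowest worker keeps going only until it reaches the one-quarter floor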

Ayush Garg
