天天看点

(zhuan) Prioritized Experience ReplayPrioritized Experience Replay

JAN 26, 2016

<a href="http://arxiv.org/pdf/1511.05952.pdf">Schaul, Quan, Antonoglou, Silver, 2016</a>

Uniform sampling from replay memories is not an efficient way to learn. Rather, using a clever prioritization scheme to label the experiences in replay memory, learning can be carried out much faster and more effectively. However, certain biases are introduced by this non-uniform sampling; hence, weighted importance sampling must be employed in order to correct for this. It is shown through experimentation with the Atari Learning Environment that prioritized sampling with Double DQN significantly outperforms the previous state-of-the-art Atari results.

Implemented Double DQN with main changes being the addition of prioritized experience replay sampling and importance-sampling

Tested on Atari Learning Environment

Lots of insight about the repercussions of this research and plenty of discussion on extensions

The magnitude of the TD-error indicates how unexpected a certain transition was

The TD-error can be a poor estimate about the amount an agent can learn from a transition when rewards are noisy

Problems with greedily selecting experiences:

High-error transitions are replayed too frequently

Low-error transitions are almost entirely ignored

Expensive to update entire replay memory, so errors are only updated for transitions that are replayed

Lack of diversity leads to over-fitting

A stochastic sampling method is introduced which finds a balance between greedy prioritization and random sampling (current method)

Two variants of P(i)=pαi∑kpαkP(i)=piα∑kpkα were studied, where PP is the probability of sampling transition ii, pi&gt;0pi&gt;0 is the priority of transition ii, and the exponent αα determines how much prioritization is used, with α=0α=0 the uniform case

Variant 1: proportional prioritization, where pi=|δi|+ϵpi=|δi|+ϵ is used and ϵϵ is a small positive constant that prevents the edge-case of transitions not being revisited once their error is zero. δδ is the TD-error

Variant 2: rank-based prioritization, with pi=1rank(i)pi=1rank(i) where rank(i)rank(i) is the rank of transition ii when the replay memory is sorted according to δiδi

Key insight The estimation of the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution uncontrollably. This can be corrected by using importance-sampling (IS) weights wi=(1N1P(i))βwi=(1N1P(i))β that fully compensate for the non-uniform probabilities P(i)P(i) if β=1β=1. These weights are folded into the Q-learning update by using wi×δiwi×δi, which is normalized by 1maxiwi1maxiwi

IS is annealed from β0β0 to 1, which means its affect is felt more strongly at the end of the stochastic process; this is because the unbiased nature of the updates in RL is most important near convergence

IS also reduces the gradient magnitudes which is good for optimization; allows the algorithm to follow the curvature of highly non-linear optimization landscapes because the Taylor expansion (gradient descent) is constantly re-approximated