Policy Gradient Algorithms

Special case – Generalized L_R-I
• Consider binary bandit problems with arbitrary rewards

Reinforcement Comparison
• Set baseline to average of observed rewards
• Softmax action selection

Reinforcement Comparison contd.
Computation of characteristic eligibility for softmax action selection

Continuous Actions
• Use a Gaussian distribution to select actions
• For suitable choice of parameters:
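The characteristic eligibility, the gradient of ln π with respect to the policy parameters, has a closed form for both softmax and Gaussian action selection. The following is a minimal Python sketch under stated assumptions, not the slides' own formulation: the function names, the step size `alpha`, the seed, and the two-armed Bernoulli test problem are all hypothetical choices made for illustration. It assumes the reinforcement comparison update takes the form alpha * (r - baseline) * eligibility, with the baseline kept as a running average of observed rewards.

```python
import math
import random


def softmax(prefs):
    """Action probabilities from a list of action preferences."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_eligibility(prefs, action):
    """d ln pi(action) / d prefs[b] = 1[b == action] - pi(b)."""
    probs = softmax(prefs)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(len(prefs))]


def gaussian_eligibility(mu, sigma, action):
    """Gradients of ln N(action; mu, sigma^2) with respect to mu and sigma."""
    d_mu = (action - mu) / sigma ** 2
    d_sigma = ((action - mu) ** 2 - sigma ** 2) / sigma ** 3
    return d_mu, d_sigma


def reinforcement_comparison(reward_fn, n_actions, steps=5000, alpha=0.1, seed=0):
    """Softmax policy updated by alpha * (r - baseline) * eligibility."""
    rng = random.Random(seed)
    prefs = [0.0] * n_actions
    baseline, count = 0.0, 0
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action, rng)
        elig = softmax_eligibility(prefs, action)
        for b in range(n_actions):
            prefs[b] += alpha * (reward - baseline) * elig[b]
        # Baseline is the running average of all rewards observed so far.
        count += 1
        baseline += (reward - baseline) / count
    return softmax(prefs)


def bernoulli_arms(action, rng):
    """Hypothetical binary bandit: arm 1 pays 1 more often than arm 0."""
    means = [0.2, 0.8]
    return 1.0 if rng.random() < means[action] else 0.0


# Final action probabilities after learning on the toy bandit.
final_probs = reinforcement_comparison(bernoulli_arms, n_actions=2)
```

Subtracting the baseline does not change the expected update, because the softmax eligibilities sum to zero over actions; it only reduces the variance of the update.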
MC Policy Gradient
• Samples are entire trajectories s_0, a_0, r_1, s_1, a_1, ..., s_T
• Evaluation criterion is the return along the path, instead of immediate rewards
• The gradient estimation equation becomes:
  where R_i(s_0) is the return starting from state s_0 and p_i(s_0; θ) is the probability of the i-th trajectory, starting from s_0 and using the policy given by θ.

MC Policy Gradient contd.
• The "likelihood ratio" in this case evaluates to:
• The estimate depends on the starting state s_0. One way to address this problem is to assume a fixed initial state.
• A more common assumption is to use the average reward formulation.

MC Policy Gradient contd.
• Recall:
  – Maximize average reward per time step:
  – Unichain assumption: a single "recurrent" class of states
  – The average reward is then state independent
  – Recurrent class: starting from any state in the class, the probability of visiting all the states in the class is 1.

MC Policy Gradient contd.
• Assumption 1: For every policy under consideration, the unichain assumption is satisfied, with the same set of recurrent states.
• Pick one recurrent state i*. Trajectories are defined as starting and ending at this recurrent state.
• Assumption 2: Bounded rewards.
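The trajectory-based gradient estimate from the MC Policy Gradient slides can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the slides' own equations: it takes the simpler fixed-initial-state option (not the average reward formulation), uses a tabular softmax policy, and assumes the standard likelihood-ratio form in which each trajectory contributes its return times the sum of ∇ ln π(a_t | s_t; θ) along the path. The names `sample_trajectory`, `mc_policy_gradient`, and `two_step_env` are hypothetical, as is the toy environment.

```python
import math
import random


def softmax(prefs):
    """Action probabilities from a list of action preferences."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, step_fn, s0, rng):
    """Roll out one trajectory; return its return R and the summed score."""
    # score[s][b] accumulates d ln pi(a_t | s_t) / d theta[s][b] over the path.
    score = [[0.0] * len(row) for row in theta]
    ret, state, done = 0.0, s0, False
    while not done:
        probs = softmax(theta[state])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        for b in range(len(probs)):
            score[state][b] += (1.0 if b == action else 0.0) - probs[b]
        reward, state, done = step_fn(state, action, rng)
        ret += reward
    return ret, score


def mc_policy_gradient(theta, step_fn, s0, n_traj=20000, seed=0):
    """Average of (return * summed score) over n_traj sampled trajectories."""
    rng = random.Random(seed)
    grad = [[0.0] * len(row) for row in theta]
    for _ in range(n_traj):
        ret, score = sample_trajectory(theta, step_fn, s0, rng)
        for s, row in enumerate(score):
            for b, g in enumerate(row):
                grad[s][b] += ret * g / n_traj
    return grad


def two_step_env(state, action, rng):
    """Hypothetical toy problem: states 0 -> 1 -> end, reward 1 for action 1."""
    reward = 1.0 if action == 1 else 0.0
    return reward, state + 1, state == 1


# Gradient estimate at a uniform policy, starting from the fixed state 0.
grad = mc_policy_gradient([[0.0, 0.0], [0.0, 0.0]], two_step_env, s0=0)
```

For this toy problem the expected return is π(1 | 0) + π(1 | 1), so at the uniform policy the exact gradient with respect to each preference for action 1 is 0.25, and the estimate should land close to that value. Because every trajectory starts at `s0`, the estimate is tied to that starting state, which is the dependence the slides raise before turning to the average reward formulation.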