Reinforcement Learning: Policy Gradient Notes

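Policy Gradient is one of the most important families of methods in reinforcement learning. Unlike algorithms built on the optimal value function, Policy Gradient algorithms optimize the policy directly for its long-term return: they search for the best policy by following the gradient of an objective function. In the basic (Monte Carlo) form of the algorithm, learning only happens after a whole episode has finished, so the update reflects the entire trajectory rather than a single step. DeepMind's David Silver summarizes policy-based methods in his reinforcement learning lectures as follows: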

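Advantages of policy-based reinforcement learning:

- Better convergence properties

- Effective in high-dimensional and continuous problems

- Can learn stochastic policies

Disadvantages:

- Tends to get stuck in a local optimum

- Evaluating a policy is relatively inefficient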

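Basic Theory

In theory, policy gradient is actually one of the easier methods to understand; after all, we are all very familiar with gradient descent. The key to understanding policy gradient is understanding the objective function. As mentioned earlier, the goal of reinforcement learning is to find a policy that maximizes the expected return. Our objective function is: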

$$
J(\theta) = \int_{x \sim p_\theta(x)} p_\theta(x)\, r(x)\, dx
$$

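Here $x$ is the behavior (it can be a vector, for example the whole sequence of states and actions in one episode), $p_\theta(x)$ is the probability of choosing that behavior, and $r(x)$ is the reward it earns. $J(\theta)$ is therefore the expected return of a whole episode. The goal of the algorithm is to maximize this expected return, and the maximization is carried out by following the gradient of $J(\theta)$ (gradient ascent).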

The gradient of the objective function is:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \int_{x \sim p_\theta(x)} \nabla_\theta p_\theta(x)\, r(x)\, dx \\
&= \int_{x \sim p_\theta(x)} p_\theta(x)\, \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\, r(x)\, dx \\
&= \int_{x \sim p_\theta(x)} p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, r(x)\, dx \\
&= E_{x \sim p_\theta(x)} \left[ \nabla_\theta \log p_\theta(x)\, r(x) \right]
\end{aligned}
$$
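To convince myself of the last step, here is a minimal NumPy sketch that checks the identity on a tiny discrete example. It assumes $p_\theta$ is a softmax over three possible values of $x$; the parameter values, the rewards, and the helper names are all made up for illustration and are not part of the original notes. The expectation form of the gradient is compared against a finite-difference gradient of $J(\theta)$.

```python
import numpy as np

def softmax(theta):
    """Turn unnormalized scores theta into a probability vector p_theta."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def expected_reward(theta, r):
    """J(theta) = sum_x p_theta(x) r(x) for a discrete x."""
    return float(softmax(theta) @ r)

def score_function_gradient(theta, r):
    """E_x[grad log p_theta(x) * r(x)], computed exactly over all x."""
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for x in range(len(theta)):
        # For a softmax, grad_theta log p_theta(x) = onehot(x) - p.
        grad_log_p = -p.copy()
        grad_log_p[x] += 1.0
        grad += p[x] * grad_log_p * r[x]
    return grad

def finite_difference_gradient(theta, r, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        step = np.zeros_like(theta)
        step[k] = eps
        upper = expected_reward(theta + step, r)
        lower = expected_reward(theta - step, r)
        grad[k] = (upper - lower) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])   # hypothetical parameters
r = np.array([1.0, 0.0, 3.0])        # hypothetical reward for each x

print(score_function_gradient(theta, r))
print(finite_difference_gradient(theta, r))
```

The two printed vectors should agree up to the finite-difference error. In a real problem the expectation cannot be enumerated, so it is replaced by an average over sampled episodes.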

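The derivation above is written for the continuous case, but it carries over to the discrete case essentially unchanged. The log-probability term can be analyzed further. The probability of an episode is a product of the initial state distribution, the policy's action probabilities, and the environment's transition probabilities, and only the action probabilities depend on $\theta$, so every other factor vanishes when the gradient of the logarithm is taken: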

$$
\nabla_\theta \log p_\theta(x) = \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_t \mid s_t)
$$

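Writing $r(x)$ as the sum of the per-step rewards and estimating the expectation from $N$ sampled episodes (the constant factor $1/N$ is omitted, since it can be absorbed into the learning rate), the final policy gradient formula is: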

$$
\nabla_\theta J(\theta) = \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=0}^{T} r(s_{i,t}, a_{i,t}) \right) \right]
$$

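This formula has a weakness: the full-episode return $\sum_{t=0}^{T} r(s_{i,t}, a_{i,t})$ multiplies the gradient term of every time step. However, the action taken at a given step can only influence what happens afterwards, not the rewards already collected, so the formula can be refined as follows: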

$$
\nabla_\theta J(\theta) = \sum_{i=1}^{N} \left[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right) \right]
$$
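The difference between the two weightings is easy to see on a toy episode. The sketch below is only an illustration: the reward values and the function names are my own, not taken from any library. It computes the weight that multiplies each step's log-probability gradient under the full-episode version and under the refined version.

```python
import numpy as np

def total_return_weights(rewards):
    """Weight every step by the full-episode return (first formula)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.full(len(rewards), rewards.sum())

def reward_to_go(rewards):
    """Weight step t by sum_{t'=t}^{T} r_{t'} (refined formula)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse, take the running sum, and reverse back.
    return np.cumsum(rewards[::-1])[::-1]

episode_rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical single episode

print(total_return_weights(episode_rewards))  # 8, 8, 8, 8
print(reward_to_go(episode_rewards))          # 8, 7, 7, 5
```

With the refined weights, the last action is credited only with the reward of 5 it could actually have influenced, not with the 3 units earned before it was taken.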

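Looking more closely, this formula still has a problem. In many tasks the summed rewards come out positive for every action, so every sampled action is reinforced, only by different amounts. This dilutes what the reward is supposed to mean. In practical implementations the reward term is therefore shifted by its mean, and often standardized as well.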
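As a small illustration of that preprocessing, the sketch below centers and standardizes a list of returns. The input values and function names are hypothetical, and the small `eps` term is my own guard against dividing by zero.

```python
import numpy as np

def center(returns):
    """Subtract the mean so below-average returns become negative."""
    returns = np.asarray(returns, dtype=np.float64)
    return returns - returns.mean()

def standardize(returns, eps=1e-8):
    """Center, then scale to (approximately) unit standard deviation."""
    centered = center(returns)
    return centered / (centered.std() + eps)

returns = [8.0, 7.0, 7.0, 5.0]   # hypothetical all-positive returns

print(center(returns))        # 1.25, 0.25, 0.25, -1.75
print(standardize(returns))
```

After centering, the step with the lowest return gets a negative weight, so its action is made less likely instead of being reinforced along with everything else.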

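TensorFlow Implementation

Although Policy Gradient is rarely used on its own these days, working through an implementation still helps with understanding. I again read Zhou Mofan's (周莫凡) implementation, so treat this section as a code-reading exercise.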

# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
import numpy as np
import tensorflow as tf

# Fix the random seeds so runs are reproducible.
np.random.seed(1)
tf.set_random_seed(1)


class PolicyGradient:
    def __init__(self,
                 n_actions,
                 n_features,
                 learning_rate=0.01,
                 reward_decay=0.95,
                 output_graph=False):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        # Per-episode buffers: observations, actions, rewards.
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self._build_net()
        self.sess = tf.Session()
        if output_graph:
            tf.summary.FileWriter("logs/", self.sess.graph)
        self.sess.run(tf.global_variables_initializer())

    def _build_net(self):
        with tf.name_scope('inputs'):
            self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name="observations")
            self.tf_acts = tf.placeholder(tf.int32, [None, ], name="actions_num")
            self.tf_vt = tf.placeholder(tf.float32, [None, ], name="actions_value")
        # Hidden layer.
        layer = tf.layers.dense(
            inputs=self.tf_obs,
            units=10,
            activation=tf.nn.tanh,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
            name='fc1'
        )
        # Output layer: one unnormalized score (logit) per action.
        all_act = tf.layers.dense(
            inputs=layer,
            units=self.n_actions,
            activation=None,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
            name='fc2'
        )
        # Softmax turns the logits into action probabilities.
        self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')
        with tf.name_scope('loss'):
            # The one-hot mask keeps only -log p(a_t | s_t) of the action actually taken.
            neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob) * tf.one_hot(self.tf_acts, self.n_actions), axis=1)
            # Weight each step by its processed return; minimizing this loss ascends J(theta).
            loss = tf.reduce_sum(neg_log_prob * self.tf_vt)
        with tf.name_scope('train'):
            self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

    def choose_action(self, observation):
        # Sample an action from the current policy's probabilities.
        prob_weights = self.sess.run(self.all_act_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):
        # Called once per finished episode.
        discounted_ep_rs_norm = self._discount_and_norm_rewards()
        self.sess.run(self.train_op, feed_dict={
            self.tf_obs: np.vstack(self.ep_obs),
            self.tf_acts: np.array(self.ep_as),
            self.tf_vt: discounted_ep_rs_norm,
        })
        # Clear the buffers for the next episode.
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        return discounted_ep_rs_norm

    def _discount_and_norm_rewards(self):
        # Use a float array so that integer rewards are not truncated.
        discounted_ep_rs = np.zeros(len(self.ep_rs), dtype=np.float64)
        running_add = 0.0
        # Walk backwards so that entry t holds the discounted return from t to T.
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add
        # Shift to zero mean, then scale to unit standard deviation.
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        std = np.std(discounted_ep_rs)
        if std > 0:  # avoid dividing by zero when all returns are equal
            discounted_ep_rs /= std
        return discounted_ep_rs

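Start with the _discount_and_norm_rewards function, which contains the mean-shift and standardization logic. Note also that the rewards are accumulated in reverse order, which yields the return from t to T for every step; unlike the formulas above, the code additionally discounts future rewards by gamma.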
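To check that the backward loop really computes the return from t to T, here is a minimal standalone sketch. It restates the loop outside the class and compares it with the explicit sum $\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$; the function names, the reward list, and the value of gamma are my own choices for illustration.

```python
import numpy as np

def discounted_returns_recursive(rewards, gamma):
    """Backward pass: G_t = r_t + gamma * G_{t+1}."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        returns[t] = running_add
    return returns

def discounted_returns_explicit(rewards, gamma):
    """Direct definition: G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}."""
    T = len(rewards)
    return np.array([
        sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        for t in range(T)
    ])

rewards = [1.0, 1.0, 1.0]   # hypothetical episode
gamma = 0.5                 # hypothetical discount factor

print(discounted_returns_recursive(rewards, gamma))  # 1.75, 1.5, 1.0
print(discounted_returns_explicit(rewards, gamma))   # same values
```

The backward pass needs a single sweep over the episode, while the explicit sum is quadratic in the episode length, which is presumably why the implementation uses the former.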

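The TensorFlow network itself is simple: just two fully connected layers. A softmax produces the probability of each action, the probability of the action actually taken is picked out (via the one-hot mask), its logarithm is taken, and the result is multiplied by the processed reward data. The gradient that appears in the formula is never written out explicitly; it is computed by the optimizer when it differentiates this loss during training.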
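To make the loss concrete without running TensorFlow, the following NumPy sketch mirrors the same computation on a hand-made batch. The logits, actions, and weights are invented values, and the analytic gradient with respect to the logits, (softmax - one_hot) * v_t, is a standard softmax identity that I use here only as a sanity check; it is not something the original code computes explicitly.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over action logits."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_gradient_loss(logits, actions, vt):
    """sum_t -log p(a_t | s_t) * v_t, mirroring the 'loss' scope."""
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[actions]
    # Keep only the log-probability of the action actually taken.
    neg_log_prob = np.sum(-np.log(probs) * one_hot, axis=1)
    return float(np.sum(neg_log_prob * vt))

def loss_gradient_wrt_logits(logits, actions, vt):
    """Analytic gradient of the loss: (softmax - one_hot) * v_t per step."""
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[actions]
    return (probs - one_hot) * vt[:, None]

# Hypothetical batch: 3 time steps, 2 actions.
logits = np.array([[0.1, 0.4], [1.0, -1.0], [0.0, 0.0]])
actions = np.array([1, 0, 1])
vt = np.array([1.2, -0.3, -0.9])   # processed returns, may be negative

print(policy_gradient_loss(logits, actions, vt))
print(loss_gradient_wrt_logits(logits, actions, vt))
```

For a step with positive v_t, the gradient entry of the chosen action is negative, so a descent step raises that action's logit; for a negative v_t the sign flips and the action becomes less likely.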

Next I plan to look into DDPG and Actor Critic together with baselines (consider this a promise to follow up on later).

Source: https://juejin.im/entry/5b659cfd6fb9a04f97652d36
