Reinforcement Learning: DDPG, a TensorFlow Implementation

Full code: https://github.com/zle1992/Reinforcement_Learning_Game

Paper: "Continuous control with deep reinforcement learning" https://arxiv.org/pdf/1509.02971.pdf

Deep Deterministic Policy Gradient (DDPG)

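Differences between DDPG and AC (Actor-Critic):

AC:

　　Actor: updates its parameters using td_error, which comes from the Critic

　　Critic: updates its parameters via the Bellman equation of the state-value function value(s)

DDPG:

　　Actor: maximizes q and outputs the action

　　Critic: updates its parameters via the Bellman equation of the action-value function Q(s,a) and outputs the q value

DDPG only applies to continuous action outputs.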

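Logic walkthrough:

1. DDPG is an AC model; its inputs are (S, R, S_, A)

2. Actor

input: (S)

output: a

loss: max(q), i.e. minimize -q

q comes from the Critic

3. Critic

input: S, A

output: q

loss: R + GAMMA * q_ - q (the TD error; the code below minimizes its mean square)

This raises a question: how do we get q_? The Critic network could take (S_, a_) as input and output q_, but we cannot use the same network for both q and q_. Instead we use a time-lagged copy of it, Critic2, which is not trainable.

Critic2 needs a_; how do we get it? The Actor network can take S_ as input and output a_. In the same way, we use a non-trainable copy, Actor2, to obtain a_.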
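The non-trainable copies (Actor2, Critic2) have to follow the trainable networks somehow. The code below does this with a soft update controlled by TAU. As a minimal NumPy sketch of that rule, assuming made-up parameter arrays and a hypothetical helper named soft_update (neither is taken from the repository):

```python
import numpy as np


def soft_update(target_params, eval_params, tau):
    """Return new target parameters: (1 - tau) * target + tau * eval."""
    return [(1.0 - tau) * t + tau * e for t, e in zip(target_params, eval_params)]


# Hypothetical parameters: one weight matrix and one bias vector per network.
eval_params = [np.ones((2, 2)), np.ones(2)]
target_params = [np.zeros((2, 2)), np.zeros(2)]

tau = 0.01
n_updates = 100
for _ in range(n_updates):
    target_params = soft_update(target_params, eval_params, tau)

# With the eval parameters held fixed at 1 and the target starting at 0,
# n updates leave the target at 1 - (1 - tau) ** n.
expected = 1.0 - (1.0 - tau) ** n_updates
print(target_params[0][0, 0], expected)
```

A small tau means the target networks lag far behind the trainable ones, which keeps the q_ used in the critic loss from shifting too quickly.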

Flow

a = actor(s, train)

a_ = actor(s_, not_train)

q = critic(s, a, train)

q_ = critic(s_, a_, not_train)

a_loss = max(q)

c_loss = R + GAMMA * q_ - q
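The flow above can be traced with plain NumPy. This is only a sketch under stated assumptions: the actor and critic here are tiny linear models with random weights, and every name and shape is illustrative rather than taken from the repository. It shows which network produces which quantity, not how the gradients are applied.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9
batch, n_features, n_actions = 4, 3, 1


def actor(s, w):
    # Deterministic policy: tanh keeps each action in [-1, 1].
    return np.tanh(s @ w)


def critic(s, a, w_s, w_a):
    # Q(s, a) as a linear function of state and action; shape (batch, 1).
    return s @ w_s + a @ w_a


# Trainable ("eval") parameters and their frozen ("target") copies.
actor_w = rng.normal(size=(n_features, n_actions))
critic_w_s = rng.normal(size=(n_features, 1))
critic_w_a = rng.normal(size=(n_actions, 1))
actor_w_t = actor_w.copy()
critic_w_s_t = critic_w_s.copy()
critic_w_a_t = critic_w_a.copy()

# A made-up batch of transitions (s, r, s_).
s = rng.normal(size=(batch, n_features))
s_ = rng.normal(size=(batch, n_features))
r = rng.normal(size=(batch, 1))

a = actor(s, actor_w)                             # trainable actor
a_ = actor(s_, actor_w_t)                         # target actor
q = critic(s, a, critic_w_s, critic_w_a)          # trainable critic
q_ = critic(s_, a_, critic_w_s_t, critic_w_a_t)   # target critic

a_loss = -np.mean(q)                   # maximizing q == minimizing -q
q_target = r + GAMMA * q_
c_loss = np.mean((q_target - q) ** 2)  # mean squared TD error
print(a_loss, c_loss)
```

In the TensorFlow code, the critic update feeds the action stored in the replay memory in place of the actor output a, while the actor update uses the actor's own output.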

Code:

DDPG.py

import os
import logging
from abc import ABCMeta, abstractmethod

import numpy as np
import tensorflow as tf

np.random.seed(1)
tf.set_random_seed(1)

# Configure the format and level of the log output.
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')
# The log level is DEBUG, so every log message below is printed to the console.

tfconfig = tf.ConfigProto()
tfconfig.gpu_options.allow_growth = True
session = tf.Session(config=tfconfig)


class DDPG(object):
    """Base class for DDPG; subclasses build the actor and critic networks."""

    __metaclass__ = ABCMeta  # Python 2 style; has no effect under Python 3

    def __init__(self,
                 n_actions,
                 n_features,
                 reward_decay,
                 lr_a,
                 lr_c,
                 memory_size,
                 output_graph,
                 log_dir,
                 model_dir,
                 TAU,
                 a_bound,
                 ):
        super(DDPG, self).__init__()
        self.n_actions = n_actions
        self.n_features = n_features  # list, e.g. [3]
        self.gamma = reward_decay
        self.memory_size = memory_size
        self.output_graph = output_graph
        self.lr_a = lr_a
        self.lr_c = lr_c
        self.log_dir = log_dir
        self.model_dir = model_dir
        # total learning step
        self.learn_step_counter = 0
        self.TAU = TAU  # soft replacement
        self.a_bound = a_bound

        self.s = tf.placeholder(tf.float32, [None] + self.n_features, name='s')
        self.s_next = tf.placeholder(tf.float32, [None] + self.n_features, name='s_next')
        self.r = tf.placeholder(tf.float32, [None, ], name='r')

        with tf.variable_scope('Actor'):
            self.a = self._build_a_net(self.s, scope='eval', trainable=True)
            a_ = self._build_a_net(self.s_next, scope='target', trainable=False)

        with tf.variable_scope('Critic'):
            q = self._build_c_net(self.s, self.a, scope='eval', trainable=True)
            q_ = self._build_c_net(self.s_next, a_, scope='target', trainable=False)

        # networks parameters
        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')

        with tf.variable_scope('train_op_actor'):
            # maximize q by minimizing -q, updating only the actor's eval parameters
            self.loss_actor = -tf.reduce_mean(q)
            self.train_op_actor = tf.train.AdamOptimizer(self.lr_a).minimize(
                self.loss_actor, var_list=self.ae_params)

        with tf.variable_scope('train_op_critic'):
            # r has shape [batch] and q_ has shape [batch, 1]; reshape r so the sum
            # stays [batch, 1] instead of broadcasting to [batch, batch]
            q_target = tf.reshape(self.r, [-1, 1]) + self.gamma * q_
            self.loss_critic = tf.losses.mean_squared_error(labels=q_target, predictions=q)
            self.train_op_critic = tf.train.AdamOptimizer(self.lr_c).minimize(
                self.loss_critic, var_list=self.ce_params)

        # target net replacement: target <- (1 - TAU) * target + TAU * eval
        self.soft_replace = [tf.assign(t, (1 - self.TAU) * t + self.TAU * e)
                             for t, e in zip(self.at_params + self.ct_params,
                                             self.ae_params + self.ce_params)]

        self.sess = tf.Session()
        if self.output_graph:
            tf.summary.FileWriter(self.log_dir, self.sess.graph)
        self.sess.run(tf.global_variables_initializer())

        self.cost_his = [0]
        self.cost = 0

        self.saver = tf.train.Saver()
        if not os.path.exists(self.model_dir):
            os.makedirs(self.model_dir)  # model_dir may be nested

        # restore the latest checkpoint, if there is one
        checkpoint = tf.train.get_checkpoint_state(self.model_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            self.saver.restore(self.sess, checkpoint.model_checkpoint_path)
            print("Loading Successfully")
            self.learn_step_counter = int(checkpoint.model_checkpoint_path.split('-')[-1]) + 1

    @abstractmethod
    def _build_a_net(self, x, scope, trainable):
        raise NotImplementedError

    @abstractmethod
    def _build_c_net(self, s, a, scope, trainable):
        raise NotImplementedError

    def learn(self, data):
        # soft target replacement
        self.sess.run(self.soft_replace)

        batch_memory_s = data['s']
        batch_memory_a = data['a']
        batch_memory_r = data['r']
        batch_memory_s_ = data['s_']

        # actor update: q is computed from the actor's own output for s
        _, cost = self.sess.run(
            [self.train_op_actor, self.loss_actor],
            feed_dict={
                self.s: batch_memory_s,
            })

        # critic update: feeding self.a overrides the actor output with the stored action
        _, cost = self.sess.run(
            [self.train_op_critic, self.loss_critic],
            feed_dict={
                self.s: batch_memory_s,
                self.a: batch_memory_a,
                self.r: batch_memory_r,
                self.s_next: batch_memory_s_,
            })

        self.cost_his.append(cost)
        self.cost = cost
        self.learn_step_counter += 1

        # save the network every 10000 iterations
        if self.learn_step_counter % 10000 == 0:
            self.saver.save(self.sess, self.model_dir, global_step=self.learn_step_counter)

    def choose_action(self, s):
        # add a batch dimension, run the actor, and return the single action
        return self.sess.run(self.a, {self.s: s[np.newaxis, :]})[0]
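One detail worth checking in the critic target r + gamma * q_ is the shapes: the reward placeholder is one-dimensional ([batch]) while the critic output is two-dimensional ([batch, 1]). TensorFlow follows NumPy-style broadcasting, so adding the two directly gives a [batch, batch] array rather than one target per sample. A minimal NumPy sketch of the pitfall and of the reshape that avoids it, with made-up numbers:

```python
import numpy as np

gamma = 0.9
r = np.array([1.0, 2.0, 3.0])                # rewards, shape (3,)
q_next = np.array([[10.0], [20.0], [30.0]])  # target critic output, shape (3, 1)

# Direct addition broadcasts (3,) against (3, 1) into a (3, 3) matrix.
bad_target = r + gamma * q_next

# Reshaping r to a column keeps one TD target per sample.
good_target = r.reshape(-1, 1) + gamma * q_next

print(bad_target.shape)   # (3, 3)
print(good_target.shape)  # (3, 1)
```

Only the column-shaped target pairs each reward with the q_ of the same transition.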

game.py

import os
import sys
import logging

import gym
import numpy as np
import tensorflow as tf

sys.path.append('./')
sys.path.append('model')

from util import Memory, StateProcessor
from DDPG import DDPG
from ACNetwork import ACNetwork

np.random.seed(1)
tf.set_random_seed(1)

# Configure the format and level of the log output.
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')
# The log level is DEBUG, so every log message below is printed to the console.

# An empty value hides all GPUs from CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

tfconfig = tf.ConfigProto()
tfconfig.gpu_options.allow_growth = True
session = tf.Session(config=tfconfig)


class DDPG4Pendulum(DDPG):
    """DDPG with small fully connected actor and critic networks for Pendulum."""

    def __init__(self, **kwargs):
        super(DDPG4Pendulum, self).__init__(**kwargs)

    def _build_a_net(self, s, scope, trainable):
        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
        with tf.variable_scope(scope):
            e1 = tf.layers.dense(inputs=s,
                                 units=30,
                                 bias_initializer=b_initializer,
                                 kernel_initializer=w_initializer,
                                 activation=tf.nn.relu,
                                 trainable=trainable)
            # tanh bounds the raw action to [-1, 1]
            a = tf.layers.dense(inputs=e1,
                                units=self.n_actions,
                                bias_initializer=b_initializer,
                                kernel_initializer=w_initializer,
                                activation=tf.nn.tanh,
                                trainable=trainable)
        # scale the action to [-a_bound, a_bound]
        return tf.multiply(a, self.a_bound, name='scaled_a')

    def _build_c_net(self, s, a, scope, trainable):
        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
        with tf.variable_scope(scope):
            n_l1 = 30
            # the first layer combines the state and the action
            w1_s = tf.get_variable('w1_s', self.n_features + [n_l1], trainable=trainable)
            w1_a = tf.get_variable('w1_a', [self.n_actions, n_l1], trainable=trainable)
            b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)
            net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)
            # Q(s, a), shape [batch, 1]
            q = tf.layers.dense(inputs=net,
                                units=1,
                                bias_initializer=b_initializer,
                                kernel_initializer=w_initializer,
                                activation=None,
                                trainable=trainable)
        return q


batch_size = 32
memory_size = 10000

env = gym.make('Pendulum-v0')  # continuous action space
n_features = [env.observation_space.shape[0]]
n_actions = env.action_space.shape[0]
a_bound = env.action_space.high
env = env.unwrapped

MAX_EP_STEPS = 200


def run():
    RL = DDPG4Pendulum(
        n_actions=n_actions,
        n_features=n_features,
        reward_decay=0.9,
        lr_a=0.001,
        lr_c=0.002,
        memory_size=memory_size,
        TAU=0.01,
        output_graph=False,
        log_dir='Pendulum/log/DDPG4Pendulum/',
        a_bound=a_bound,
        model_dir='Pendulum/model_dir/DDPG4Pendulum/'
    )

    memory = Memory(n_actions, n_features, memory_size=memory_size)

    var = 3  # standard deviation of the exploration noise
    step = 0

    for episode in range(2000):
        # initial observation
        observation = env.reset()
        ep_r = 0

        for j in range(MAX_EP_STEPS):
            # RL chooses an action based on the observation
            action = RL.choose_action(observation)
            # add Gaussian noise for exploration, clipped to Pendulum's action range [-2, 2]
            action = np.clip(np.random.normal(action, var), -2, 2)

            # take the action and get the next observation and reward
            observation_, reward, done, info = env.step(action)
            # print('step:%d---episode:%d----reward:%f---action:%f' % (step, episode, reward, action))

            memory.store_transition(observation, action, reward / 10, observation_)

            # start learning only after more than memory_size steps have been taken
            if step > memory_size:
                var *= .9995  # decay the action randomness
                data = memory.sample(batch_size)
                RL.learn(data)

            # swap observation
            observation = observation_
            ep_r += reward

            if episode > 200:
                env.render()  # render on the screen

            # end the episode after MAX_EP_STEPS steps
            if j == MAX_EP_STEPS - 1:
                print('step: ', step,
                      'episode: ', episode,
                      'ep_r: ', round(ep_r, 2),
                      'var:', var,
                      # 'loss: ', RL.cost
                      )
                break

            step += 1

    # end of game
    print('game over')
    env.close()


def main():
    run()


if __name__ == '__main__':
    main()
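game.py imports Memory from util, which is not listed here. As a rough idea of what such a replay buffer could look like, here is a hypothetical SimpleMemory written only from how game.py and DDPG.learn use it: store_transition(s, a, r, s_) and sample(batch_size) returning a dict with keys 's', 'a', 'r', 's_'. It is an assumption, not the repository's implementation, and it takes n_features as an int rather than the list that game.py passes.

```python
import numpy as np


class SimpleMemory:
    """Hypothetical fixed-size replay buffer; not the repository's util.Memory."""

    def __init__(self, n_actions, n_features, memory_size):
        self.memory_size = memory_size
        self.s = np.zeros((memory_size, n_features), dtype=np.float32)
        self.a = np.zeros((memory_size, n_actions), dtype=np.float32)
        self.r = np.zeros(memory_size, dtype=np.float32)
        self.s_ = np.zeros((memory_size, n_features), dtype=np.float32)
        self.counter = 0

    def store_transition(self, s, a, r, s_):
        # Once the buffer is full, overwrite the oldest entry.
        idx = self.counter % self.memory_size
        self.s[idx] = s
        self.a[idx] = a
        self.r[idx] = r
        self.s_[idx] = s_
        self.counter += 1

    def sample(self, batch_size):
        # Sample uniformly (with replacement) from the filled part of the buffer.
        size = min(self.counter, self.memory_size)
        idx = np.random.choice(size, size=batch_size)
        return {'s': self.s[idx], 'a': self.a[idx],
                'r': self.r[idx], 's_': self.s_[idx]}


memory = SimpleMemory(n_actions=1, n_features=3, memory_size=5)
for t in range(8):
    memory.store_transition(np.full(3, t), [0.5], float(t), np.full(3, t + 1))

batch = memory.sample(4)
print(batch['s'].shape, batch['a'].shape, batch['r'].shape, batch['s_'].shape)
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is why learning in game.py starts only after the buffer has been filled.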
