Analysis of the EpisodicLifeEnv environment wrapper in baselines

As the title says, this post looks at the following wrapper:

import gym


class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        gym.Wrapper.__init__(self, env)
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs

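The EpisodicLifeEnv wrapper targets environments in which the player has several lives. The number of lives remaining in the game is read with lives = self.env.unwrapped.ale.lives(). Every lost life is reported to the learner as the end of an episode, but the underlying game is truly reset only on a real game over.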
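To make the effect on a training loop concrete, here is a minimal sketch that runs without gym or an Atari ROM. FakeAtariEnv is a hypothetical stub (not part of gym or baselines) that loses one life every few frames, and EpisodicLife copies the wrapper logic above without the gym.Wrapper base class. Both names, and the parameters start_lives and frames_per_life, are assumptions made for illustration.

```python
class FakeAtariEnv:
    """Hypothetical stub: loses one life every `frames_per_life` steps."""

    def __init__(self, start_lives=3, frames_per_life=4):
        self.start_lives = start_lives
        self.frames_per_life = frames_per_life
        self.unwrapped = self  # mimics env.unwrapped
        self.ale = self        # mimics env.unwrapped.ale
        self.real_resets = 0

    def lives(self):
        return self._lives

    def reset(self):
        self.real_resets += 1
        self._lives = self.start_lives
        self._t = 0
        return self._t

    def step(self, action):
        self._t += 1
        if self._t % self.frames_per_life == 0:
            self._lives -= 1
        done = self._lives == 0  # this stub reports done as soon as lives hit 0
        return self._t, 0.0, done, {}


class EpisodicLife:
    """Same logic as EpisodicLifeEnv, minus the gym.Wrapper base class."""

    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        if self.was_real_done:
            obs = self.env.reset()
        else:
            obs, _, _, _ = self.env.step(0)  # no-op step after a lost life
        self.lives = self.env.unwrapped.ale.lives()
        return obs


base = FakeAtariEnv(start_lives=3, frames_per_life=4)
wrapped = EpisodicLife(base)

episode_lengths = []
for _ in range(6):
    wrapped.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = wrapped.step(0)
        steps += 1
    episode_lengths.append(steps)

print(episode_lengths, base.real_resets)
```

In this stub the learner sees six episodes, while the underlying game is reset only twice, once per three lives. Episodes that start after a lost life are one step shorter because the no-op step inside reset consumes a frame.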

The part of the code that most needs explaining is:

        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True

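As the comment explains, in Qbert the environment still returns done = False at the moment the remaining lives drop to 0; it takes a few more frames before done = True is reported. Suppose we changed the condition: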

        if lives < self.lives and lives > 0:

to:

        if lives < self.lives and lives >= 0:

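With that change, the obs, reward, done, info returned by step on the frame where lives reaches 0 would be treated as the last frame of an episode, and the caller would then invoke reset, which takes this branch: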

        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)

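At that point self.was_real_done is False while lives = self.env.unwrapped.ale.lives() is already 0, so reset only performs a no-op step instead of truly resetting the game. The remaining game-over frames then form an extra, spurious episode, and reset has to be called again once the environment finally reports done = True. The lives > 0 guard avoids this: the last life ends only once, when the environment itself signals done.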
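The difference between the two conditions can be traced with a small simulation. This is a sketch under stated assumptions: LaggyEnv is a hypothetical Qbert-like stub (not a real gym environment) that reports done only `lag` frames after lives reaches 0, and the allow_zero flag is an illustrative switch between the `lives > 0` and `lives >= 0` guards.

```python
class LaggyEnv:
    """Hypothetical stub: done arrives `lag` frames after lives hits 0."""

    def __init__(self, start_lives=2, frames_per_life=3, lag=2):
        self.start_lives = start_lives
        self.frames_per_life = frames_per_life
        self.lag = lag
        self.unwrapped = self  # mimics env.unwrapped
        self.ale = self        # mimics env.unwrapped.ale

    def lives(self):
        return self._lives

    def reset(self):
        self._lives = self.start_lives
        self._t = 0
        self._zero_frames = 0
        return self._t

    def step(self, action):
        self._t += 1
        if self._lives > 0:
            if self._t % self.frames_per_life == 0:
                self._lives -= 1
        else:
            self._zero_frames += 1  # frames spent at lives == 0
        done = self._lives == 0 and self._zero_frames >= self.lag
        return self._t, 0.0, done, {}


class EpisodicLife:
    """Wrapper logic without gym; allow_zero=True mimics `lives >= 0`."""

    def __init__(self, env, allow_zero):
        self.env = env
        self.allow_zero = allow_zero
        self.lives = 0
        self.was_real_done = True
        self.reset_log = []  # which branch each reset() call took

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        lives = self.env.unwrapped.ale.lives()
        floor = 0 if self.allow_zero else 1  # lives >= 1 is lives > 0
        if lives < self.lives and lives >= floor:
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        if self.was_real_done:
            self.reset_log.append("real")
            obs = self.env.reset()
        else:
            self.reset_log.append("noop")
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs


def play_one_game(allow_zero):
    """Run episodes until the stub reports a true game over."""
    wrapped = EpisodicLife(LaggyEnv(), allow_zero)
    lengths = []
    while True:
        wrapped.reset()
        steps, done = 0, False
        while not done:
            _, _, done, _ = wrapped.step(0)
            steps += 1
        lengths.append(steps)
        if wrapped.was_real_done:
            return wrapped.reset_log, lengths


print(play_one_game(allow_zero=False))
print(play_one_game(allow_zero=True))
```

With the original guard, the two lives in this stub give two episodes. With `lives >= 0`, the last life is cut short when lives reaches 0, and the leftover game-over frames become a third, one-step episode entered through a no-op reset.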

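Of course, this Qbert issue can also be handled with a different modification, one that ends the episode on every lost life and lets reset decide from self.lives whether a true reset is needed: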

import gym


class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        gym.Wrapper.__init__(self, env)
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives:
            # the lives > 0 guard is dropped here: any lost life, including
            # the last one, ends the episode, and reset() decides from
            # self.lives whether a true reset is needed.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        # if self.was_real_done:
        if self.lives == 0:
            obs = self.env.reset(**kwargs)
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs
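The behaviour of this modified version can be checked against a scripted trace. The sketch below is an illustration under assumptions, not the baselines code: ScriptedEnv is a hypothetical stub that replays a fixed list of (lives, done) pairs, one per frame, and EpisodicLifeVariant copies the modified logic without the gym.Wrapper base class.

```python
class ScriptedEnv:
    """Hypothetical stub that replays one (lives, done) pair per frame."""

    def __init__(self, script, start_lives):
        self.script = script
        self.start_lives = start_lives
        self.unwrapped = self  # mimics env.unwrapped
        self.ale = self        # mimics env.unwrapped.ale
        self.real_resets = 0
        self.frame = 0

    def lives(self):
        return self._lives

    def reset(self):
        self.real_resets += 1
        self._lives = self.start_lives
        self.frame = 0
        return self.frame

    def step(self, action):
        self._lives, done = self.script[self.frame]
        self.frame += 1
        return self.frame, 0.0, done, {}


class EpisodicLifeVariant:
    """The modified logic: reset() looks at self.lives, not was_real_done."""

    def __init__(self, env):
        self.env = env
        self.lives = 0

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives:
            done = True  # every lost life, including the last, is terminal
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        if self.lives == 0:
            obs = self.env.reset()
        else:
            obs, _, _, _ = self.env.step(0)  # no-op step after a lost life
        self.lives = self.env.unwrapped.ale.lives()
        return obs


# Qbert-like trace: lives reaches 0 two frames before done is reported.
script = [(2, False), (1, False), (1, False),
          (0, False), (0, False), (0, True)]
base = ScriptedEnv(script, start_lives=2)
wrapped = EpisodicLifeVariant(base)

episode_lengths = []
for _ in range(2):
    wrapped.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = wrapped.step(0)
        steps += 1
    episode_lengths.append(steps)

frames_used = base.frame  # frames consumed when the last life is lost
wrapped.reset()           # self.lives == 0, so this is a true reset
print(episode_lengths, frames_used, base.real_resets)
```

In this trace the true reset happens right after the frame where lives reaches 0, so the two trailing lives == 0 frames are never played. One design consequence worth noting: this version relies on lives reaching 0 before every true game over. If an environment reported done while lives remained, reset would take the no-op branch.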
