Some details of OpenAI's Retro Contest

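In 2018 OpenAI ran a competition called the Retro Contest. Its purpose was to measure transfer performance in reinforcement learning on OpenAI's own retro library. The contest ended years ago, but its setup is still worth understanding.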

Contest details, from the original page:

https://contest.openai.com/2018-1/details/#get-roms

===========================================================

Training games for the contest:

You are free to train your agent however you'd like, but we recommend using Sonic 1, 2, and 3 & Knuckles, which are available on Steam here:

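Sonic the Hedgehog, Sonic the Hedgehog 2, and Sonic 3 & Knuckles. (These three games must be purchased and downloaded on Steam. For details see the post "How to run Sonic the Hedgehog (SonicTheHedgehog-Genesis) with retro, OpenAI's Python library for emulating classic 2D games".)

Training set (different states taken from the three games above, used as starting states for training):

Validation set (different states taken from the three games above, used as starting states for validation):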

============================================================

Evaluation environment:

Hardware:

6 E5-2690v3 cores, 56GB of RAM, and a single K80 GPU

Training time:

12 hours of time you should average ~43ms per timestep to get to 1 million timesteps within the limit

So the budget is about 43 ms per step: 12 hours is 43,200 s, and 43,200 s / 1,000,000 timesteps ≈ 43 ms.

The environment runs at ~1000 frames per second for a single core with random agent, meaning 1ms per frame, or 4ms per timestep, leaving you 39ms for your processing.

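The agent keeps training at test time, and selecting actions at test time also costs compute. A random agent, which needs neither, interacts with the environment about 1000 times per second (1000 frames). Those 1000 frames correspond to 1000 / 4 = 250 timesteps, i.e. 4 ms per timestep. So a random agent that does no action computation or training spends about 4 ms per timestep, which leaves roughly 43 - 4 = 39 ms per timestep for action selection and training.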
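The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the numbers quoted from the contest page; the constant names are made up for illustration.

```python
# Timing budget, using only the figures quoted above.
TIME_LIMIT_HOURS = 12
TOTAL_TIMESTEPS = 1_000_000
ENV_FRAMES_PER_SECOND = 1000  # random agent on a single core
FRAMES_PER_TIMESTEP = 4       # frameskip

# Wall-clock time available per timestep, in milliseconds.
budget_ms = TIME_LIMIT_HOURS * 3600 * 1000 / TOTAL_TIMESTEPS
# Time the emulator itself needs for one timestep (4 frames at 1 ms each).
env_ms = FRAMES_PER_TIMESTEP * 1000 / ENV_FRAMES_PER_SECOND
# What is left for action selection and training.
agent_ms = budget_ms - env_ms

print(f"budget per timestep: {budget_ms:.1f} ms")
print(f"emulator per timestep: {env_ms:.1f} ms")
print(f"left for the agent: {agent_ms:.1f} ms")
```

The exact figures are 43.2 ms and 39.2 ms; the contest page rounds them to 43 ms and 39 ms.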

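One point that is easy to miss: why are 1000 frames only 250 timesteps, and where does the factor of 4 come from? The answer is frameskip. After the agent outputs an action, that action is applied for 4 consecutive emulator frames rather than one, and the agent receives only one observation for those 4 frames. So 1000 frames of interaction with the emulator produce only 250 observations, and each of them counts as one timestep. (Stacking several consecutive observations as the network input, as is commonly done, does not change this count.)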

The environment is stochastic in that it has sticky frameskip. While normal frameskip always repeats an action n times, sticky frameskip occasionally repeats an action n+1 times. When this happens, the following action is repeated one fewer times, since it is delayed by an extra frame. For the contest, sticky frameskip repeats an action an extra time with probability 0.25.

The action repetition in frameskip can itself be randomized. For details see "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents": https://arxiv.org/pdf/1709.06009.pdf
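To make the sticky frameskip rule concrete, here is a minimal simulation of which action is applied on each frame. It follows the quoted description (every timestep covers n frames, and a new action is delayed by one frame with probability 0.25). It is not the contest's wrapper code, and the function name sticky_frame_actions and its parameters are hypothetical.

```python
import random

def sticky_frame_actions(actions, n=4, stick_prob=0.25, rng=None):
    """Expand one action per timestep into the action applied on every frame.

    Each timestep always covers n frames. On the first frame of a timestep the
    previous action is kept with probability stick_prob, so the previous action
    runs one extra frame and the new action runs one frame fewer.
    """
    rng = rng or random.Random()
    frames = []
    current = None
    for action in actions:
        for i in range(n):
            if current is None:
                current = action      # very first frame: nothing to stick to
            elif i == 0:
                if rng.random() >= stick_prob:
                    current = action  # not sticky: switch immediately
            elif i == 1:
                current = action      # a delayed action always starts here
            frames.append(current)
    return frames

# Three timesteps with actions A, B, C and a fixed seed for repeatability.
print(sticky_frame_actions(["A", "B", "C"], rng=random.Random(0)))
```

With stick_prob=0 this reduces to ordinary frameskip, where every action is repeated exactly n times.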

Your agent is allowed to learn (adjust its weights, use a replay buffer, etc) during test time, although a separate copy of the agent will be run on each test level.

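The agent is trained during every test run, but the agent used on each test level is completely independent: each test level starts from its own full copy of the trained agent.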

In addition you are limited to 4,500 timesteps per episode, corresponding to 18,000 frames or 5 minutes of real time at 60 fps.

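The episode length is capped both in training and in testing: an episode lasts at most 4,500 timesteps, i.e. 18,000 frames of interaction with the environment. The cap prevents episodes that would otherwise never terminate.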

====================================================

The reward your agent receives is proportional to its progress to the predefined horizontal offset within each level, positive for getting closer, negative for getting further away. If you reach the offset, the sum of your rewards will be 9000. In addition there is a time bonus that starts at 1000 and decreases linearly to 0 at the end of the time limit, so beating the level as quickly as possible is rewarded.

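How the reward is defined: the reward depends on horizontal displacement. Moving closer to the target offset gives a positive reward, and moving away from it gives a negative reward.

Defining the reward function and the termination condition for a game is complicated, especially for a game like Sonic the Hedgehog, where the reward design has a large effect on the agent's performance.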

Contents of contest.json:

Path: ~/anaconda3/envs/baselines/lib/python3.7/site-packages/retro/data/stable/SonicTheHedgehog-Genesis

contest.json

{
  "done": {
    "script": "lua:contest_done"
  },
  "reward": {
    "script": "lua:contest_reward"
  },
  "scripts": [
    "script.lua"
  ]
}

So the episode-termination check is the function contest_done defined in the Lua script script.lua, and the reward is the function contest_reward from the same script.
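As a small illustration of how to read this file, the sketch below parses the JSON shown above and splits each "lua:<name>" entry into its language prefix and function name. How retro itself loads the scenario is not shown in this post, so the helper lua_hooks is purely hypothetical and only demonstrates the file's structure.

```python
import json

# The contents of contest.json as listed above.
CONTEST_JSON = """
{
  "done": {"script": "lua:contest_done"},
  "reward": {"script": "lua:contest_reward"},
  "scripts": ["script.lua"]
}
"""

def lua_hooks(scenario_text):
    """Return ({key: (language, function_name)}, script_files) for a scenario."""
    scenario = json.loads(scenario_text)
    hooks = {}
    for key in ("done", "reward"):
        # "lua:contest_done" -> ("lua", "contest_done")
        language, _, function_name = scenario[key]["script"].partition(":")
        hooks[key] = (language, function_name)
    return hooks, scenario["scripts"]

hooks, scripts = lua_hooks(CONTEST_JSON)
print(hooks)
print(scripts)
```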

script.lua

level_max_x = {
    -- Green Hill Zone
    ["zone=0,act=0"] = 0x2560,
    ["zone=0,act=1"] = 0x1F60,
    ["zone=0,act=2"] = 0x292A,
    -- Marble Zone
    ["zone=2,act=0"] = 0x1860,
    ["zone=2,act=1"] = 0x1860,
    ["zone=2,act=2"] = 0x1720,
    -- Spring Yard Zone
    ["zone=4,act=0"] = 0x2360,
    ["zone=4,act=1"] = 0x2960,
    ["zone=4,act=2"] = 0x2B83,
    -- Labyrinth Zone
    ["zone=1,act=0"] = 0x1A50,
    ["zone=1,act=1"] = 0x1150,
    ["zone=1,act=2"] = 0x1CC4,
    -- Star Light Zone
    ["zone=3,act=0"] = 0x2060,
    ["zone=3,act=1"] = 0x2060,
    ["zone=3,act=2"] = 0x1F48,
    -- Scrap Brain Zone
    ["zone=5,act=0"] = 0x2260,
    ["zone=5,act=1"] = 0x1EE0,
    -- ["zone=5,act=2"] = 000000, -- does not have a max x
}

function level_key()
    return string.format("zone=%d,act=%d", data.zone, data.act)
end

function clip(v, min, max)
    if v < min then
        return min
    elseif v > max then
        return max
    else
        return v
    end
end

data.prev_lives = 3

function contest_done()
    if data.lives < data.prev_lives then
        return true
    end
    data.prev_lives = data.lives
    if calc_progress(data) >= 1 then
        return true
    end
    return false
end

data.offset_x = nil
end_x = nil

function calc_progress(data)
    if data.offset_x == nil then
        data.offset_x = -data.x
        end_x = level_max_x[level_key()] - data.x
    end
    local cur_x = clip(data.x + data.offset_x, 0, end_x)
    return cur_x / end_x
end

data.prev_progress = 0
frame_limit = 18000

function contest_reward()
    local progress = calc_progress(data)
    local reward = (progress - data.prev_progress) * 9000
    data.prev_progress = progress
    -- bonus for beating level quickly
    if progress >= 1 then
        reward = reward + (1 - clip(scenario.frame / frame_limit, 0, 1)) * 1000
    end
    return reward
end

data.xpos_last_x = nil

function xpos_done()
    if data.lives < data.prev_lives then
        return true
    end
    data.prev_lives = data.lives
    if scenario.frame >= frame_limit then
        return true
    end
    return data.x > level_max_x[level_key()]
end

function xpos_rew()
    if data.xpos_last_x == nil then
        data.xpos_last_x = data.x
    end
    local result = data.x - data.xpos_last_x
    data.xpos_last_x = data.x
    return result
end

Here level_max_x gives, for each level (zone and act), the x coordinate of the end of the level. A value such as 0x2560 is that x coordinate written in hexadecimal (9568 in decimal), not a memory address.

level_max_x = {
    -- Green Hill Zone
    ["zone=0,act=0"] = 0x2560,
    ["zone=0,act=1"] = 0x1F60,
    ["zone=0,act=2"] = 0x292A,
    -- Marble Zone
    ["zone=2,act=0"] = 0x1860,
    ["zone=2,act=1"] = 0x1860,
    ["zone=2,act=2"] = 0x1720,
    -- Spring Yard Zone
    ["zone=4,act=0"] = 0x2360,
    ["zone=4,act=1"] = 0x2960,
    ["zone=4,act=2"] = 0x2B83,
    -- Labyrinth Zone
    ["zone=1,act=0"] = 0x1A50,
    ["zone=1,act=1"] = 0x1150,
    ["zone=1,act=2"] = 0x1CC4,
    -- Star Light Zone
    ["zone=3,act=0"] = 0x2060,
    ["zone=3,act=1"] = 0x2060,
    ["zone=3,act=2"] = 0x1F48,
    -- Scrap Brain Zone
    ["zone=5,act=0"] = 0x2260,
    ["zone=5,act=1"] = 0x1EE0,
    -- ["zone=5,act=2"] = 000000, -- does not have a max x
}
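The table entries are ordinary hexadecimal integer literals, which a short Python snippet makes easy to check. The snippet copies only the Green Hill Zone rows; the dictionary name and the two-argument level_key are illustrative stand-ins for the Lua versions.

```python
# Green Hill Zone rows copied from level_max_x in script.lua.
LEVEL_MAX_X_GREEN_HILL = {
    "zone=0,act=0": 0x2560,
    "zone=0,act=1": 0x1F60,
    "zone=0,act=2": 0x292A,
}

def level_key(zone, act):
    """Build the same lookup key as the Lua level_key() function."""
    return "zone=%d,act=%d" % (zone, act)

# Print each end-of-level x coordinate in hexadecimal and in decimal.
for key, max_x in LEVEL_MAX_X_GREEN_HILL.items():
    print(f"{key}: {max_x:#06x} = {max_x}")
```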

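The starting x position of a game state is not 0, so the start position has to be recorded. The script stores it as data.offset_x: since the agent's x position at the start is data.x, it sets data.offset_x = -data.x. As the calc_progress function below shows, data.offset_x is set only once, at the initial state of the game, and is never modified afterwards.

The start position (-data.offset_x) becomes the origin of the normalized distance, and the normalized end position is level_max_x[level_key()] + data.offset_x = level_max_x[level_key()] - data.x.

From then on the agent's position is data.x + data.offset_x, with the start at 0 and the end at end_x.

The agent's relative position between start and end is cur_x / end_x, a value between 0 and 1 inclusive.

The position also has to be clipped, because the agent can move to the left of the start point or beyond the end point.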

function calc_progress(data)
    if data.offset_x == nil then
        data.offset_x = -data.x
        end_x = level_max_x[level_key()] - data.x
    end
    local cur_x = clip(data.x + data.offset_x, 0, end_x)
    return cur_x / end_x
end
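For readers more comfortable with Python, the normalization in calc_progress can be sketched as follows. This is a hand-written port for illustration, not code from retro. The class name ProgressTracker and the starting position of 80 in the usage lines are assumptions.

```python
def clip(v, lo, hi):
    """Clamp v into the closed interval [lo, hi]."""
    return max(lo, min(v, hi))

class ProgressTracker:
    """Python sketch of calc_progress for a single level."""

    def __init__(self, level_max_x):
        self.level_max_x = level_max_x
        self.offset_x = None  # set once, on the first call
        self.end_x = None

    def progress(self, x):
        if self.offset_x is None:
            # First call: remember the start so it becomes position 0.
            self.offset_x = -x
            self.end_x = self.level_max_x - x
        cur_x = clip(x + self.offset_x, 0, self.end_x)
        return cur_x / self.end_x

# Hypothetical start at x = 80 in a level that ends at 0x2560.
tracker = ProgressTracker(0x2560)
print(tracker.progress(80))      # start of the level
print(tracker.progress(40))      # left of the start, clipped
print(tracker.progress(0x2560))  # end of the level
```

Positions left of the start clip to 0 and positions past the end clip to 1, which is why the ratio always stays in [0, 1].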

With the agent's relative position between start and end available, we can move on to the core of the reward function:

data.prev_progress = 0
frame_limit = 18000

function contest_reward()
    local progress = calc_progress(data)
    local reward = (progress - data.prev_progress) * 9000
    data.prev_progress = progress
    -- bonus for beating level quickly
    if progress >= 1 then
        reward = reward + (1 - clip(scenario.frame / frame_limit, 0, 1)) * 1000
    end
    return reward
end

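In the code above, data.prev_progress is the agent's relative position between 0 and end_x at the previous step, and progress is its relative position at the current step. The reward earned by moving in the current step is (progress - data.prev_progress) * 9000. Suppose the agent goes from the start of the game all the way to end_x. When this term is summed over all steps, the intermediate values cancel, so the total is 9000 times the difference between the final progress (1) and the initial data.prev_progress (0). In other words, an agent that moves from its starting x position to end_x collects a total of 9000 from this part of the reward.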
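Written out, the cancellation is a telescoping sum. The notation is introduced here only for illustration: p_t is the value of progress after step t, p_0 = 0 is the initial data.prev_progress, and p_T = 1 is the progress on the final step T.

```latex
\sum_{t=1}^{T} 9000\,(p_t - p_{t-1})
  = 9000\,(p_T - p_0)
  = 9000\,(1 - 0)
  = 9000
```

The result does not depend on the path taken: moving backwards produces negative terms that are paid back when the agent moves forward again.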

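The second part of the reward:

frame_limit = 18000

    if progress >= 1 then
        reward = reward + (1 - clip(scenario.frame / frame_limit, 0, 1)) * 1000
    end

The second part can be seen as a penalty on elapsed time. It is paid only once the level is completed (progress >= 1). If the frame count at that moment is between 0 and 18000, the bonus shrinks in proportion to the frames used, from 1000 down to 0. If the frame count exceeds 18000, there is no bonus.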
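Both parts of the reward can be checked together with a small simulation. This is a sketch that mirrors contest_reward in Python, not the library's code. It assumes that scenario.frame advances by exactly 4 frames per step, which ignores sticky frameskip, and the function name contest_rewards is hypothetical.

```python
FRAME_LIMIT = 18000

def clip(v, lo, hi):
    """Clamp v into the closed interval [lo, hi]."""
    return max(lo, min(v, hi))

def contest_rewards(progress_per_step, frames_per_step=4):
    """Return the per-step rewards for a sequence of progress values in [0, 1]."""
    prev_progress = 0.0
    rewards = []
    for step, progress in enumerate(progress_per_step, start=1):
        # Part 1: reward proportional to the change in progress.
        reward = (progress - prev_progress) * 9000
        prev_progress = progress
        # Part 2: time bonus, paid only when the level is completed.
        if progress >= 1:
            frame = step * frames_per_step  # assumed stand-in for scenario.frame
            reward += (1 - clip(frame / FRAME_LIMIT, 0, 1)) * 1000
        rewards.append(reward)
    return rewards

# Finish in 2250 steps (9000 frames): half of the time bonus remains.
fast = contest_rewards([i / 2250 for i in range(1, 2251)])
# Finish exactly at the limit of 4500 steps (18000 frames): no bonus.
slow = contest_rewards([i / 4500 for i in range(1, 4501)])
print(round(sum(fast)), round(sum(slow)))
```

Under these assumptions the fast run totals about 9500 (9000 for progress plus 500 of bonus) and the slow run about 9000.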

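Termination condition. Note that contest.json points at contest_done, which ends the episode when a life is lost or when calc_progress reaches 1. The script also defines xpos_done, analyzed below, which additionally checks the frame limit:

function xpos_done()
    if data.lives < data.prev_lives then
        return true
    end
    data.prev_lives = data.lives
    if scenario.frame >= frame_limit then
        return true
    end
    return data.x > level_max_x[level_key()]
end

At every step the function checks whether the lives variable has decreased. If it has, the episode is over.

    if data.lives < data.prev_lives then
        return true
    end

If the frame count reaches the configured limit, the episode is terminated. This caps the length of an episode.

    if scenario.frame >= frame_limit then
        return true
    end

The agent's x position is also bounded: if it exceeds the level's maximum x position, the episode is considered finished.

    data.x > level_max_x[level_key()]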
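The three checks can be restated as one Python function. This is an illustrative port of xpos_done and not part of retro. The explicit arguments and the returned reason strings are additions made here so that each branch is easy to tell apart.

```python
FRAME_LIMIT = 18000

def xpos_done(lives, prev_lives, frame, x, level_max_x):
    """Return (done, reason), applying the checks in the same order as the Lua code."""
    if lives < prev_lives:
        return True, "life lost"
    if frame >= FRAME_LIMIT:
        return True, "frame limit reached"
    if x > level_max_x:
        return True, "end of level passed"
    return False, None

# A few hypothetical situations in Green Hill Zone act 1 (max x = 0x2560).
print(xpos_done(lives=3, prev_lives=3, frame=100, x=500, level_max_x=0x2560))
print(xpos_done(lives=2, prev_lives=3, frame=100, x=500, level_max_x=0x2560))
print(xpos_done(lives=3, prev_lives=3, frame=18000, x=500, level_max_x=0x2560))
print(xpos_done(lives=3, prev_lives=3, frame=100, x=0x2561, level_max_x=0x2560))
```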
