1503.02531-Distilling the Knowledge in a Neural Network.md

It turns out that the softmax feeding the cross-entropy loss has a temperature parameter, defined as follows:

$$
q_i=\frac{e^{z_i/T}}{\sum_j{e^{z_j/T}}}
$$

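Here T is the temperature, which is normally set to 1. Raising it (T = 2 in the first call below, compared with the default T = 1 in the second) gives: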

In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))

Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))

Out[7]: 

[0.0320586 0.08714432 0.23688284 0.6439143 ]

<NDArray 4 @cpu(0)>

In other words:

Using a higher value for T produces a softer probability distribution over classes.
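The effect of T can be sketched in a few lines of NumPy. This is only an illustration: `softmax_t` and `entropy` are helper names made up for this note, not functions from the paper or from any library. It reproduces the T = 2 numbers from the session above and reports the entropy (in nats) at several temperatures.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T (T = 1 is the ordinary softmax)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy in nats; assumes every probability is strictly positive."""
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p)))

logits = [1, 2, 3, 4]
for T in (1.0, 2.0, 5.0):
    q = softmax_t(logits, T)
    print(f"T={T}: q={np.round(q, 4)}, entropy={entropy(q):.4f}")
```

As T grows the distribution flattens toward uniform, so the entropy rises toward its maximum of log(4) for four classes.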

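A system at a higher temperature has higher entropy, that is, more disorder: the probability mass is no longer concentrated on a single class. This spread is itself information, and it can describe more of the structure in the data. A large model, through strong regularization, ends up producing outputs with lower entropy. Hence: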

Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.

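Does this mean a high temperature is forced while the large model is being trained? That feels like it would make the problem worse, not better.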

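No: the large model is trained normally. A high T is applied to its logits only when they are used to produce soft targets. The small model is also trained with the same high T, but at validation time it uses T = 1.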

In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
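A minimal NumPy sketch of this simplest form, under stated assumptions: the logits below are made-up numbers standing in for the outputs of a trained cumbersome model and a small model, and `softmax_t` and `soft_cross_entropy` are hypothetical helper names. Only the loss is computed here; a real setup would backpropagate it through the small model.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_cross_entropy(student_logits, teacher_logits, T):
    """Mean cross-entropy between teacher soft targets and student outputs, both at T."""
    p = softmax_t(teacher_logits, T)   # soft targets from the cumbersome model
    q = softmax_t(student_logits, T)   # the student uses the same T during training
    return float(-np.mean(np.sum(p * np.log(q), axis=-1)))

# Hypothetical logits: a transfer set of 2 examples, 3 classes.
teacher_logits = np.array([[5.0, 2.0, 0.5], [0.2, 4.0, 3.5]])
student_logits = np.array([[3.0, 1.0, 0.0], [0.0, 2.0, 2.5]])

T = 4.0
loss = soft_cross_entropy(student_logits, teacher_logits, T)
print("soft-target loss at T=4:", loss)

# After training, the distilled model predicts with T = 1.
pred = softmax_t(student_logits, T=1.0).argmax(axis=-1)
print("predictions at T=1:", pred)
```

The temperature changes how the training signal is shaped, not which class wins: dividing the logits by a positive T never changes the argmax.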

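Soft labels and the dataset's true labels can be used together for training. However, the gradients produced by the soft targets scale as $1/T^2$, so when the soft labels use a temperature other than 1, the soft-label loss has to be multiplied by $T^2$ to keep the two terms balanced.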
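The combined objective might be sketched as follows. The mixing weight `alpha`, the helper names, and the example logits are all assumptions made for illustration; the notes above do not fix a particular weighting. `soft_grad` uses the fact that the gradient of the soft-target cross-entropy with respect to a student logit is $(q_i - p_i)/T$, which lets the $1/T^2$ scaling be checked numerically at high temperature.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * T^2 * soft-target CE (at T) + (1 - alpha) * hard-label CE (at T = 1)."""
    p_soft = softmax_t(teacher_logits, T)
    q_soft = softmax_t(student_logits, T)
    soft_ce = -np.mean(np.sum(p_soft * np.log(q_soft), axis=-1))
    q_hard = softmax_t(student_logits, 1.0)
    hard_ce = -np.mean(np.log(q_hard[np.arange(len(labels)), labels]))
    return float(alpha * T**2 * soft_ce + (1.0 - alpha) * hard_ce)

def soft_grad(student_logits, teacher_logits, T):
    """Gradient of the soft-target CE w.r.t. the student logits, for one example."""
    return (softmax_t(student_logits, T) - softmax_t(teacher_logits, T)) / T

# Hypothetical logits and labels: 2 examples, 3 classes.
teacher_logits = np.array([[5.0, 2.0, 0.5], [0.2, 4.0, 3.5]])
student_logits = np.array([[3.0, 1.0, 0.0], [0.0, 2.0, 2.5]])
labels = np.array([0, 1])

loss = distillation_loss(student_logits, teacher_logits, labels)
print("combined loss:", loss)

# Doubling T at high temperature should shrink the gradient norm by roughly 4x.
g50 = np.linalg.norm(soft_grad(student_logits[0], teacher_logits[0], 50.0))
g100 = np.linalg.norm(soft_grad(student_logits[0], teacher_logits[0], 100.0))
ratio = g50 / g100
print("gradient norm ratio (T=50 vs T=100):", ratio)
```

Without the $T^2$ factor, raising T would silently shrink the soft-target term relative to the hard-label term, so the balance between the two would change every time T is tuned.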

The advantage of soft targets is that they carry more information than hard labels, so the small model can be trained with less data.

A model distilled from several large models may even perform better than the ensemble of those models.

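How are several models distilled into one? Use the output of each model as a target for the final distilled model and add up the losses over all of these targets. In effect this is a form of multi-task learning.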
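The summed-loss reading above can be sketched like this. It reflects this note's interpretation rather than a procedure confirmed by the paper, and the helper names and logits are hypothetical.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_loss(student_logits, teacher_logits_list, T=4.0):
    """Sum of soft-target cross-entropies, one term per teacher."""
    log_q = np.log(softmax_t(student_logits, T))
    total = 0.0
    for teacher_logits in teacher_logits_list:
        p = softmax_t(teacher_logits, T)   # this teacher's soft targets
        total += -np.mean(np.sum(p * log_q, axis=-1))
    return float(total)

# Hypothetical logits from two teachers and one student (1 example, 3 classes).
t1 = np.array([[5.0, 2.0, 0.5]])
t2 = np.array([[4.0, 3.0, 0.0]])
s = np.array([[3.0, 1.0, 0.0]])

total = multi_teacher_loss(s, [t1, t2], T=4.0)
print("summed loss over teachers:", total)
```

Because cross-entropy is linear in the target distribution, summing the losses over K teachers equals K times the loss against the arithmetic mean of their soft targets. Training on the sum is therefore equivalent, up to a constant factor, to training on the averaged ensemble prediction.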

A confusion matrix can be used to find out which classes the model gets wrong most often.
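As a small illustration, the sketch below builds a confusion matrix with NumPy and lists the most frequently confused class pairs. The labels and predictions are made up, and `confusion_matrix` and `most_confused_pairs` are hypothetical helpers written for this note.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts examples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def most_confused_pairs(cm, top_k=3):
    """Off-diagonal (true, predicted, count) triples, largest count first."""
    off = cm.copy()
    np.fill_diagonal(off, 0)               # ignore correct predictions
    order = np.argsort(off, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(order, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols) if off[r, c] > 0]

# Hypothetical labels and predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
pairs = most_confused_pairs(cm)
print(cm)
print(pairs)
```

Here the largest off-diagonal entry is for true class 1 predicted as class 2, so that pair is where the model errs most often.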

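Correction: I misread the paper. The final part seems to discuss only the training of multiple specialist models and does not say how to combine those models back into one large model. This may be a problem.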
