Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution is better in terms of generalization, and, if the two models focus on different aspects of the data, how exactly do they differ?
To keep the analysis simple but still at least a little realistic, we train a linear SVM classifier (W, bias) for a “werewolf” theme. In other words, all movies with that theme are labeled +1, and we sample random movies from the rest, which are labeled -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same “point”.
In our first experiment, we only care about minimizing the training errors. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches. The learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error was zero.
(I) loss=1.64000 ||W||=3.56, bias=-0.60811 (SGD)
(II) loss=0.04711 ||W||=3.75, bias=-0.58073 (L-BFGS)
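To make the moving parts of method (I) concrete, here is a minimal sketch of SGD with momentum on a linear SVM objective. It rests on several assumptions that are not in the post: hinge loss is assumed because the post only says "linear SVM", the update is full-batch rather than mini-batch for brevity, the learning rate and momentum values are placeholders, and the toy data and all function names are hypothetical. Only the L1 penalty of 0.0005 and the stop-at-zero-error rule come from the text.

```python
def sign(v):
    """Return -1, 0 or +1 (subgradient of |v|, with 0 at the kink)."""
    return (v > 0) - (v < 0)

def score(w, b, x):
    """Linear decision value W.x + bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_errors(w, b, X, y):
    """Count training examples whose predicted label differs from the target."""
    return sum(1 for x, t in zip(X, y)
               if (1 if score(w, b, x) > 0 else -1) != t)

def sgd_momentum(X, y, lr=0.1, momentum=0.9, l1=0.0005, max_epochs=100):
    """Full-batch subgradient descent with momentum on hinge loss + L1 penalty.

    lr and momentum are placeholder values, not the ones used in the post.
    """
    n = len(X)
    w, b = [0.0] * len(X[0]), 0.0
    vw, vb = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        if train_errors(w, b, X, y) == 0:   # stop as soon as the error is zero
            break
        gw = [l1 * sign(wi) for wi in w]    # L1 penalty subgradient
        gb = 0.0
        for x, t in zip(X, y):
            if t * score(w, b, x) < 1.0:    # example violates the margin
                for j, xj in enumerate(x):
                    gw[j] -= t * xj / n
                gb -= t / n
        vw = [momentum * v - lr * g for v, g in zip(vw, gw)]
        vb = momentum * vb - lr * gb
        w = [wi + v for wi, v in zip(w, vw)]
        b += vb
    return w, b

# Hypothetical toy data: feature 0 plays the role of the "werewolf" keyword.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
w, b = sgd_momentum(X, y)
```

A mini-batch version would compute the same subgradient on a random subset of the examples per step; with fixed seeds, as in the post, the batch order is reproducible.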
As we can see, the L2 norms of the final weight vectors are similar, and so are the biases. Of course, we do not care about absolute norms but rather about the correlation between the two solutions. For that reason, we scaled both weight vectors W to unit norm and computed their cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
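The computation described above can be sketched in a few lines of plain Python. The function names are hypothetical, and the vectors used at the bottom are toy values, not the actual 1,500-dimensional weight vectors from the experiment.

```python
import math

def unit_norm(w):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def cosine_similarity(a, b):
    """Dot product of the two unit-norm vectors."""
    ua, ub = unit_norm(a), unit_norm(b)
    return sum(x * y for x, y in zip(ua, ub))

# Toy vectors only; the real W_sgd and W_lbfgs are not reproduced here.
w_a = [0.6, 0.2, 0.1]
w_b = [1.2, 0.4, 0.2]   # same direction as w_a, twice the length
similarity = cosine_similarity(w_a, w_b)
```

Because both vectors are normalized first, the result depends only on their direction, which is why differences in ||W|| between the two models do not affect it.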
Since we do not have any empirical reference values for such correlations, we also analyzed the magnitude of the features in the weight vectors, more precisely the top-5 most important features:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
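Ranking features by weight is straightforward once the weights are paired with their keyword names. The sketch below assumes the weights are available as a keyword-to-weight mapping and ranks by absolute value; both choices are assumptions, as is the function name. The example values are the five weights of model (I) listed above, deliberately given out of order.

```python
def top_k_features(weights, k=5):
    """Return the k (keyword, weight) pairs with the largest absolute weight."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

# The five weights reported for model (I), in shuffled order.
sgd_weights = {
    "teenagers": 0.1372,
    "werewolf": 0.6652,
    "forbidden-love": 0.1392,
    "creature": 0.1886,
    "vampire": 0.2394,
}
top3 = [name for name, _ in top_k_features(sgd_weights, k=3)]
```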
If we also consider the top-12 features of both models, which are pretty similar,
(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
we can see some patterns. First, a lot of the movies in the dataset seem to combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
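How similar the two top-12 lists are can also be quantified with simple set operations. The sketch below uses the two lists exactly as given above; the Jaccard index is an added illustration, not a measure used in the original analysis, and the variable names are hypothetical.

```python
top12_sgd = [
    "werewolf", "vampire", "creature", "forbidden-love", "teenagers",
    "monster", "pregnancy", "undead", "curse", "supernatural",
    "mansion", "bloodsucker",
]
top12_lbfgs = [
    "werewolf", "vampire", "monster", "creature", "teenagers",
    "curse", "forbidden-love", "supernatural", "pregnancy", "hunting",
    "undead", "beast",
]

shared = set(top12_sgd) & set(top12_lbfgs)      # keywords in both lists
only_sgd = set(top12_sgd) - shared              # unique to model (I)
only_lbfgs = set(top12_lbfgs) - shared          # unique to model (II)

# Jaccard index: size of the intersection over size of the union.
jaccard = len(shared) / len(set(top12_sgd) | set(top12_lbfgs))
```

Ten of the twelve keywords are shared; "mansion" and "bloodsucker" appear only for SGD, "hunting" and "beast" only for L-BFGS.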
Both models learned those patterns, regardless of the actual optimization method, but with minor differences that show up in the magnitude of the individual weights in W. However, as the correlation of the parameter vectors confirmed, both solutions are pretty close together.
Bottom line: we should be careful with interpretations since the data at hand was limited. Nevertheless, the results confirm that, with proper initialization and hyper-parameters, good solutions can be achieved with both 1st and 2nd order methods. Next, we will study the ability of the models to generalize to unseen data.