Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and, if the two solutions focus on different aspects of the data, how exactly do they differ?
To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for a “werewolf” theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the “rest”, which are labeled -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start from the same “point”.
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches. The learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error was zero.
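As a rough illustration of method (I), the following is a minimal sketch of mini-batch SGD with standard momentum on a hinge loss with an L1 penalty. It is not the code used for the experiment: the mean hinge loss, the zero initialization, the batch size, the learning rate and the momentum value are all assumptions, and only the L1 penalty of 0.0005 and the stop-at-zero-error rule come from the text. The L-BFGS run is not shown.

```python
import numpy as np

def hinge_l1_loss(W, b, X, y, l1=0.0005):
    """Mean hinge loss plus an L1 penalty on the weights (assumed form of the loss)."""
    margins = y * (X @ W + b)
    return np.maximum(0.0, 1.0 - margins).mean() + l1 * np.abs(W).sum()

def train_sgd_momentum(X, y, lr=0.05, momentum=0.9, l1=0.0005,
                       batch_size=16, max_epochs=200, seed=0):
    """Mini-batch SGD with standard momentum; stops once the training error is zero."""
    rng = np.random.default_rng(seed)      # fixed seed: identical runs start identically
    n, d = X.shape
    W, b = np.zeros(d), 0.0                # assumed initialization
    vW, vb = np.zeros(d), 0.0              # momentum buffers
    for _ in range(max_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            active = yb * (Xb @ W + b) < 1.0          # samples violating the margin
            # Subgradient of the mean hinge loss plus the L1 term.
            gW = -(yb[active][:, None] * Xb[active]).sum(axis=0) / len(idx)
            gW = gW + l1 * np.sign(W)
            gb = -yb[active].sum() / len(idx)
            vW = momentum * vW - lr * gW
            vb = momentum * vb - lr * gb
            W, b = W + vW, b + vb
        if np.all(np.sign(X @ W + b) == y):           # zero training error
            break
    return W, b
```

Here `X` would be the movie-by-keyword feature matrix and `y` the vector of +1/-1 labels. Because the seed is fixed, calling the function twice with the same data yields the same weights.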
The final values for both runs were:
(I) loss=1.64000, ||W||=3.56, bias=-0.60811 (SGD)
(II) loss=0.04711, ||W||=3.75, bias=-0.58073 (L-BFGS)
As we can see, the L2 norms of the final weight vectors are similar, and so are the biases. Of course, we do not care about absolute norms but rather about how well the two solutions agree in direction. For that reason, we scaled both weight vectors W to unit norm and computed their cosine similarity, which we loosely call the correlation: correlation = W_sgd.T * W_lbfgs = 0.977.
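The unit-norm comparison described above can be sketched in a few lines of NumPy. The function name and the arguments `w_a` and `w_b` are illustrative, not taken from the original code.

```python
import numpy as np

def cosine_similarity(w_a, w_b):
    """Scale both weight vectors to unit L2 norm and return their dot product."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    u_a = w_a / np.linalg.norm(w_a)
    u_b = w_b / np.linalg.norm(w_b)
    return float(u_a @ u_b)
```

Since both vectors are normalized first, the result depends only on their direction and not on the slightly different norms (3.56 vs. 3.75) reported above. The bias is not part of this comparison.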
Since we have no reference values for judging such a correlation, we also analyzed the magnitude of the individual features in the weight vectors, more precisely the top-5 most important features of each model:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
If we also consider the top-12 features of both models, which are pretty similar,
(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
we can see some patterns here. First, many of the movies in the dataset seem to combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
Both models learned these patterns regardless of the optimization method, but with minor differences that show up in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, both solutions are pretty close together.
The bottom line: we should be careful with interpretations since the data at hand was limited. Nevertheless, the results suggest that with proper initialization and hyper-parameters, good solutions can be achieved with both 1st and 2nd order methods. Next, we will study the ability of the models to generalize to unseen data.