Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and if the two methods focus on different aspects of the data, how do their solutions differ?
To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for the "werewolf" theme: all movies with that theme are labeled +1, and we sample random movies as the "rest", labeled -1. For the features, we use the 1,500 most frequent keywords. All random seeds were fixed, so both models start from the same point.
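As a rough illustration of this setup (not the author's actual code), here is how the feature matrix and labels could be built with scikit-learn; the keyword lists are placeholders:

```python
# A minimal sketch of the data setup with placeholder keyword lists.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Keywords per movie; in the post these come from a real movie dataset.
werewolf_movies = [["werewolf", "curse", "teenagers"], ["werewolf", "vampire"]]
random_movies = [["spaceship", "robot"], ["heist", "getaway"]]

docs = [" ".join(kws) for kws in werewolf_movies + random_movies]
vectorizer = CountVectorizer(max_features=1500, binary=True)  # top keywords
X = vectorizer.fit_transform(docs).toarray().astype(float)
y = np.array([1.0] * len(werewolf_movies) + [-1.0] * len(random_movies))
```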
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached 100% accuracy on the training data, and training was stopped as soon as the error hit zero.
(I) loss=1.64000, ||W||=3.56, bias=-0.60811 (SGD)
(II) loss=0.04711, ||W||=3.75, bias=-0.58073 (L-BFGS)
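The post does not include the training code, so below is a minimal sketch of how both routes might look, assuming a hinge loss with the L1 penalty of 0.0005 mentioned above. All names (loss_and_grad, train_sgd, train_lbfgs) are hypothetical, and applying L-BFGS to the nonsmooth L1 term via a subgradient is an approximation.

```python
# A hedged sketch, not the author's actual code: hinge loss + L1 penalty,
# optimized once with mini-batch SGD + momentum and once with L-BFGS.
import numpy as np
from scipy.optimize import minimize

L1_PENALTY = 0.0005  # value from the post

def loss_and_grad(theta, X, y):
    """Hinge loss plus L1 penalty on W (bias unpenalized); returns (loss, grad).
    np.sign is used as a subgradient of |W|, which is an approximation."""
    W, b = theta[:-1], theta[-1]
    margins = 1.0 - y * (X @ W + b)
    active = margins > 0
    loss = margins[active].sum() + L1_PENALTY * np.abs(W).sum()
    grad_W = -(X[active].T @ y[active]) + L1_PENALTY * np.sign(W)
    grad_b = -y[active].sum()
    return loss, np.append(grad_W, grad_b)

def train_sgd(X, y, lr=0.01, momentum=0.9, epochs=200, batch=32, seed=0):
    """Route (I): mini-batch SGD with fixed learning rate and momentum."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1] + 1)
    velocity = np.zeros_like(theta)
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            _, grad = loss_and_grad(theta, X[idx], y[idx])
            velocity = momentum * velocity - lr * grad
            theta = theta + velocity
    return theta

def train_lbfgs(X, y):
    """Route (II): L-BFGS on the same loss, from the same starting point."""
    theta0 = np.zeros(X.shape[1] + 1)
    result = minimize(loss_and_grad, theta0, args=(X, y),
                      jac=True, method="L-BFGS-B")
    return result.x
```

Splitting the returned theta into (W, bias) afterwards gives the quantities reported above, e.g. np.linalg.norm(W) for ||W||.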
As we can see, the L2 norms of the final weight vectors are similar, as are the biases. But we are not interested in absolute norms so much as in how correlated the two solutions are. For that reason, we converted both weight vectors W to unit norm and computed their cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
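In code, this correlation is just the dot product of the unit-normalized weight vectors; a small sketch, with W_sgd and W_lbfgs being the learned weights (bias excluded):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of unit-normalized vectors, as used for the 0.977 figure."""
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

# correlation = cosine_similarity(W_sgd, W_lbfgs)  # ~0.977 in the post
```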
Since we have no empirical baseline for such correlations, we also analyzed the magnitudes of the features in the weight vectors, more precisely the top-5 most important features:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
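Such rankings can be read off the weight vector directly; a small sketch, assuming feature_names comes from the vectorizer's vocabulary (e.g. vectorizer.get_feature_names_out() in scikit-learn):

```python
import numpy as np

def top_k_features(W, feature_names, k=5):
    """Return the k features with the largest absolute weight."""
    order = np.argsort(-np.abs(W))[:k]
    return [(feature_names[i], round(float(W[i]), 4)) for i in order]
```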
If we also consider the top-12 features of both models, which are pretty similar,
(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
we can see some patterns: First, many movies in the dataset combine the theme with love stories, often involving teenagers, which makes sense since this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
Both models learned these patterns regardless of the optimization method, with minor differences that show up in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, the two solutions are quite close together.
Bottom line, we should be careful with interpretations since the data at hand was limited. Nevertheless, the results confirm that with proper initialization and hyper-parameters, good solutions can be achieved with both 1st- and 2nd-order methods. Next, we will study the models' ability to generalize to unseen data.