Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and, if the two focus on different aspects of the data, how exactly do they differ?
To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for the "werewolf" theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the "rest", which are labeled -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same point.
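A minimal sketch of how such a dataset could be assembled; the load_movies helper and the balanced negative sampling are assumptions for illustration, not part of the original setup:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# load_movies() is a hypothetical helper returning (keyword_text, theme_set)
# pairs, one per movie; it is not part of the original post.
movies = load_movies()
texts = [kw for kw, _ in movies]

# Binary bag-of-words over the 1,500 most frequent keywords.
vectorizer = CountVectorizer(max_features=1500, binary=True)
X = vectorizer.fit_transform(texts).toarray()
y = np.array([1 if "werewolf" in themes else -1 for _, themes in movies])

# Sample random movies for the negative class (a balanced split is an
# assumption here); the seed is fixed, as in the experiment.
rng = np.random.default_rng(0)
pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == -1)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
X, y = X[idx], y[idx]
```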
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and the momentum term were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the training error was zero.
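The post does not show its training code, so the following is only a rough sketch of what the SGD variant (I) might look like in plain NumPy: mini-batch subgradient descent on the hinge loss with an L1 penalty and classical momentum. All names, default values beyond the stated ones, and the stopping check are illustrative:

```python
import numpy as np

def train_sgd(X, y, lr=0.01, momentum=0.9, l1=0.0005, batch_size=32, seed=0):
    """Mini-batch SGD with classical momentum on the L1-penalized hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, b = np.zeros(d), 0.0
    vW, vb = np.zeros(d), 0.0
    while True:  # run until the training error is zero, as in the post
        for B in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            Xb, yb = X[B], y[B]
            viol = yb * (Xb @ W + b) < 1.0          # samples with active hinge
            gW = -Xb[viol].T @ yb[viol] / len(B) + l1 * np.sign(W)
            gb = -yb[viol].sum() / len(B)
            vW = momentum * vW - lr * gW            # momentum update
            vb = momentum * vb - lr * gb
            W, b = W + vW, b + vb
        if np.all(y * (X @ W + b) > 0.0):           # zero training error
            return W, b
```

Running both optimizers to zero training error gave the following final statistics: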
(I) SGD: loss=1.64000, ||W||=3.56, bias=-0.60811
(II) L-BFGS: loss=0.04711, ||W||=3.75, bias=-0.58073
As we can see, the L2 norms of the final weight vectors are similar, as are the biases. Of course, we do not care about absolute norms but rather about the correlation of the two solutions. For that reason, we converted both weight vectors W to unit norm and determined the cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
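For reference, the unit-norm cosine similarity is just a dot product after normalization; a trivial sketch, assuming W_sgd and W_lbfgs are NumPy arrays:

```python
import numpy as np

def cosine(w1, w2):
    """Cosine similarity between two weight vectors after unit-norming."""
    return float((w1 / np.linalg.norm(w1)) @ (w2 / np.linalg.norm(w2)))

# cosine(W_sgd, W_lbfgs) gave 0.977 in the experiment above.
```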
Since we do not have any empirical reference data for such correlations, we also analyzed the magnitudes of the individual feature weights. More precisely, the top-5 most important features are:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
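Such a ranking can be read directly off the weight vector; a small sketch, assuming a NumPy vector W and a hypothetical list vocab holding the 1,500 keyword names in feature order:

```python
import numpy as np

def top_features(W, vocab, k=5):
    """Return the k keywords with the largest absolute weight in W."""
    order = np.argsort(-np.abs(W))[:k]
    return [(vocab[i], float(W[i])) for i in order]

# e.g. top_features(W_sgd, vocab) ->
# [('werewolf', 0.6652), ('vampire', 0.2394), ...]
```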
If we also consider the top-12 features of both models, which are pretty similar,

(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
we can see some patterns here. First, many movies in the dataset seem to combine the werewolf theme with love stories that often involve teenagers, which makes sense given how popular this pattern is these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
Both models learned those patterns regardless of the optimization method, with minor differences that show up in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, the two solutions lie pretty close together.
Bottom line: we should be careful with interpretations, since the data at hand was limited. Nevertheless, the results confirm that with proper initialization and hyper-parameters, good solutions can be achieved with both first- and second-order methods. Next, we will study how well the models generalize to unseen data.