Comparing Differently Trained Models

At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and if the two methods focus on different aspects of the data, how do their solutions differ?

To keep the analysis simple, yet at least somewhat realistic, we train a linear SVM classifier (W, bias) for the “werewolf” theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the “rest” and label them -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same “point”.
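To make the setup concrete, here is a minimal sketch of the assumed data preparation. The movie records, field names, and the build_dataset helper are hypothetical illustrations; the post only specifies the labeling scheme, the 1,500-keyword vocabulary, and the fixed seeds.

import numpy as np
from collections import Counter

def build_dataset(movies, theme="werewolf", n_keywords=1500, n_negatives=None, seed=0):
    # movies: list of dicts with "keywords" (list of str) and "themes" (set of str) -- assumed layout
    rng = np.random.RandomState(seed)  # fixed seed, so both models see the same data

    # vocabulary: the n_keywords most frequent keywords over the whole corpus
    counts = Counter(kw for m in movies for kw in m["keywords"])
    vocab = [kw for kw, _ in counts.most_common(n_keywords)]
    index = {kw: i for i, kw in enumerate(vocab)}

    # positives: movies tagged with the theme; negatives: a random sample of the rest
    pos = [m for m in movies if theme in m["themes"]]
    rest = [m for m in movies if theme not in m["themes"]]
    n_neg = n_negatives or len(pos)
    neg = [rest[i] for i in rng.choice(len(rest), n_neg, replace=False)]

    def encode(m):
        x = np.zeros(len(vocab))
        for kw in m["keywords"]:
            if kw in index:
                x[index[kw]] = 1.0  # binary keyword features
        return x

    X = np.vstack([encode(m) for m in pos + neg])
    y = np.array([1.0] * len(pos) + [-1.0] * len(neg))
    return X, y, vocab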

In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error was zero.
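The post does not spell out the exact loss or hyper-parameter values, so the following is only a sketch of the two procedures, assuming a squared hinge loss with the stated L1 penalty; the learning rate, momentum value, batch size, and epoch count are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def loss_and_grad(theta, X, y, l1=0.0005):
    # squared hinge loss plus L1 penalty on the weights; theta = [W, bias]
    W, b = theta[:-1], theta[-1]
    margins = y * (X @ W + b)
    viol = np.maximum(0.0, 1.0 - margins)
    loss = np.mean(viol ** 2) + l1 * np.abs(W).sum()
    coef = -2.0 * y * viol / len(y)                # derivative of mean(viol^2) w.r.t. the margins
    grad_W = X.T @ coef + l1 * np.sign(W)          # subgradient of the L1 term
    grad_b = coef.sum()
    return loss, np.append(grad_W, grad_b)

def train_sgd(X, y, lr=0.1, mom=0.9, epochs=200, batch=32, seed=0):
    rng = np.random.RandomState(seed)
    theta = np.zeros(X.shape[1] + 1)
    velocity = np.zeros_like(theta)
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            _, g = loss_and_grad(theta, X[idx], y[idx])
            velocity = mom * velocity - lr * g     # classical momentum update
            theta = theta + velocity
    return theta

def train_lbfgs(X, y):
    theta0 = np.zeros(X.shape[1] + 1)
    res = minimize(loss_and_grad, theta0, args=(X, y), jac=True, method="L-BFGS-B")
    return res.x

Note that handing the L1 subgradient directly to L-BFGS is a simplification; a smooth approximation of the penalty or an OWL-QN-style variant would be the cleaner choice.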

(I) SGD: loss=1.64000, ||W||=3.56, bias=-0.60811
(II) L-BFGS: loss=0.04711, ||W||=3.75, bias=-0.58073

As we can see, the L2 norm of the final weight vector and the bias are similar for both methods. Of course, we do not care about absolute norms but rather about the correlation of the two solutions. For that reason, we converted both weight vectors W to unit norm and computed the cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
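A minimal sketch of this comparison step, assuming w_sgd and w_lbfgs are the learned weight vectors from the two runs:

import numpy as np

def cosine_similarity(w_sgd, w_lbfgs):
    u = w_sgd / np.linalg.norm(w_sgd)   # scale both vectors to unit norm
    v = w_lbfgs / np.linalg.norm(w_lbfgs)
    return float(u @ v)                 # dot product of unit vectors = cosine similarity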

Since we do not have empirical reference values for such correlations, we also analyzed the magnitudes of the individual features in the weight vectors. More precisely, the top-5 most important features are:

(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
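
Such a ranking can be read directly off a trained weight vector; a small sketch, assuming vocab is the keyword list from the data-preparation sketch above:

import numpy as np

def top_k_features(W, vocab, k=5):
    order = np.argsort(-np.abs(W))[:k]               # indices of the k largest weights by magnitude
    return [(vocab[i], float(W[i])) for i in order]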

If we also consider the top-12 features of both models, which are quite similar,

(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast

we can see some patterns. First, many of the movies in the dataset combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.

Both models learned these patterns, regardless of the optimization method, with only minor differences that show up in the magnitudes of the individual weights in W. As the correlation of the parameter vectors confirmed, the two solutions are quite close together.

Bottom line: we should be careful with interpretations since the data at hand was limited, but the results nevertheless confirm that, with proper initialization and hyper-parameters, good solutions can be achieved with both first- and second-order methods. Next, we will study how well these models generalize to unseen data.

 
