关于one-hot encoding思考

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain.

Suppose you have a dataset having only a single categorical feature "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1 and 2. You then have a weight w for this feature in a linear classifier, which will make some kind of decision based on the constraint w×x + b > 0, or equivalently w×x < b.

The problem now is that the weight w cannot encode a three-way choice. The three possible values of w×x are 0, w and 2×w. Either these three all lead to the same decision (they're all < b or ≥b) or "UK" and "French" lead to the same decision, or "French" and "US" give the same decision. There's no possibility for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out.（二分类问题，若dummy encoding，us和uk始终不能单独成为一类，而若one-hot encoding，则可以适应任何情况）

By one-hot encoding, you effectively blow up the feature space to three features, which will each get their own weights, so the decision function is now w[UK]x[UK] + w[FR]x[FR] + w[US]x[US] < b, where all the x's are booleans. In this space, such a linear function can express any sum/disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).

Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.

This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.

dataframe one-hot encoding：pandas.get_dummies方法

参考：

https://gist.github.com/ramhiser/982ce339d5f8c9a769a0

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.get_dummies.html

关于one-hot encoding思考的更多相关文章

关于.NET参数传递方式的思考
年关将近,整个人已经没有了工作和写作的激情,估计这个时候很多人跟我差不多,该相亲的相亲,该聚会喝酒的聚会喝酒,总之就是没有了干活的心思(我有很多想法,但就是叫不动我的手脚,所以我只能看着别人在做我想做 ...
关于过拟合、局部最小值、以及Poor Generalization的思考
Poor Generalization 这可能是实际中遇到的最多问题. 比如FC网络为什么效果比CNN差那么多啊,是不是陷入局部最小值啊?是不是过拟合啊?是不是欠拟合啊? 在操场跑步的时候,又从SVM ...
Spring之LoadTimeWeaver——一个需求引发的思考---转
原文地址:http://www.myexception.cn/software-architecture-design/602651.html Spring之LoadTimeWeaver——一个需求引 ...
关于学习是UIWebView的一些思考
前几天因为数据中加载有html语言的数据,关于html语言和UIWebView,有一些纠结,经过几天的研究,也有了一些自己的简单的见解. 我有两个页面需要加载html语言(注意,这里 ...
Python--Cmd窗口运行Python时提示Fatal Python error: Py_Initialize: can't initialize sys standard streams LookupError: unknown encoding: cp65001
源地址连接: http://www.tuicool.com/articles/ryuaUze 最近,我在把一个 Python 2 的视频下载工具 youku-lixian 改写成 Python 3,并 ...
基于纯Java代码的Spring容器和Web容器零配置的思考和实现（3） - 使用配置
经过<基于纯Java代码的Spring容器和Web容器零配置的思考和实现(1) - 数据源与事务管理>和<基于纯Java代码的Spring容器和Web容器零配置的思考和实现(2) - ...
file.encoding到底指的是什么呢？
转载请注明来源:http://blog.csdn.net/loongshawn/article/details/50918506 <Java利用System.getProperty(“file. ...
Java 小记 — Spring Boot 的实践与思考
前言本篇随笔用于记录我在学习 Java 和构建 Spring Boot 项目过程中的一些思考,包含架构.组件和部署方式等.下文仅为概要,待闲时逐一整理为详细文档. 1. 组件开源社区如火如荼,若在 ...
Android图表库MPAndroidChart(六)——换一种思考方式，水平条形图的实现过程
Android图表库MPAndroidChart(六)--换一种思考方式,水平条形图的实现过程一.基本实现我们之前实现了条形图,现在来看下水平条形图是怎么实现的,说白了就是横起来,看下效果: 说起 ...

随机推荐

linux文件权限更改命令chmod及数字权限
chmod -change file mode bits :更改文件权限 chmod是用来改变文件或者目录权限的命令,但只有文件的属主和超级用户(root)才有这种权限. 更改文件权限的2种方式: 一 ...
【php】子类覆盖超类方法，在超类里调用此方法会出现何种现象
<?php class A { public function getName() { echo $this->name(); } function name () { return 'l ...
MySQL 之视图、触发器、事务、存储过程、内置函数、流程控制、索引
本文内容: 视图触发器事务存储过程内置函数流程控制索引 ------------------------------------------------------------------ ...
使用selenium和phantomJS浏览器获取网页内容的小演示
# 使用selenium和phantomJS浏览器获取网页内容的小演示 # 导入包 from selenium import webdriver # 使用selenium库里的webdriver方法调 ...
Day06for循环和字符串的内置方法
Day06 1.for循环(迭代器循环) while循环条件循环,循环是否结束取决于条件的真假 for循环,迭代器循环,多用于循环取值,循环是否结束取决于被循环数据的元素个数 2.range(1,5 ...
hdu 5878
I Count Two Three Time Limit: 3000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others ...
.NET中常见的加解密方式
在互联网普及的初期,人们更关注单纯的连接性,以不受任何限制地建立互联网为最终目的.正如事情都具有两面性,互联网的便捷性给人们带来了负面问题,计算机病毒的侵害.信息泄露.网络欺诈等利用互联网的犯罪行为日 ...
HDU 4089 && UVa 1498 Activation 带环的概率DP
要在HDU上交的话,要用滚动数组优化一下空间. 这道题想了很久,也算是想明白了,就好好写一下吧. P1:激活游戏失败,再次尝试. P2:连接失服务器败,从队首排到队尾. P3:激活游戏成功,队首的人出 ...
Laya list 居中
1.将list放在一个box中,去除box的宽高,设其锚点为0.5,0.5 2.将box的锚点放到目标位置 3.在list渲染后,设定box的宽度为list的宽度
公钥密码之RSA密码算法大素数判定：Miller-Rabin判定法！
公钥密码之RSA密码算法大素数判定:Miller-Rabin判定法! 先存档再说,以后实验报告还得打印上交. Miller-Rabin大素数判定对于学算法的人来讲不是什么难事,主要了解其原理. 先来灌 ...

关于one-hot encoding思考

关于one-hot encoding思考的更多相关文章

随机推荐

热门专题