Forwarded from: https://blog.csdn.net/lujiandong1/article/details/49448051

How should normalization be done when discrete and continuous features coexist? Summarized from the discussion at:
https://www.quora.com/What-are-good-ways-to-handle-discrete-and-continuous-inputs-together
The summary follows:
1. Given the raw features, every feature must be normalized individually. Suppose feature A takes values in [-1000, 1000] while feature B takes values in [-1, 1].
In a logistic regression score w1*x1 + w2*x2, x1 takes values so much larger that x2 has almost no effect on the result.
So normalization is required, and it must be applied to each feature separately; a tiny numeric sketch follows.
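A minimal sketch of the scale-dominance point (the weights and inputs are hypothetical):

```python
# Hypothetical weights and inputs: with comparable weights, the wide-range
# feature x1 dominates the linear score, so x2 barely matters.
w1, w2 = 0.5, 0.5
x1 = 800.0  # feature A, range [-1000, 1000]
x2 = 0.9    # feature B, range [-1, 1]
print(w1 * x1 + w2 * x2)  # 400.45 -- almost entirely driven by x1
```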
2. Common normalization methods for continuous features:
   2.1 Rescale bounded continuous features: every continuous input that is bounded is rescaled to [-1, 1] via x' = (2x - max - min) / (max - min), i.e. a linear rescaling onto [-1, 1].
   2.2 Standardize all continuous features: for every continuous feature, compute its mean u and standard deviation s and set x' = (x - u) / s, giving zero mean and unit variance.
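A minimal NumPy sketch of both methods (the function names are mine; it assumes a bounded 1-D feature):

```python
import numpy as np

def rescale_to_unit_interval(x):
    """Linearly rescale a bounded feature to [-1, 1]: (2x - max - min) / (max - min)."""
    return (2 * x - x.max() - x.min()) / (x.max() - x.min())

def standardize(x):
    """Zero-mean, unit-variance standardization: (x - u) / s."""
    return (x - x.mean()) / x.std()

x = np.array([-1000.0, -200.0, 0.0, 500.0, 1000.0])
print(rescale_to_unit_interval(x))  # endpoints map to -1 and 1
print(standardize(x))               # mean 0, standard deviation 1
```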
3. How to handle discrete features:

a) Binarize categorical/discrete features: For all categorical features, represent them as multiple boolean features. For example, instead of having one feature called marriage_status, have 3 boolean features - married_status_single, married_status_married, married_status_divorced - and appropriately set these features to 1 or -1. As you can see, for every categorical feature you are adding k binary features, where k is the number of values that the categorical feature takes. In short, discrete features get one-hot encoded: a discrete feature with k distinct values is represented in k dimensions.
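A minimal one-hot sketch in NumPy (using the common 0/1 convention rather than the 1/-1 variant in the quote; the names are mine):

```python
import numpy as np

def one_hot(values):
    """Encode each categorical value as a k-dimensional indicator vector,
    where k is the number of distinct values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return encoded, categories

status = ["single", "married", "divorced", "married"]
encoded, cats = one_hot(status)
print(cats)     # ['divorced', 'married', 'single']
print(encoded)  # one indicator column per category, one row per sample
```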

There are solid reasons for using one-hot encoding on discrete features; it is not an arbitrary choice. The reasons break down into the following points:
1. Why do we binarize categorical features?
We binarize the categorical inputs so that they can be thought of as vectors in Euclidean space (we call this embedding the vectors in Euclidean space). One-hot encoding maps the values of a discrete feature into Euclidean space: each value of the feature corresponds to a point in that space.
 
2. Why do we embed the feature vectors in Euclidean space?
Because many algorithms for classification/regression/clustering etc. require computing distances or similarities between feature vectors, and many definitions of distance and similarity are defined over features in Euclidean space. So we would like our features to lie in Euclidean space as well; the distance and similarity measures most commonly used in machine learning (Euclidean distance, cosine similarity, and so on) presuppose exactly this setting.

3. Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Take a dataset with just one feature (say job_type) that takes three values 1, 2, 3.
Now take the three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What are the Euclidean distances between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says the distance between job type 1 and job type 2 is smaller than between job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features we cannot properly define a distance between the different values the feature takes. In such cases, isn't it fairer to assume that all values of a categorical feature are equally far away from each other?
Now let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them now? All of them are sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
One-hot encoding of discrete features really does make distance computation more sensible. Take a discrete feature for job type with three values: without one-hot encoding the representations are x_1 = (1), x_2 = (2), x_3 = (3), so d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean job 1 is less similar to job 3 than to job 2? Clearly such a representation yields unreasonable distances. With one-hot encoding we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2): every pair of jobs is equally far apart, which is far more reasonable. The check below reproduces this.
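A quick NumPy check of those distances:

```python
import numpy as np

# Integer codes: distances falsely suggest job 1 is closer to job 2 than to job 3.
x1, x2, x3 = np.array([1.0]), np.array([2.0]), np.array([3.0])
print(np.linalg.norm(x1 - x2), np.linalg.norm(x1 - x3))  # 1.0 2.0

# One-hot codes: every pair of distinct job types is equally far apart.
x1, x2, x3 = np.eye(3)
print(np.linalg.norm(x1 - x2), np.linalg.norm(x1 - x3))  # 1.414... 1.414...
```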
4. What about the original question (a feature with very many values)?
Note that our reason for binarizing categorical features is independent of the number of values the feature takes, so yes, even if the categorical feature takes 1000 values, we would still prefer binarization.
In short, one-hot encoding a discrete feature is done to make distance computation sensible, regardless of how many values the feature has.
5. Are there cases when we can avoid binarization?
Yes. As we figured out earlier, the reason we binarize is that we want a meaningful distance relationship between the different values. As long as some meaningful distance relationship already exists, we can avoid binarizing the categorical feature. For example, suppose you are building a classifier to decide whether a webpage is an important entity page (a page important to a particular entity), and one feature is the rank of the webpage in the search results for that entity. Then 1] the rank feature is categorical, and 2] rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature already defines a meaningful distance relationship, and in this case we do not have to binarize it.

More generally, if you can cluster the categorical values into disjoint subsets such that the subsets have a meaningful distance relationship among them, then you do not have to binarize fully; instead you can split only over these clusters. For example, if there is a categorical feature with 1000 values, but you can split those 1000 values into 2 groups of, say, 400 and 600, and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one per cluster, and that should be fine.
In other words, one-hot encoding exists to make distance computation reasonable; if a discrete feature already admits a sensible distance without it, the encoding is unnecessary. With the 1000-value feature split into groups of 400 and 600 where distances are well defined both between and within the groups, one-hot encoding can be skipped; a rough sketch of this grouping idea follows.
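One way to read that suggestion, as a hypothetical sketch (the group dictionaries, value names, and within-group scores are all stand-ins, not a prescribed recipe):

```python
# Hypothetical: a many-valued categorical feature split into two groups, each
# carrying a meaningful internal ordering. Instead of ~1000 one-hot columns,
# emit 2 columns, one per group (0 where the row is not in that group).
group_a = {"v1": 1.0, "v2": 2.0}             # stand-in for the 400-value group
group_b = {"v3": 1.0, "v4": 2.0, "v5": 3.0}  # stand-in for the 600-value group

def encode(value):
    return [group_a.get(value, 0.0), group_b.get(value, 0.0)]

print(encode("v2"))  # [2.0, 0.0]
print(encode("v5"))  # [0.0, 3.0]
```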
 
After one-hot encoding, each dimension of the encoded feature can itself be viewed as a continuous feature, so each dimension can be normalized with the same methods used for continuous features, e.g. rescaled to [-1, 1] or standardized to zero mean and unit variance.
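A sketch of this combined treatment with scikit-learn's ColumnTransformer (the column names and toy data are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one continuous column, one categorical column.
df = pd.DataFrame({
    "income": [1200.0, -300.0, 450.0, 980.0],
    "job_type": ["a", "b", "c", "a"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),   # zero mean, unit variance
    ("cat", OneHotEncoder(), ["job_type"]),  # k indicator columns
])
print(pre.fit_transform(df))
```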
 
Some situations do not require feature normalization:
     It depends on your ML algorithm. Some methods require almost no effort to normalize features or to handle continuous and discrete features together, such as tree-based methods: C4.5, CART, random forest, bagging, or boosting. But most parametric models (generalized linear models, neural networks, SVMs, etc.) and methods that use distance metrics (KNN, kernel methods, etc.) require careful preprocessing to achieve good results. Standard approaches include binarizing all categorical features, standardizing all continuous features to zero mean and unit variance, and so on.
      In short: tree-based methods (random forest, bagging, boosting, and so on) do not need feature normalization, while parametric models and distance-based models do. A small check of the tree claim follows.
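A minimal check of the tree-invariance claim, assuming scikit-learn (trees split on one feature at a time via thresholds, so monotonic per-feature rescaling should leave the learned partition unchanged):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1000.0, 1.0])  # very different scales
y = (X[:, 0] / 1000.0 + X[:, 1] > 0).astype(int)
Xs = StandardScaler().fit_transform(X)                   # rescaled copy

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(Xs, y)

# Thresholds shift with the rescaling but the partition of the samples does
# not, so the two trees make identical predictions.
print((raw.predict(X) == scaled.predict(Xs)).all())  # True
```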
