integer encoding vs 1-hot (py)
https://github.com/szilard/benchm-ml/issues/1
glouppe commented on 7 May 2015
|
Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.
When the categories are ordered, it makes more sense indeed to handle them as numerical variables. I dont have a strong argument as to why it may be also better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer encoded splits may be less unbalanced. |
Thanks @glouppe. I read somewhere a paper that AFAIR suggested to sort the (non-ordered) categoricals in order of their frequency in the data and encode them as integers as such. Any idea what that paper might be?
glouppe commented on 7 May 2015
|
Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear. See section 3.6.3.2 of my thesis if you dont have the CART book. |
One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).
While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..
integer encoding vs 1-hot (py)的更多相关文章
- [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。
想在python代码中输出汉字.但是老是出现SyntaxError: Non-ASCII character '\xe4' in file test.py on line , but no encod ...
- 关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。
[已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no enc ...
- requests之headers 'Content-Type': 'text/html'误判encoding为'ISO-8859-1'导致中文text解码错误
0. requests不设置UA 访问baidu 得到 r.headers['Content-Type'] 是text/html 使用chrome UA: Content-Type:text/htm ...
- leetCode练题——13. Roman to Integer
1.题目13. Roman to Integer Roman numerals are represented by seven different symbols: I, V, X, L, C, D ...
- 【论文考古】量化SGD QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-Efficient SGD ...
- Python函数信息
Python函数func的信息可以通过func.func_*和func.func_code来获取 一.先看看它们的应用吧: 1.获取原函数名称: 1 >>> def yes():pa ...
- Scrapy学习-23-分布式爬虫
scrapy-redis分布式爬虫 分布式需要解决的问题 request队列集中管理 去重集中管理 存储管理 使用scrapy-redis实现分布式爬虫 github开源项目: https://g ...
- Flask入门系列(转载)
一.入门系列: Flask入门系列(一)–Hello World 项目开发中,经常要写一些小系统来辅助,比如监控系统,配置系统等等.用传统的Java写,太笨重了,连PHP都嫌麻烦.一直在寻找一个轻量级 ...
- 使用Java将搜狗词库文件(文件后缀为.scel)转为.txt文件
要做一个根据词库进行筛选主要词汇的功能,去搜狗下载专业词汇词库时,发现是.scel文件,且通过转换工具(http://tools.bugscaner.com/sceltotxt/)转换为txt时报错如 ...
随机推荐
- 微软原版WINDOWS10-LTSB-X64位操作系统的全新安装与优化
原版WINDOWS10_LTSB_X64位操作系统,安装U盘的制作 1.在一台能正常运行的电脑上,下载原版WINDOWS10_LTSB_X64位操作系统镜像(ISO)文件: 2.运行UltraISO. ...
- [ArgumentException: 可能证书“CN=JRNet01-PC”没有能够进行密钥交换的私钥,或者进程可能没有访问私钥的权限。有关详细信息,请参见内部异常。]
堆栈跟踪: [CryptographicException: 密钥集不存在. ] System.Security.Cryptography.Utils.CreateProvHandle(CspPara ...
- js客户端UI框架
Best jQuery UI http://b-jui.com/ jQuery EasyUI http://www.jeasyui.com/ bootstrap学习网: http://www.runo ...
- JAVA内置注解 基本注解
温故而知新,可以为师矣! 每天复习,或者学习一点小东西,也能水滴石穿! 今天复习5个JAVA内置基本注解(贴代码胜过千言万语): package com.lf.test; import java.ut ...
- 洛谷 4721 【模板】分治 FFT——分治FFT / 多项式求逆
题目:https://www.luogu.org/problemnew/show/P4721 分治FFT:https://www.cnblogs.com/bztMinamoto/p/9749557.h ...
- bzoj 1597 [Usaco2008 Mar]土地购买——斜率优化dp
题目:https://www.lydsy.com/JudgeOnline/problem.php?id=1597 又一道斜率优化dp.负数让我混乱.不过仔细想想还是好的. 还可以方便地把那个负号放到x ...
- selenium - xpath - 定位
前言: XPath 是一门在 XML 文档中查找信息的语言.XPath 可用来在 XML 文档中对元素和属性进行遍历. 看这里介绍:w3school 首先来看一下xpath常用的语法: 一.xpath ...
- ZipArchive扩展的使用和Guzzle依赖的安装使用
在项目开发的过程中,需要去远程下载录音文件 然后保存到自己的项目中,然后再把录音文件压缩打包,最后再下载给用户 1.Guzzle依赖的安装 guzzle官方文档:http://guzzle-cn.re ...
- POJ3020(最小边覆盖)
Antenna Placement Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 8924 Accepted: 4428 ...
- Navicat设定mysql定时任务步骤示例
怎样在Navicat中设置,是数据库按照记录中的日期更新状态字段 其实这个很常用,比如你网站里的某条记录的日期——比如说数据库中某条活动记录的审核日期字段已经过期,亦即当前时间已经超过审核日期,那么定 ...