https://github.com/szilard/benchm-ml/issues/1

glouppe commented on 7 May 2015

Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.

I would expect faster, lower memory but decrease in AUC (or same in some cases).

When the categories are ordered, it makes more sense indeed to handle them as numerical variables. I dont have a strong argument as to why it may be also better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer encoded splits may be less unbalanced.

Thanks @glouppe. I read somewhere a paper that AFAIR suggested to sort the (non-ordered) categoricals in order of their frequency in the data and encode them as integers as such. Any idea what that paper might be?

glouppe commented on 7 May 2015

Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear.

See section 3.6.3.2 of my thesis if you dont have the CART book.
http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf

One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).

While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..

 
 

 

integer encoding vs 1-hot (py)的更多相关文章

  1. [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。

    想在python代码中输出汉字.但是老是出现SyntaxError: Non-ASCII character '\xe4' in file test.py on line , but no encod ...

  2. 关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。

    [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no enc ...

  3. requests之headers 'Content-Type': 'text/html'误判encoding为'ISO-8859-1'导致中文text解码错误

    0. requests不设置UA 访问baidu 得到 r.headers['Content-Type'] 是text/html  使用chrome UA: Content-Type:text/htm ...

  4. leetCode练题——13. Roman to Integer

    1.题目13. Roman to Integer Roman numerals are represented by seven different symbols: I, V, X, L, C, D ...

  5. 【论文考古】量化SGD QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

    D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-Efficient SGD ...

  6. Python函数信息

    Python函数func的信息可以通过func.func_*和func.func_code来获取 一.先看看它们的应用吧: 1.获取原函数名称: 1 >>> def yes():pa ...

  7. Scrapy学习-23-分布式爬虫

    scrapy-redis分布式爬虫 分布式需要解决的问题 request队列集中管理 去重集中管理 存储管理   使用scrapy-redis实现分布式爬虫 github开源项目: https://g ...

  8. Flask入门系列(转载)

    一.入门系列: Flask入门系列(一)–Hello World 项目开发中,经常要写一些小系统来辅助,比如监控系统,配置系统等等.用传统的Java写,太笨重了,连PHP都嫌麻烦.一直在寻找一个轻量级 ...

  9. 使用Java将搜狗词库文件(文件后缀为.scel)转为.txt文件

    要做一个根据词库进行筛选主要词汇的功能,去搜狗下载专业词汇词库时,发现是.scel文件,且通过转换工具(http://tools.bugscaner.com/sceltotxt/)转换为txt时报错如 ...

随机推荐

  1. P2P技术基础: 关于TCP打洞技术

    4 关于TCP打洞技术 建立穿越NAT设备的p2p的 TCP 连接只比UDP复杂一点点,TCP协议的“打洞”从协议层来看是与UDP的“打洞”过程非常相似的.尽管如此,基于TCP协议的打洞至今为止还没有 ...

  2. 抛弃Https让Cas以Http协议提供单点登录服务

    本文环境: 1.apache-tomcat-7.0.50-windows-x86 2.cas-server-3.4.11 3.cas-client-3.2.1 将cas-server-webapp-3 ...

  3. 剑指offer-第四章解决面试题思路(字符串的排序)

    题目:输入一个字符串,打印出该字符串的全排列. 思路:将整个字符串分成两部分,第一部分为一个字符,将该字符和该字符后面的字符(直到最后一个字符)依次交换,确定第一个字符:然后固定第一个字符,将后面的字 ...

  4. [独孤九剑]持续集成实践(三)- Jenkins安装与配置(Jenkins+MSBuild+GitHub)

    本系列文章包含: [独孤九剑]持续集成实践(一)- 引子 [独孤九剑]持续集成实践(二)– MSBuild语法入门 [独孤九剑]持续集成实践(三)- Jenkins安装与配置(Jenkins+MSBu ...

  5. LinuxCentos6安装MySql workbench

    参考地址:http://www.cnblogs.com/elaron/archive/2013/03/19/2968699.html

  6. AtCoder Grand Contest 017 题解

    A - Biscuits 题目: 给出 \(n\) 个物品,每个物品有一个权值. 问有多少种选取方式使得物品权值之和 \(\bmod\space 2\) 为 \(p\). \(n \leq 50\) ...

  7. UVA136 Ugly Numbers

    题意 PDF 分析 用堆和集合维护即可. 时间复杂度\(O(1500 \log n)\) 代码 #include<iostream> #include<cstdio> #inc ...

  8. Log4j日志配置说明

    一.Log4j简介 Log4j有三个主要的组件:Loggers(记录器),Appenders (输出源)和Layouts(布局).这里可简单理解为日志类别,日志要输出的地方和日志以何种形式输出.综合使 ...

  9. awk 内容

                                        awk相关内容                                       #只要文件中的路径,不要文件名: [ ...

  10. Data_Structure-绪论作业

    一.作业题目 仿照三元组或复数的抽象数据类型写出有理数抽象数据类型的描述 (有理数是其分子.分母均为整数且分母不为零的分数). 有理数基本运算: 构造有理数T,元素e1,e2分别被赋以分子.分母值 销 ...