The following are notes from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

Statistics and distance based features

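This part covers advanced feature engineering: computing statistics of one feature grouped by another, and features derived from analyzing the neighborhood of a given point.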

groupby and nearest neighbor methods

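Example: here is some data from a CTR task

We can hypothesize that the ad with the lowest price on a page attracts most of the attention, while the other ads on the page are much less attractive. Features that capture this hypothesis are easy to compute: for every user and page pair, add the minimum and maximum ad price. The position of the lowest-priced ad can be used as a feature as well.

Implementation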
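The min/max price features described above can be sketched with pandas as follows. This is only an illustration under assumptions: the column names (user_id, page_id, ad_price, ad_position) and the toy values are made up, not taken from the course data.

```python
import pandas as pd

# Toy CTR log; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "page_id":     [10, 10, 10, 10, 11],
    "ad_price":    [100.0, 50.0, 75.0, 20.0, 30.0],
    "ad_position": [1, 2, 3, 1, 1],
})

keys = ["user_id", "page_id"]
grouped = df.groupby(keys)["ad_price"]

# Broadcast per-group statistics back onto every row of the group.
df["min_price"] = grouped.transform("min")
df["max_price"] = grouped.transform("max")
df["std_price"] = grouped.transform("std")  # NaN for single-row groups

# Position of the cheapest ad within each (user, page) group.
cheapest_rows = grouped.idxmin().to_numpy()
cheapest = df.loc[cheapest_rows, keys + ["ad_position"]]
cheapest = cheapest.rename(columns={"ad_position": "min_price_position"})
df = df.merge(cheapest, on=keys, how="left")
```

Using transform keeps the original row order and row count, so the new columns can be fed to a model directly.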

More features
How many pages user visited
Standard deviation of prices
Most visited page
Many, many more

What if there is no feature that can be used for a groupby like this? In that case, nearest neighbors can be used instead.

Neighbors

Explicit group is not needed
More flexible
Much harder to implement

Examples

Number of houses in 500m, 1000m,..
Average price per square meter in 500m, 1000m,..
Number of schools/supermarkets/parking lots in 500m, 1000m,..
Distance to closest subway station

The lecturer used this approach in the Springleaf competition.

KNN features in springleaf

Mean encode all the variables
For every point, find 2000 nearest neighbors using Bray-Curtis metric

\[\frac{\sum{|u_i - v_i|}}{\sum{|u_i + v_i|}}
\]

Calculate various features from those 2000 neighbors

Evaluate

Mean target of nearest 5, 10, 15, 500, 2000 neighbors
Mean distance to 10 closest neighbors
Mean distance to 10 closest neighbors with target 1
Mean distance to 10 closest neighbors with target 0
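The neighbor features listed above can be sketched with NumPy and SciPy. This is a simplified illustration, not the lecturer's actual pipeline: the data is random, the features are assumed to be already numeric (the mean-encoding step is skipped), only small neighbor counts are used, and the helper name knn_features is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_features(X_ref, y_ref, X_query, ks=(5, 10, 15), exclude_self=False):
    """Neighbor-based features of X_query rows relative to a labelled reference set."""
    dist = cdist(X_query, X_ref, metric="braycurtis")
    if exclude_self:
        # Only meaningful when X_query is X_ref: a point must not be its own neighbor.
        np.fill_diagonal(dist, np.inf)

    order = np.argsort(dist, axis=1)
    sorted_dist = np.take_along_axis(dist, order, axis=1)
    sorted_y = y_ref[order]  # targets of neighbors, nearest first

    feats = {}
    for k in ks:
        feats[f"mean_target_{k}"] = sorted_y[:, :k].mean(axis=1)
    feats["mean_dist_10"] = sorted_dist[:, :10].mean(axis=1)
    for cls in (0, 1):
        # Keep distances to neighbors of one class; push the rest to +inf.
        d_cls = np.where(y_ref[None, :] == cls, dist, np.inf)
        d_cls = np.sort(d_cls, axis=1)[:, :10]
        feats[f"mean_dist_10_target_{cls}"] = d_cls.mean(axis=1)
    return feats

rng = np.random.default_rng(0)
X = rng.random((200, 8)) + 0.1          # strictly positive toy features
y = rng.integers(0, 2, size=200)        # toy binary target
features = knn_features(X, y, X, exclude_self=True)
```

Because these features use the target of neighboring points, in practice they would normally be computed out of fold on the training set to limit target leakage.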

Matrix factorizations for feature extraction

Example of feature fusion

Notes about Matrix Factorization

Can be applied to only some of the columns
Can provide additional diversity
Good for ensembles
It is a lossy transformation. Its efficiency depends on:
Particular task
Number of latent factors
- Usually 5-100

Implementation

Several MF methods can be found in sklearn
SVD and PCA
Standard tools for Matrix Factorization
TruncatedSVD
Works with sparse matrices
Non-negative Matrix Factorization (NMF)
Ensures that all latent factors are non-negative
Good for counts-like data
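A minimal sketch of extracting latent-factor features with the sklearn classes named above. The toy count matrix, the train/test split, and the choice of 5 latent factors are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 30)).astype(float)  # toy count-like data
X_train, X_test = counts[:80], counts[80:]

# TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=5, random_state=0)
svd_train = svd.fit_transform(csr_matrix(X_train))
svd_test = svd.transform(csr_matrix(X_test))

# NMF requires non-negative input and yields non-negative latent factors.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
nmf_train = nmf.fit_transform(X_train)
nmf_test = nmf.transform(X_test)
```

The same fitted model transforms both train and test, so the latent factors mean the same thing in both sets.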

NMF for tree-based methods

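Non-negative matrix factorization (NMF) transforms the data in a way that makes it more suitable for decision trees.

As can be seen, NMF transforms the data so that it forms lines parallel to the axes.

Factorization

Tricks that work for linear models can also be used when factorizing matrices.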

Conclusion

Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
It can be applied to transform categorical features into real-valued ones
Many of the tricks suitable for linear models can be useful for MF

Feature interactions

All combinations of feature values

Example: banner selection

Suppose we are building a model that predicts the best advertising banner to display on a website.

...	category_ad	category_site	...	is_clicked
...	auto_part	game_news	...	0
...	music_tickets	music_news	..	1
...	mobile_phones	auto_blog	...	0

Combining the category of the banner itself with the category of the site where the banner will be shown produces a very strong feature.

...	ad_site	...	is_clicked
...	auto_part | game_news	...	0
...	music_tickets | music_news	..	1
...	mobile_phones | auto_blog	...	0

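Build the combined feature ad_site from these two features.

From a technical point of view, there are two ways to build such an interaction.

Example of interactions

Method 1

Method 2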
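A minimal pandas sketch of two common ways to build such a categorical interaction. It assumes that Method 1 means concatenating the category strings and then one-hot encoding the result, and that Method 2 means one-hot encoding each column first and then multiplying the indicator columns pairwise; the notes themselves do not spell the two methods out.

```python
import pandas as pd

df = pd.DataFrame({
    "category_ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "category_site": ["game_news", "music_news", "auto_blog"],
})

# Assumed Method 1: concatenate the two categories, then one-hot encode.
df["ad_site"] = df["category_ad"] + " | " + df["category_site"]
way1 = pd.get_dummies(df["ad_site"], dtype=int)

# Assumed Method 2: one-hot encode each column, then multiply indicators pairwise.
ad = pd.get_dummies(df["category_ad"], dtype=int)
site = pd.get_dummies(df["category_site"], dtype=int)
way2 = pd.DataFrame({
    f"{a} | {s}": ad[a] * site[s] for a in ad.columns for s in site.columns
})
# Drop pairs that never occur together in the data.
way2 = way2.loc[:, way2.sum() > 0]
```

Both routes end with the same indicator columns here; the second one generalizes more directly to real-valued features, where the product of two columns is itself a useful feature.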

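A similar idea can also be applied to numerical variables.

In fact, this is not limited to multiplication; other operations can be used as well: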

Multiplication
Sum
Diff
Division
..
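The operations listed above can be generated for every pair of numeric columns, for example as sketched below. The helper name add_numeric_interactions, the column names, and the small eps added to the denominator are assumptions for illustration.

```python
import itertools
import pandas as pd

def add_numeric_interactions(df, cols, eps=1e-9):
    """Append product, sum, difference and ratio for every pair of columns."""
    out = df.copy()
    for a, b in itertools.combinations(cols, 2):
        out[f"{a}_mul_{b}"] = df[a] * df[b]
        out[f"{a}_sum_{b}"] = df[a] + df[b]
        out[f"{a}_diff_{b}"] = df[a] - df[b]
        # eps avoids division by zero when the denominator is exactly 0.
        out[f"{a}_div_{b}"] = df[a] / (df[b] + eps)
    return out

toy = pd.DataFrame({"f1": [1.0, 2.0], "f2": [2.0, 4.0]})
out = add_numeric_interactions(toy, ["f1", "f2"])
```

Each pair adds four columns, so the number of generated features grows quadratically with the number of input columns.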

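Practical Notes

We have a lot of possible interactions: N*N for N features.
a. Even more if several types of interactions are used
Need to reduce their number
a. Dimensionality reduction
b. Feature selection

This approach generates a large number of features, which can be reduced with feature selection or dimensionality reduction. Feature selection is used as the example here.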
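One way to do such a selection, sketched here under assumptions, is to fit a random forest on all generated interactions and keep the ones with the highest importance. The synthetic data, the use of products only, and the cut-off of two features are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Toy target that depends only on the product of the first two features.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Generate all pairwise products.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
X_inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

# Rank the interactions by random forest importance and keep the best ones.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_inter, y)
ranking = np.argsort(model.feature_importances_)[::-1]
top_pairs = [pairs[i] for i in ranking[:2]]
```

On this toy data the pair (0, 1) ranks first, since the target is defined from exactly that product.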

Interactions' order

We looked at 2nd order interactions.
Such an approach can be generalized to higher orders.
It is hard to do generation and selection automatically.
Manual building of high-order interactions is some kind of art.

Extract features from DT

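Consider a decision tree. Each leaf can be mapped to a binary feature, and the index of the leaf an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees rather than a single tree, for example a random forest, the same operation can be applied to every tree in it. This is a powerful way to extract high-order interactions.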

How to use it

In sklearn:

tree_model.apply()

In xgboost:

booster.predict(pred_leaf=True)
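A minimal sketch of the sklearn route: apply() returns the leaf index of every sample in every tree, and those indices are then one-hot encoded into binary features. The synthetic dataset and the forest settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X, y)

# One column per tree; each value is the index of the leaf the sample lands in.
leaves = forest.apply(X)

# Treat every tree's leaf index as a categorical feature and one-hot encode it.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)  # sparse matrix
```

Every row of leaf_features has exactly one active column per tree, so the result is sparse and works well as input to a linear model.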

Conclusion

We looked at ways to build an interaction of categorical attributes
Extended this approach to real-valued features
Learned how to extract features via decision trees

t-SNE

t-SNE is used for exploratory data analysis. It can also be viewed as a method for obtaining features from data.

Practical Notes

The result depends heavily on hyperparameters (perplexity)
Good practice is to use several projections with different perplexities (5-100)
Due to its stochastic nature, tSNE gives different projections even for the same data and hyperparameters
Train and test should be projected together
tSNE runs for a long time when the number of features is large
It is common to do dimensionality reduction before projection.
An implementation of tSNE can be found in the sklearn library.
Personally, though, I would suggest the stand-alone Python package tsne because it is faster.
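The notes above can be combined into a short sklearn sketch: stack train and test, reduce dimensionality first, then project with several perplexities. The random data, the 10 PCA components, and the specific perplexity values are assumptions chosen to keep the example small; sklearn is used here rather than the stand-alone tsne package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 20))
X_test = rng.normal(size=(20, 20))

# Project train and test together so they share one embedding.
X_all = np.vstack([X_train, X_test])

# Reduce dimensionality before running tSNE.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_all)

# Several projections with different perplexities (each must be < n_samples).
projections = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    projections[perplexity] = tsne.fit_transform(X_reduced)

# Split the embeddings back into train and test parts.
n_train = len(X_train)
tsne_train = {p: emb[:n_train] for p, emb in projections.items()}
tsne_test = {p: emb[n_train:] for p, emb in projections.items()}
```

Fixing random_state makes a single run reproducible, but different seeds will still give different projections.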

Conclusion

tSNE is a great tool for visualization
It can be used as a feature as well
Be careful with interpretation of results
Try different perplexities

Matrix factorization:

Overview of matrix decomposition methods (sklearn)
