Data Transformation / Learning with Counts
Handling Discrete Features in Machine Learning
Updated: August 25, 2016
Learning with counts is an efficient way to create a compact set of features for a dataset, based on counts of the values. You can use the modules in this section to build a set of counts and features, and later update the counts and the features to take advantage of new data, or merge two sets of count data.
The basic idea underlying count-based featurization is simple: by calculating counts, you can quickly and easily get a summary of which columns contain the most important information. The module counts the number of times each value appears, and then provides that information as a feature for input to a model.
Example of Count-Based Learning
Imagine you’re trying to validate a credit card transaction. One crucial piece of information is where this transaction came from, and one of the most common encodings of that location is the postal code. However, there might be as many as 40,000 postal codes, zip codes, and geographical codes to account for. Does your model have the capacity to learn 40,000 more parameters? If you give it that capacity, do you now have enough training data to prevent it from overfitting?
If you had really good data with lots of samples, such fine-grained local granularity could be quite powerful. However, if you have only one sample of a fraudulent transaction from a small locality, does it mean that all of the transactions from that place are bad, or that you don’t have enough data?
One solution to this conundrum is to learn with counts. That is, rather than introduce 40,000 more features, you can observe the counts and proportions of fraud for each postal code. By using these counts as features, you gain a notion of the strength of the evidence for each value. Moreover, by encoding the relevant statistics of the counts, the learner can use the statistics to decide when to back off and use other features.
Count-based learning is attractive for several reasons: you have fewer features, which require fewer parameters, and that makes for faster learning, faster prediction, smaller predictors, and less potential to overfit.
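As a rough illustration of the difference in feature width, the sketch below compares one-hot encoding with count-based encoding on a handful of made-up transactions. The postal codes, the labels, and the helper name `count_features` are all hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (postal_code, is_fraud) pairs.
transactions = [
    ("98052", 0), ("98052", 0), ("98052", 1),
    ("10001", 0), ("10001", 0),
    ("94105", 0),
    ("73301", 0), ("73301", 1),
    ("60601", 1),
]

# One-hot encoding needs one column per distinct postal code.
distinct_codes = sorted({code for code, _ in transactions})
one_hot_width = len(distinct_codes)

# Count-based encoding keeps per-class counts for each postal code.
counts = defaultdict(Counter)
for code, label in transactions:
    counts[code][label] += 1

def count_features(code):
    """Return (non-fraud count, fraud count, fraud proportion) for a code."""
    n0, n1 = counts[code][0], counts[code][1]
    total = n0 + n1
    proportion = n1 / total if total else 0.0
    return (n0, n1, proportion)

# The count-based width is fixed, however many distinct codes exist.
count_width = len(count_features("98052"))

print(one_hot_width, count_width)   # 5 3
print(count_features("60601"))      # (0, 1, 1.0)
```

The last line shows why the raw counts matter alongside the proportion: postal code 60601 has a fraud proportion of 1.0, but it rests on a single transaction, so the evidence is weak.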
How Counts are Created
An example might help to demonstrate how count-based features are created and applied. This example is highly simplified, to give you an idea of the overall process, and how to use and interpret count-based features.
Suppose you have a table like this, with labels and inputs:
| Label column | Input value |
|---|---|
| 0 | A |
| 0 | A |
| 1 | A |
| 0 | B |
| 1 | B |
| 1 | B |
| 1 | B |
Here is how count-based features are created:
Each case (or row, or sample) has a set of values in columns.
Here, the values are A, B, and so forth.
For a particular set of values, you find all the other cases in that dataset that have the same value.
In this case, there are three instances of A and four of B.
Next, you count their class memberships as features in themselves.
In this case, you get a small matrix: for value A, there are 2 cases with label 0 and 1 case with label 1; for value B, there is 1 case with label 0 and 3 cases with label 1.
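The counting steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's implementation; the variable names `rows` and `count_table` are made up for the example.

```python
from collections import Counter, defaultdict

# (label, input value) rows copied from the example table.
rows = [(0, "A"), (0, "A"), (1, "A"), (0, "B"), (1, "B"), (1, "B"), (1, "B")]

# For each input value, count how many rows fall into each class.
count_table = defaultdict(Counter)
for label, value in rows:
    count_table[value][label] += 1

for value in sorted(count_table):
    print(value, count_table[value][0], count_table[value][1])
# A 2 1
# B 1 3
```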
When you create features based on this matrix, you get a variety of count-based features, including a calculation of the log-odds ratio as well as the counts for each target class:
| Label | 0_0_Class000_Count | 0_0_Class001_Count | 0_0_Class000_LogOdds | 0_0_IsBackoff |
|---|---|---|---|---|
| 0 | 2 | 1 | 0.510826 | 0 |
| 0 | 2 | 1 | 0.510826 | 0 |
| 1 | 2 | 1 | 0.510826 | 0 |
| 0 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |
Examples
An article from the Microsoft Machine Learning team provides a detailed walkthrough of how to use counts in machine learning, and compares the efficacy of count-based modeling with other methods.
Technical Notes
How is the log-odds value calculated?
The reported LogOdds value is not the plain log-odds; the prior distribution is used to smooth the log-odds computation.
Suppose you have a data set used for binary classification. In this dataset, the prior frequency for class 0 is p_0, and the prior frequency for class 1 is p_1 = 1 – p_0. For a certain training example feature, the count for class 0 is x_0, and the count for class 1 is x_1.
Under these assumptions, the log-odds is computed as:
LogOdds = Log(x_0 + c * p_0) – Log (x_1 + c * p_1)
Where:
c is the prior coefficient, which can be set by the user.
Log uses the natural base.
In other words, for each class i:
Log_odds[i] = Log( (count[i] + prior_coefficient * prior_frequency[i]) / (sum_of_counts - count[i] + prior_coefficient * (1 - prior_frequency[i])) )
If the prior coefficient is positive, the log odds can be different from Log(count[i] / (sum_of_counts – count[i])).
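The smoothed formula can be checked numerically with a minimal sketch. The function name `smoothed_log_odds` is made up for this example. The settings c = 1 and p_0 = 0.5 are assumptions: the text does not say which values produced the example table, but these happen to reproduce its LogOdds column.

```python
import math

def smoothed_log_odds(x0, x1, p0, c):
    """LogOdds = Log(x0 + c * p0) - Log(x1 + c * p1), where p1 = 1 - p0."""
    p1 = 1.0 - p0
    return math.log(x0 + c * p0) - math.log(x1 + c * p1)

# Assumed settings (not stated in the text): c = 1, p0 = 0.5.
log_odds_a = smoothed_log_odds(2, 1, p0=0.5, c=1.0)  # value A: counts 2 and 1
log_odds_b = smoothed_log_odds(1, 3, p0=0.5, c=1.0)  # value B: counts 1 and 3

# With c = 0 the formula reduces to the plain log-odds Log(x0 / x1).
plain_a = smoothed_log_odds(2, 1, p0=0.5, c=0.0)

print(round(log_odds_a, 6))  # 0.510826
print(round(log_odds_b, 4))  # -0.8473
print(round(plain_a, 6))     # 0.693147
```

Comparing the first and last results shows the effect of the prior: with a positive coefficient, the estimate for A is pulled from the plain log-odds of about 0.693 toward 0, which is the log-odds implied by a 0.5 prior.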
Why are the log odds not computed for some items?
By default, all items with a count less than 10 are collected in a single bucket called the "garbage bin". You can change this threshold by using the Garbage bin threshold option in the Modify Count Table Parameters module.
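One way to picture the garbage bin is the sketch below, which folds every value whose total count is under the threshold into one shared bucket. This is an assumption about the general idea only; the bucket name `__garbage_bin__` and the function `apply_garbage_bin` are hypothetical, and the module's actual implementation is not described here.

```python
from collections import Counter

GARBAGE_BIN = "__garbage_bin__"  # hypothetical name for the shared bucket

def apply_garbage_bin(value_counts, threshold=10):
    """Fold values whose count is below the threshold into one bucket."""
    binned = Counter()
    for value, count in value_counts.items():
        key = value if count >= threshold else GARBAGE_BIN
        binned[key] += count
    return binned

# Hypothetical total counts per postal code.
value_counts = Counter({"98052": 120, "10001": 45, "60601": 3, "73301": 7})
binned = apply_garbage_bin(value_counts)

print(binned["98052"], binned["10001"], binned[GARBAGE_BIN])  # 120 45 10
```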
List of Modules
The Learning with Counts category includes the following modules:
| Module | Description |
|---|---|
| Build Counting Transform | Creates a count table and count-based features from a dataset, and saves it as a transformation |
| Export Count Table | Exports a count table from a counting transform. This module supports backward compatibility with experiments that create count-based features using Build Count Table (deprecated) and Count Featurizer (deprecated). |
| Import Count Table | Imports an existing count table. This module supports backward compatibility with experiments that create count-based features using Build Count Table (deprecated) and Count Featurizer (deprecated). It supports conversion of count tables to count transformations. |
| Merge Count Transform | Merges two sets of count-based features |
| Modify Count Table Parameters | Modifies count-based features derived from an existing count table |
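To give a feel for what merging two sets of counts could involve, the sketch below sums the per-class counts for each value. It assumes that merging amounts to simple addition, which the text does not confirm; the function `merge_count_tables` and the data are hypothetical.

```python
from collections import Counter

def merge_count_tables(first, second):
    """Sum per-class counts for every value found in either table."""
    merged = {}
    for value in first.keys() | second.keys():
        merged[value] = first.get(value, Counter()) + second.get(value, Counter())
    return merged

# Counts from the earlier example, plus hypothetical counts from new data.
old_counts = {"A": Counter({0: 2, 1: 1}), "B": Counter({0: 1, 1: 3})}
new_counts = {"B": Counter({0: 2}), "C": Counter({1: 4})}

merged = merge_count_tables(old_counts, new_counts)
for value in sorted(merged):
    print(value, merged[value][0], merged[value][1])
# A 2 1
# B 3 3
# C 0 4
```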