[amazonaccess 1] logistic.py: feature extraction
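This post walks through logistic.py.
About amazonaccess:
The task is to predict, from the profile of a newly hired employee (role code, role family code, and similar features), whether the employee should have access to a given resource.
Key ideas in logistic.py (Python):
1. Build new features by combining several existing features
For example, combining MGR_ID and ROLE_FAMILY gives the new feature hash((85475,290919))=1071656665
2. Greedy feature selection
i. From the candidate features, pick the single feature that performs best on the training set, add it to good_features, and remove it from the candidates
ii. Evaluate each remaining candidate together with the features already in good_features; add the candidate that performs best on the training data to good_features and remove it from the candidates
iii. Keep selecting until the performance on the training set (the mean AUC returned by cv_loop in the code) stops improving
3. One Hot Encoding
For example, to encode the discrete data [23 33 33 44]:
i. First relabel it as [0 1 1 2]
ii. Encode 0 as 0 0 1, which corresponds to 23
    Encode 1 as 0 1 0, which corresponds to 33
    Encode 2 as 1 0 0, which corresponds to 44
With this encoding, the final linear model learns a separate weight for every label of a discrete feature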
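The relabel-then-encode example above can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not the code from logistic.py: the helper names relabel and one_hot are made up here, and column k is used for label k, whereas the listing above writes the bits in the opposite order (the column order is arbitrary as long as it is consistent).

```python
def relabel(values):
    """Map each distinct value to a small integer, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot(labels, num_labels):
    """Return one row per label, with a single 1 in the column of that label."""
    return [[1 if k == lab else 0 for k in range(num_labels)] for lab in labels]


labels, mapping = relabel([23, 33, 33, 44])
encoded = one_hot(labels, len(mapping))
print(labels)   # [0, 1, 1, 2]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
```

Each column of encoded becomes one input of the linear model, which is why every label ends up with its own weight.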
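Code walkthrough:
1. Read the data and drop the ROLE_CODE column
    # Imports used by the snippets in this post
    import numpy as np
    import pandas as pd
    from numpy import array
    from itertools import combinations
    from scipy import sparse
    from sklearn import preprocessing

    learner = 'log'
    print("Reading dataset...")
    train_data = pd.read_csv('train.csv')
    test_data = pd.read_csv('test.csv')
    # SEED is defined elsewhere in logistic.py (not shown in this post)
    submit = learner + str(SEED) + '.csv'
    # The slice 1:-1 drops the first column (ACTION in train.csv, the row id in test.csv)
    # and the last column (ROLE_CODE). train and test must go through the same
    # transformations, so they are stacked into one array.
    all_data = np.vstack((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:-1]))
    num_train = np.shape(train_data)[0]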
2. Relabel the data
    # Transform data
    print("Transforming data...")
    # Relabel the variable values to smallest possible so that I can use bincount
    # on them later.
    relabler = preprocessing.LabelEncoder()
    for col in range(len(all_data[0, :])):
        relabler.fit(all_data[:, col])
        all_data[:, col] = relabler.transform(all_data[:, col])
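3. Combine features to create new features. Combinations of 2 features and of 3 features are built, giving (28-2) and (56-12) new features respectively (8 columns give 28 pairs and 56 triples; 2 pairs and 12 triples are skipped). The new features are then merged with the original ones
When combining, any combination containing both ROLE_TITLE and ROLE_FAMILY (column indices 5 and 7) or both ROLE_ROLLUP_1 and ROLE_ROLLUP_2 (column indices 2 and 3) is excluded
Many labels of the resulting features occur in only 1 or 2 rows, so these rare labels are merged into shared labels
The function that combines features:
    def group_data(data, degree=3, hash=hash):
        """
        numpy.array -> numpy.array

        Groups all columns of data into all combinations of `degree` columns
        """
        new_data = []
        m, n = data.shape
        for indicies in combinations(range(n), degree):
            # Skip combinations containing both ROLE_TITLE and ROLE_FAMILY
            if 5 in indicies and 7 in indicies:
                print("feature Xd")
            # Skip combinations containing both ROLE_ROLLUP_1 and ROLE_ROLLUP_2
            elif 2 in indicies and 3 in indicies:
                print("feature Xd")
            else:
                # One new column: the hash of each row's values in the chosen columns
                new_data.append([hash(tuple(v)) for v in data[:, indicies]])
        return array(new_data).T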
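The feature counts quoted for this step, 28-2 pairs and 56-12 triples, can be checked with a small standalone sketch. It assumes 8 input columns and the two excluded index pairs used in group_data; kept_combinations and combine_row are hypothetical helper names, and the two rows are toy data.

```python
from itertools import combinations

# Index pairs that must not appear together, as in group_data:
# (5, 7) and (2, 3).
EXCLUDED = [{5, 7}, {2, 3}]


def kept_combinations(n_columns, degree):
    """All column-index combinations that contain no excluded pair."""
    return [idx for idx in combinations(range(n_columns), degree)
            if not any(pair <= set(idx) for pair in EXCLUDED)]


def combine_row(row, idx):
    """New feature value for one row: the hash of the selected column values."""
    return hash(tuple(row[i] for i in idx))


pairs = kept_combinations(8, 2)
triples = kept_combinations(8, 3)
print(len(pairs), len(triples))        # 26 44
print(8 + len(pairs) + len(triples))   # 78 columns after merging

# Two toy rows that agree on columns 1 and 7 get the same combined value.
row_a = [10, 85475, 1, 2, 3, 4, 5, 290919]
row_b = [99, 85475, 7, 8, 9, 6, 5, 290919]
print(combine_row(row_a, (1, 7)) == combine_row(row_b, (1, 7)))  # True
```

Hashing the tuple only serves to give each distinct combination of values its own label; the numeric value itself carries no meaning.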
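Merging the labels that occur in only 1 or 2 rows:
    dp = group_data(all_data, degree=2)
    for col in range(len(dp[0, :])):
        relabler.fit(dp[:, col])
        dp[:, col] = relabler.transform(dp[:, col])
        uniques = len(set(dp[:, col]))
        maximum = max(dp[:, col])
        print(col)
        if maximum < 65534:
            count_map = np.bincount((dp[:, col]).astype('uint16'))
            for n, i in enumerate(dp[:, col]):
                # Labels with only 1 row: merge into one shared label
                if count_map[i] <= 1:
                    dp[n, col] = uniques
                # Labels with exactly 2 rows: merge into a second shared label
                elif count_map[i] == 2:
                    dp[n, col] = uniques + 1
        else:
            # Count against a snapshot of the column, so that rows already
            # rewritten in this loop do not change the counts
            original = dp[:, col].copy()
            for n, i in enumerate(original):
                if (original == i).sum() <= 1:
                    dp[n, col] = uniques
                elif (original == i).sum() == 2:
                    dp[n, col] = uniques + 1
        print(uniques)  # unique values before merging
        uniques = len(set(dp[:, col]))
        print(uniques)  # unique values after merging
        relabler.fit(dp[:, col])
        dp[:, col] = relabler.transform(dp[:, col])
The 3-feature matrix dt used in the next snippet is not shown in this post; presumably it is built the same way, starting from group_data(all_data, degree=3).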
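The merging rule can be seen more easily on a toy column. The sketch below is a plain-Python restatement of the same idea using collections.Counter rather than np.bincount; merge_rare_labels is a hypothetical name and the column is made-up data.

```python
from collections import Counter


def merge_rare_labels(labels):
    """Replace labels seen once by one shared label and labels seen twice by another."""
    counts = Counter(labels)
    n_unique = len(counts)
    once_label = n_unique        # shared label for values that occur once
    twice_label = n_unique + 1   # shared label for values that occur twice
    merged = []
    for lab in labels:
        if counts[lab] <= 1:
            merged.append(once_label)
        elif counts[lab] == 2:
            merged.append(twice_label)
        else:
            merged.append(lab)
    return merged


# Label 2 occurs three times and is kept; 0 and 3 occur once; 1 and 4 occur twice.
column = [0, 1, 1, 2, 2, 2, 3, 4, 4]
print(merge_rare_labels(column))  # [5, 6, 6, 2, 2, 2, 5, 6, 6]
```

Because the labels were relabeled to 0..n_unique-1 beforehand, n_unique and n_unique+1 are guaranteed not to collide with an existing label. The final relabel in the loop above then compacts the values again.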
Merging the new features with the original features:
    # Collect the training features together
    y = array(train_data.ACTION)
    X = all_data[:num_train]
    X_2 = dp[:num_train]
    X_3 = dt[:num_train]

    # Collect the testing features together
    X_test = all_data[num_train:]
    X_test_2 = dp[num_train:]
    X_test_3 = dt[num_train:]

    X_train_all = np.hstack((X, X_2, X_3))
    X_test_all = np.hstack((X_test, X_test_2, X_test_3))
4. One hot encoding
    def OneHotEncoder(data, keymap=None):
        """
        OneHotEncoder takes data matrix with categorical columns and
        converts it to a sparse binary matrix.

        Returns sparse binary matrix and keymap mapping categories to indicies.
        If a keymap is supplied on input it will be used instead of creating one
        and any categories appearing in the data that are not in the keymap are
        ignored
        """
        if keymap is None:
            keymap = []
            for col in data.T:
                uniques = set(list(col))
                keymap.append(dict((key, i) for i, key in enumerate(uniques)))
        total_pts = data.shape[0]
        outdat = []
        for i, col in enumerate(data.T):
            km = keymap[i]
            num_labels = len(km)
            spmat = sparse.lil_matrix((total_pts, num_labels))
            for j, val in enumerate(col):
                if val in km:
                    spmat[j, km[val]] = 1
            outdat.append(spmat)
        outdat = sparse.hstack(outdat).tocsr()
        return outdat, keymap

    # Xts holds one hot encodings for each individual feature in memory
    # speeding up feature selection
    num_features = X_train_all.shape[1]
    Xts = [OneHotEncoder(X_train_all[:, [i]])[0] for i in range(num_features)]
5. Greedy feature selection
    # model and cv_loop are defined elsewhere in logistic.py (not shown in this post)
    print("Performing greedy feature selection...")
    score_hist = []
    N = 10
    good_features = set([])
    # Greedy feature selection loop
    while len(score_hist) < 2 or score_hist[-1][0] > score_hist[-2][0]:
        scores = []
        for f in range(len(Xts)):
            if f not in good_features:
                feats = list(good_features) + [f]
                Xt = sparse.hstack([Xts[j] for j in feats]).tocsr()
                score = cv_loop(Xt, y, model, N)
                scores.append((score, f))
                print("Feature: %i Mean AUC: %f" % (f, score))
        good_features.add(sorted(scores)[-1][1])
        score_hist.append(sorted(scores)[-1])
        print("Current features: %s" % sorted(list(good_features)))

    # Remove last added feature from good_features
    good_features.remove(score_hist[-1][1])
    good_features = sorted(list(good_features))
    print("Selected features %s" % good_features)
    with open("feats" + submit, 'w') as gf:
        gf.write("%s\n" % good_features)
    print("%d features" % len(good_features))
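The control flow of the loop above (add the best candidate each round, stop once the best score no longer improves, then discard the last addition) can be tried out without any data. The sketch below replaces cv_loop with a made-up scoring function; greedy_select, toy_score, and the GAIN table are all hypothetical and exist only for illustration.

```python
def greedy_select(num_features, score_fn):
    """Forward selection: add the best-scoring candidate until the score stops improving."""
    good = set()
    history = []  # (score, feature) of the winner of each round
    while len(history) < 2 or history[-1][0] > history[-2][0]:
        scores = [(score_fn(sorted(good | {f})), f)
                  for f in range(num_features) if f not in good]
        if not scores:  # every feature has already been selected
            break
        best = max(scores)
        good.add(best[1])
        history.append(best)
    # The last round did not improve the score, so undo its addition.
    if len(history) >= 2 and history[-1][0] <= history[-2][0]:
        good.remove(history[-1][1])
    return sorted(good)


# Toy stand-in for cv_loop: each feature adds a fixed gain, each extra feature costs 0.06.
GAIN = {0: 0.10, 1: 0.30, 2: 0.05, 3: 0.00}


def toy_score(features):
    return 0.5 + sum(GAIN[f] for f in features) - 0.06 * len(features)


selected = greedy_select(4, toy_score)
print(selected)  # [0, 1]
```

With the toy scores, feature 1 wins the first round, feature 0 improves on it in the second, and the third round's best candidate lowers the score, so it is dropped again, mirroring the good_features.remove call in the original code.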
6. Choose the best hyperparameter by validation. For logistic regression this is the regularization parameter C (in scikit-learn's LogisticRegression, C is the inverse of the regularization strength)
    print("Performing hyperparameter selection...")
    # Hyperparameter selection loop
    score_hist = []
    Xt = sparse.hstack([Xts[j] for j in good_features]).tocsr()
    if learner == 'NB':
        Cvals = [0.001, 0.003, 0.006, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1]
    else:
        Cvals = np.logspace(-4, 4, 15, base=2)  # for logistic
    for C in Cvals:
        if learner == 'NB':
            model.alpha = C
        else:
            model.C = C
        score = cv_loop(Xt, y, model, N)
        score_hist.append((score, C))
        print("C: %f Mean AUC: %f" % (C, score))
    bestC = sorted(score_hist)[-1][1]
    print("Best C value: %f" % (bestC))
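To make the search grid concrete: np.logspace(-4, 4, 15, base=2) yields 15 values from 2^-4 to 2^4 whose exponents are evenly spaced. The sketch below rebuilds that grid in plain Python and applies the same "sort and take the last" selection. The scoring function is a made-up stand-in for cv_loop that happens to peak at C = 1.

```python
import math

# Same grid as np.logspace(-4, 4, 15, base=2): exponents evenly spaced from -4 to 4.
Cvals = [2 ** (-4 + 8 * i / 14) for i in range(15)]
print(Cvals[0], Cvals[7], Cvals[-1])  # 0.0625 1.0 16.0


def toy_cv_score(C):
    """Hypothetical validation score, highest at C = 1."""
    return -abs(math.log2(C))


score_hist = [(toy_cv_score(C), C) for C in Cvals]
bestC = sorted(score_hist)[-1][1]  # the tuple with the highest score sorts last
print(bestC)  # 1.0
```

Storing (score, C) tuples means that sorting orders by score first, so the last element carries the best C.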
7. Prediction
    # create_test_submission is a helper defined elsewhere in logistic.py (not shown in this post)
    print("Performing One Hot Encoding on entire dataset...")
    Xt = np.vstack((X_train_all[:, good_features], X_test_all[:, good_features]))
    Xt, keymap = OneHotEncoder(Xt)
    X_train = Xt[:num_train]
    X_test = Xt[num_train:]

    if learner == 'NB':
        model.alpha = bestC
    else:
        model.C = bestC

    print("Training full model...")
    print("Making prediction and saving results...")
    model.fit(X_train, y)
    preds = model.predict_proba(X_test)[:, 1]
    create_test_submission(submit, preds)
    preds = model.predict_proba(X_train)[:, 1]
    create_test_submission('Train' + submit, preds)