[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合

Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when:

  • 1.Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
  • 2.Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function import numpy as np from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key. The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e. >> len(data[key]) == n_samples Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample). ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc. >> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data) ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`. Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key def fit(self, x, y=None):
return self def transform(self, data_dict):
return data_dict[self.key] class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer""" def fit(self, x, y=None):
return self def transform(self, posts):
return [{'length': len(text),
'num_sentences': text.count('.')}
for text in posts] class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
"""Extract the subject & body from a usenet post in a single pass. Takes a sequence of strings and produces a dict of sequences. Keys are
`subject` and `body`.
"""
def fit(self, x, y=None):
return self def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub return features pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()), # Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[ # Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])), # Pipeline for standard bag-of-words model for body
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])), # Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])), ], # weight components in FeatureUnion
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
)), # Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
]) # limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
subset='train',
categories=categories,
)
test = fetch_20newsgroups(random_state=1,
subset='test',
categories=categories,
) pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print(classification_report(y, test.target))

[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合的更多相关文章

  1. [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

    [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

  2. scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类 (python代码)

    scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类数据集 fetch_20newsgroups #-*- coding: UTF-8 -*- import ...

  3. Scikit Learn: 在python中机器学习

    转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...

  4. Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一)

    原文:Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一) 拓展压缩包的使用方式详细介绍 1:将拓展包解压:ThinkPHP3.1.2_Extend.zip   --> 将其下的 \ ...

  5. (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探

    一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...

  6. (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探

    目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...

  7. Scikit Learn

    Scikit Learn Scikit-Learn简称sklearn,基于 Python 语言的,简单高效的数据挖掘和数据分析工具,建立在 NumPy,SciPy 和 matplotlib 上.

  8. [占位-未完成]scikit-learn一般实例之十二:用于RBF核的显式特征映射逼近

    It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for clas ...

  9. Linear Regression with Scikit Learn

    Before you read  This is a demo or practice about how to use Simple-Linear-Regression in scikit-lear ...

随机推荐

  1. Node.js:Buffer浅谈

    Javascript在客户端对于unicode编码的数据操作支持非常友好,但是对二进制数据的处理就不尽人意.Node.js为了能够处理二进制数据或非unicode编码的数据,便设计了Buffer类,该 ...

  2. dubbox微服务实例及引发的“血案”

    Dubbo 是阿里巴巴公司开源的一个高性能优秀的服务框架,使得应用可通过高性能的 RPC 实现服务的输出和输入功能,可以和 Spring框架无缝集成. 主要核心部件: Remoting: 网络通信框架 ...

  3. Hbase的伪分布式安装

    Hbase安装模式介绍 单机模式 1> Hbase不使用HDFS,仅使用本地文件系统 2> ZooKeeper与Hbase运行在同一个JVM中 分布式模式– 伪分布式模式1> 所有进 ...

  4. Mysql存储引擎比较

    Mysql作为一个开源的免费数据库,在平时项目当中会经常使用到,而在项目当中我们的着重点一般在设计使用数据库上而非mysql本身上,所以在提到mysql的存储引擎时,一般都不曾知道,这里经过网上相关文 ...

  5. 基于Composer Player 模型加载和相关属性设置

    主要是基于达索软件Composer Player.的基础上做些二次开发. public class ComposerToolBarSetting { public bool AntiAliasingO ...

  6. PLSql Oracle配置

    1.安装Oracle客户端或者服务端 2.配置环境变量 <1>.一般如果安装了Oracle客户端或者服务端的话,在环境变种的Path中有Oracle的安装路径(计算机-属性-高级系统设置- ...

  7. Struts的拦截器

    Struts的拦截器 1.什么是拦截器 Struts的拦截器和Servlet过滤器类似,在执行Action的execute方法之前,Struts会首先执行Struts.xml中引用的拦截器,在执行完所 ...

  8. ASP.NET Core 中间件详解及项目实战

    前言 在上篇文章主要介绍了DotNetCore项目状况,本篇文章是我们在开发自己的项目中实际使用的,比较贴合实际应用,算是对中间件的一个深入使用了,不是简单的Hello World,如果你觉得本篇文章 ...

  9. 从EF的使用中探讨业务模型能否脱离单一存储层完全抽象存在

    上次赶时间,就很流水账地写了上次项目对EF的一次实践应用模式,因为太长了,也没能探讨太多,所以再继续扩展. 这次想探讨的是,实体,如果作为类似于领域模型的业务模型存在,它的数据能否来自不同的数据源.这 ...

  10. 实现ABP中Person类的权限功能

    菜单项的显示功能已经完全OK了.那么我们就开始制作视图功能吧. 首先测试接口是否正常 我们通过代码生成器将权限和application中大部分功能已经实现了.那么我们来测试下这些接口ok不. 浏览/a ...