[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合

Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when:

  • 1.Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
  • 2.Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function import numpy as np from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key. The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e. >> len(data[key]) == n_samples Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample). ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc. >> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data) ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`. Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key def fit(self, x, y=None):
return self def transform(self, data_dict):
return data_dict[self.key] class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer""" def fit(self, x, y=None):
return self def transform(self, posts):
return [{'length': len(text),
'num_sentences': text.count('.')}
for text in posts] class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
"""Extract the subject & body from a usenet post in a single pass. Takes a sequence of strings and produces a dict of sequences. Keys are
`subject` and `body`.
"""
def fit(self, x, y=None):
return self def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub return features pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()), # Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[ # Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])), # Pipeline for standard bag-of-words model for body
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])), # Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])), ], # weight components in FeatureUnion
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
)), # Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
]) # limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
subset='train',
categories=categories,
)
test = fetch_20newsgroups(random_state=1,
subset='test',
categories=categories,
) pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print(classification_report(y, test.target))

[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合的更多相关文章

  1. [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

    [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

  2. scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类 (python代码)

    scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类数据集 fetch_20newsgroups #-*- coding: UTF-8 -*- import ...

  3. Scikit Learn: 在python中机器学习

    转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...

  4. Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一)

    原文:Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一) 拓展压缩包的使用方式详细介绍 1:将拓展包解压:ThinkPHP3.1.2_Extend.zip   --> 将其下的 \ ...

  5. (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探

    一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...

  6. (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探

    目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...

  7. Scikit Learn

    Scikit Learn Scikit-Learn简称sklearn,基于 Python 语言的,简单高效的数据挖掘和数据分析工具,建立在 NumPy,SciPy 和 matplotlib 上.

  8. [占位-未完成]scikit-learn一般实例之十二:用于RBF核的显式特征映射逼近

    It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for clas ...

  9. Linear Regression with Scikit Learn

    Before you read  This is a demo or practice about how to use Simple-Linear-Regression in scikit-lear ...

随机推荐

  1. lua 学习笔记(1)

    一.lua函数赋值与函数调用         在lua中函数名也是作为一种变量出现的,即函数和所有其他值一样都是匿名的,当要使用某个函数时,需要将该函数赋值给一个变量,这样在函数块的其他地方就可以通过 ...

  2. JSP 标准标签库(JSTL)

    JSP 标准标签库(JSTL) JSP标准标签库(JSTL)是一个JSP标签集合,它封装了JSP应用的通用核心功能. JSTL支持通用的.结构化的任务,比如迭代,条件判断,XML文档操作,国际化标签, ...

  3. Android开发学习——动画

    帧动画> 一张张图片不断的切换,形成动画效果* 在drawable目录下定义xml文件,子节点为animation-list,在这里定义要显示的图片和每张图片的显示时长              ...

  4. 仿陌陌的ios客户端+服务端源码项目

    软件功能:模仿陌陌客户端,功能很相似,注册.登陆.上传照片.浏览照片.浏览查找附近会员.关注.取消关注.聊天.语音和文字聊天,还有拼车和搭车的功能,支持微博分享和查找好友. 后台是php+mysql, ...

  5. 如何开启MySQL 5.7.12 的二进制日志

    1. 打开/etc下的my.cnf文件 2. 编辑它,添加内容: log_bin=binary-log   #二进制日志的文件名 server_id=1  #必须指定server_id,这是MySQL ...

  6. ASP.NET 5 Beta 7 版本

    在 VS2015 发布的同时,微软也发布了 ASP.NET 5 的路线图(详见ASP.NET 5 Schedule and Roadmap : https://github.com/aspnet/ho ...

  7. 吸顶大法 -- UWP中的工具栏吸顶的实现方式之一

    如果一个页面中有很长的列表/内容,很多应用都会在用户向下滚动时隐藏页面的头,给用户留出更多的阅读空间,同时提供一个方便的吸顶工具栏,比如淘宝中的店铺页面. 下面是一个比较简单的实现,如果有同学有更好的 ...

  8. Spring5:@Autowired注解、@Resource注解和@Service注解

    什么是注解 传统的Spring做法是使用.xml文件来对bean进行注入或者是配置aop.事物,这么做有两个缺点: 1.如果所有的内容都配置在.xml文件中,那么.xml文件将会十分庞大:如果按需求分 ...

  9. 实现了一个百度首页的彩蛋——CSS3 Animation简介

    在百度搜索中有这样一个彩蛋:搜索“旋转”,“跳跃”,“反转”等词语,会出现相应的动画效果(搜索“反转”后的效果).查看源码可以发现,这些效果正是通过CSS3的animation属性实现的. 实现这个彩 ...

  10. 这些年一直记不住的 Java I/O

    参考资料 该文中的内容来源于 Oracle 的官方文档.Oracle 在 Java 方面的文档是非常完善的.对 Java 8 感兴趣的朋友,可以从这个总入口 Java SE 8 Documentati ...