[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合

Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when:

  • 1.Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
  • 2.Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function import numpy as np from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key. The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e. >> len(data[key]) == n_samples Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample). ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc. >> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data) ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`. Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key def fit(self, x, y=None):
return self def transform(self, data_dict):
return data_dict[self.key] class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer""" def fit(self, x, y=None):
return self def transform(self, posts):
return [{'length': len(text),
'num_sentences': text.count('.')}
for text in posts] class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
"""Extract the subject & body from a usenet post in a single pass. Takes a sequence of strings and produces a dict of sequences. Keys are
`subject` and `body`.
"""
def fit(self, x, y=None):
return self def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub return features pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()), # Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[ # Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])), # Pipeline for standard bag-of-words model for body
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])), # Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])), ], # weight components in FeatureUnion
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
)), # Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
]) # limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
subset='train',
categories=categories,
)
test = fetch_20newsgroups(random_state=1,
subset='test',
categories=categories,
) pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print(classification_report(y, test.target))

[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合的更多相关文章

  1. [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

    [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

  2. scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类 (python代码)

    scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类数据集 fetch_20newsgroups #-*- coding: UTF-8 -*- import ...

  3. Scikit Learn: 在python中机器学习

    转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...

  4. Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一)

    原文:Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一) 拓展压缩包的使用方式详细介绍 1:将拓展包解压:ThinkPHP3.1.2_Extend.zip   --> 将其下的 \ ...

  5. (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探

    一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...

  6. (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探

    目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...

  7. Scikit Learn

    Scikit Learn Scikit-Learn简称sklearn,基于 Python 语言的,简单高效的数据挖掘和数据分析工具,建立在 NumPy,SciPy 和 matplotlib 上.

  8. [占位-未完成]scikit-learn一般实例之十二:用于RBF核的显式特征映射逼近

    It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for clas ...

  9. Linear Regression with Scikit Learn

    Before you read  This is a demo or practice about how to use Simple-Linear-Regression in scikit-lear ...

随机推荐

  1. 快速搭建springmvc+spring data jpa工程

    一.前言 这里简单讲述一下如何快速使用springmvc和spring data jpa搭建后台开发工程,并提供了一个简单的demo作为参考. 二.创建maven工程 http://www.cnblo ...

  2. html5 与视频

    1.视频支持格式. 有3种视频格式被浏览器广泛支持:.ogg,.mp4,.webm. Theora+Vorbis=.ogg  (Theora:视频编码器,Vorbis:音频编码器) H.264+$$$ ...

  3. Windows下Visual studio 2013 编译 Audacity

    编译的Audacity版本为2.1.2,由于实在windows下编译,其源代码可以从Github上取得 git clone https://github.com/audacity/audacity. ...

  4. 解决VS2008在win7找不到输入序列号的地方

    1.VS2008在Windows7 打开维护界面看不到可以输序列号的地方. 因为微软把他隐藏了. 2.我们可以借用工具把他显示出来 下载地址:http://www.zlsoft.com/techbbs ...

  5. 【转】为什么我们都理解错了HTTP中GET与POST的区别

    GET和POST是HTTP请求的两种基本方法,要说它们的区别,接触过WEB开发的人都能说出一二. 最直观的区别就是GET把参数包含在URL中,POST通过request body传递参数. 你可能自己 ...

  6. 小程序用户反馈 - HotApp小程序统计仿微信聊天用户反馈组件,开源

    用户反馈是小程序开发必要的一个功能,但是和自己核心业务没关系,主要是产品运营方便收集用户的对产品的反馈.HotApp推出了用户反馈的组件,方便大家直接集成使用 源码下载地址: https://gith ...

  7. React Native环境配置之Windows版本搭建

    接近年底了,回想这一年都做了啥,学习了啥,然后突然发现,这一年买了不少书,看是看了,就没有完整看完的.悲催. 然后,最近项目也不是很紧了,所以抽空学习了H5.自学啃书还是很无趣的,虽然Head Fir ...

  8. C语言可以开发哪些项目?

    C语言是我们大多数人的编程入门语言,对其也再熟悉不过了,不过很多初学者在学习的过程中难免会出现迷茫,比如:不知道C语言可以开发哪些项目,可以应用在哪些实际的开发中--,这些迷茫也导致了我们在学习的过程 ...

  9. 【转】组件化的Web王国

    本文由 埃姆杰 翻译.未经许可,禁止转载!英文出处:Future Insights. 内容提要 使用许多独立组件构建应用程序的想法并不新鲜.Web Component的出现,是重新回顾基于组件的应用程 ...

  10. (转载)linux下各个文件夹的作用

    linux下的文件结构,看看每个文件夹都是干吗用的/bin 二进制可执行命令 /dev 设备特殊文件 /etc 系统管理和配置文件 /etc/rc.d 启动的配置文件和脚本 /home 用户主目录的基 ...