[Placeholder, unfinished] scikit-learn General Examples, Part 11: Feature Union with Heterogeneous Data Sources

Datasets often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

  1. Your dataset consists of heterogeneous data types (e.g. raster images and text captions).
  2. Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.pipeline.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and the body in separate pipelines, plus ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.
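Before the full example, the weighted combination step can be sketched on a toy array. This is a minimal illustration, assuming a reasonably recent scikit-learn and NumPy are installed; the helper functions `take_first` and `take_second` and the toy matrix are assumptions made for this sketch and are not part of the original example.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def take_first(X):
    """Hypothetical helper: keep only column 0 (as a 2D array)."""
    return X[:, [0]]


def take_second(X):
    """Hypothetical helper: keep only column 1 (as a 2D array)."""
    return X[:, [1]]


# Toy data: two samples, two features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Each transformer sees the same input; the outputs are scaled by their
# weights and stacked side by side.
union = FeatureUnion(
    transformer_list=[
        ("first", FunctionTransformer(take_first)),
        ("second", FunctionTransformer(take_second)),
    ],
    transformer_weights={"first": 0.8, "second": 0.5},
)

combined = union.fit_transform(X)
# Column 0 is scaled by 0.8 and column 1 by 0.5.
print(combined)
```

The same pattern scales to the example in this post, where each entry of `transformer_list` is itself a Pipeline rather than a single function.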

The choice of features is not particularly helpful, but serves to illustrate the technique.

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
# Note: in recent scikit-learn releases this module is private and is named
# sklearn.datasets._twenty_newsgroups; adjust the two imports below if needed.
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # One dict per document; '.' count is a crude sentence estimate.
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]


class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            # Headers and body are separated by the first blank line.
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features


pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
# classification_report expects (y_true, y_pred) in that order.
print(classification_report(test.target, y))
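To see what the two custom extraction steps produce without downloading the dataset, the header/body split and the ad hoc statistics can be tried on a single made-up post. This is a plain-Python sketch that mirrors the logic of SubjectBodyExtractor and TextStats above; the function names `split_post` and `text_stats` and the sample post are assumptions for illustration, and the footer and quote stripping steps are left out.

```python
def split_post(text, prefix="Subject:"):
    """Return (subject, body) for one raw post (hypothetical helper)."""
    # Headers and body are separated by the first blank line.
    headers, _, body = text.partition("\n\n")
    subject = ""
    for line in headers.split("\n"):
        if line.startswith(prefix):
            # As in the example above, the text after the prefix is kept
            # verbatim, including any leading space.
            subject = line[len(prefix):]
            break
    return subject, body


def text_stats(text):
    """Return the same ad hoc features that TextStats computes per document."""
    return {"length": len(text), "num_sentences": text.count(".")}


# A made-up post for illustration only.
post = (
    "From: someone@example.com\n"
    "Subject: Re: hello\n"
    "\n"
    "First sentence. Second one."
)

subject, body = split_post(post)
stats = text_stats(body)

print(repr(subject))  # ' Re: hello'
print(stats)          # {'length': 27, 'num_sentences': 2}
```

A list of such dicts, one per post, is what DictVectorizer turns into a numeric feature matrix in the `body_stats` branch.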
