[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合

Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when:

  • 1.Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
  • 2.Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function import numpy as np from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key. The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e. >> len(data[key]) == n_samples Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample). ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc. >> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data) ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`. Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key def fit(self, x, y=None):
return self def transform(self, data_dict):
return data_dict[self.key] class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer""" def fit(self, x, y=None):
return self def transform(self, posts):
return [{'length': len(text),
'num_sentences': text.count('.')}
for text in posts] class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
"""Extract the subject & body from a usenet post in a single pass. Takes a sequence of strings and produces a dict of sequences. Keys are
`subject` and `body`.
"""
def fit(self, x, y=None):
return self def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub return features pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()), # Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[ # Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])), # Pipeline for standard bag-of-words model for body
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])), # Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])), ], # weight components in FeatureUnion
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
)), # Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
]) # limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
subset='train',
categories=categories,
)
test = fetch_20newsgroups(random_state=1,
subset='test',
categories=categories,
) pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print(classification_report(y, test.target))

[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合的更多相关文章

  1. [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

    [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较

  2. scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类 (python代码)

    scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类数据集 fetch_20newsgroups #-*- coding: UTF-8 -*- import ...

  3. Scikit Learn: 在python中机器学习

    转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...

  4. Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一)

    原文:Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一) 拓展压缩包的使用方式详细介绍 1:将拓展包解压:ThinkPHP3.1.2_Extend.zip   --> 将其下的 \ ...

  5. (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探

    一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...

  6. (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探

    目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...

  7. Scikit Learn

    Scikit Learn Scikit-Learn简称sklearn,基于 Python 语言的,简单高效的数据挖掘和数据分析工具,建立在 NumPy,SciPy 和 matplotlib 上.

  8. [占位-未完成]scikit-learn一般实例之十二:用于RBF核的显式特征映射逼近

    It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for clas ...

  9. Linear Regression with Scikit Learn

    Before you read  This is a demo or practice about how to use Simple-Linear-Regression in scikit-lear ...

随机推荐

  1. JavaScript自定义媒体播放器

    使用<audio>和<video>元素的play()和pause()方法,可以手工控制媒体文件的播放.组合使用属性.事件和这两个方法,很容易创建一个自定义的媒体播放器,如下面的 ...

  2. Redis链表实现

    链表在 Redis 中的应用非常广泛, 比如列表键的底层实现之一就是链表: 当一个列表键包含了数量比较多的元素, 又或者列表中包含的元素都是比较长的字符串时, Redis 就会使用链表作为列表键的底层 ...

  3. JQuery 选择器

    选择器是JQuery的根基,在JQuery中,对事件的处理,遍历DOM和AJAX操作都依赖于选择器.如果能够熟练地使用选择器,不仅能简化代码,而且还可以事半功倍. JQuery选择器的优势 1.简洁的 ...

  4. spring remoting源码分析--Hessian分析

    1. Caucho 1.1 概况 spring-remoting代码的情况如下: 本节近分析caucho模块. 1.2 分类 其中以hession为例,Hessian远程服务调用过程: Hessian ...

  5. 普通程序员如何转向AI方向

    眼下,人工智能已经成为越来越火的一个方向.普通程序员,如何转向人工智能方向,是知乎上的一个问题.本文是我对此问题的一个回答的归档版.相比原回答有所内容增加. 一. 目的 本文的目的是给出一个简单的,平 ...

  6. 破解SQLServer for Linux预览版的3.5GB内存限制 (RHEL篇)

    微软发布了SQLServer for Linux,但是安装竟然需要3.5GB内存,这让大部分云主机用户都没办法尝试这个新东西 这篇我将讲解如何破解这个内存限制 要看关键的可以直接跳到第6步,只需要替换 ...

  7. Lind.DDD.LindMQ的一些想法

    回到目录 很久就想写一套属于自己的消息队列组件,前段时候看了汤雪华同学的EQueue,感觉还是不错的,他也是看了rabbitMQ之后写的Equeue,在设计上与前者有类似的地方,而大叔这次准备写一个L ...

  8. LeetCode - Two Sum

    Two Sum 題目連結 官網題目說明: 解法: 從給定的一組值內找出第一組兩數相加剛好等於給定的目標值,暴力解很簡單(只會這樣= =),兩個迴圈,只要找到相加的值就跳出. /// <summa ...

  9. 《月之猎人 (Moon Hunters)》主角设计

    原文链接 游戏开发人员,你们好! 我是 Kitfox Games 工作室的总监 Tanya,我们的工作室位于加拿大的蒙特利尔,拥有六名员工. 我们 3 月份发布了<月之猎人>游戏的桌面版, ...

  10. hadoop 2.4 遇到的问题

    不管出什么问题,首先查看日志. 在启动过hadoop的前提下,打开浏览器,输入http://localhost:50070 点击Utilities下的logs,选择hadoop-root-datano ...