[Placeholder - unfinished] scikit-learn General Examples, Part 11: Feature Union with Heterogeneous Data Sources

Datasets often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

1. Your dataset consists of heterogeneous data types (e.g. raster images and text captions).
2. Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.pipeline.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and the body in separate pipelines, as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.
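To see what FeatureUnion does in isolation, here is a minimal sketch on a tiny numeric array. The two toy transformers (`identity` and `square`) and their weights are assumptions chosen for illustration and are not part of the original example. Each transformer's output is multiplied by its weight, and the results are concatenated column-wise.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Two toy transformers (hypothetical, for illustration only):
# one passes the input through unchanged, the other squares it.
identity = FunctionTransformer(lambda X: X)
square = FunctionTransformer(lambda X: X ** 2)

union = FeatureUnion(
    transformer_list=[("identity", identity), ("square", square)],
    # Each block of output columns is scaled by its weight.
    transformer_weights={"identity": 1.0, "square": 0.5},
)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
combined = union.fit_transform(X)

# Two columns from each transformer, stacked side by side.
print(combined.shape)  # (2, 4)
print(combined)
```

The first two columns are the unchanged inputs; the last two are the squared inputs scaled by 0.5.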

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
# Note: recent scikit-learn releases no longer expose this public module
# path; the two helpers live in the private module
# sklearn.datasets._twenty_newsgroups instead.
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement __getitem__
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            # Headers and body are separated by the first blank line.
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features

pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

# Limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
# classification_report expects the true labels first, then the predictions.
print(classification_report(test.target, y))
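The selector-plus-sub-pipeline pattern used in the listing can also be tried without downloading the 20-newsgroups data. The following is a minimal sketch on a hand-written dict of lists; the `KeySelector` and `LengthStats` classes and the toy data are hypothetical stand-ins for `ItemSelector`, `TextStats` and the real posts, and `CountVectorizer` replaces `TfidfVectorizer` only to keep the output easy to read.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class KeySelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ItemSelector: return data[key]."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data):
        return data[self.key]


class LengthStats(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TextStats: one dict of stats per text."""

    def fit(self, X, y=None):
        return self

    def transform(self, texts):
        return [{"length": len(t), "num_sentences": t.count(".")}
                for t in texts]


# Toy data grouped by feature: each key maps to one value per sample.
data = {
    "subject": ["hello world", "world news"],
    "body": ["Hi. Bye.", "One sentence only."],
}

union = FeatureUnion([
    # Bag-of-words counts for the subject field.
    ("subject", Pipeline([
        ("select", KeySelector("subject")),
        ("bow", CountVectorizer()),
    ])),
    # Ad hoc statistics for the body field.
    ("body_stats", Pipeline([
        ("select", KeySelector("body")),
        ("stats", LengthStats()),
        ("vect", DictVectorizer()),
    ])),
])

features = union.fit_transform(data)

# 3 vocabulary columns (hello, news, world) + 2 statistics columns.
print(features.shape)  # (2, 5)
print(features.toarray())
```

Each sub-pipeline sees the whole dict, picks out its own field, and produces a block of columns; FeatureUnion then stacks the blocks side by side into a single sparse feature matrix.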
