scikit-learn:4.2. Feature extraction(特征提取,不是特征选择)
http://scikit-learn.org/stable/modules/feature_extraction.html
带病在网吧里。
。。。。。
写。求支持。
。。
1、首先澄清两个概念:特征提取和特征选择(
Feature extraction is very different from Feature
selection
)。
the former consists in transforming arbitrary data, such as text or images, into numerical
features usable for machine learning. The latter is a machine learning technique applied on these features(从已经提取的特征中选择更好的特征).
以下分为四大部分来讲。主要还是4、text feature extraction
2、loading features form dicts
class DictVectorizer。举个样例就好:
>>> measurements = [
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
class DictVectorizer对于提取某个特定词汇附近的feature
windows很实用,比如增加我们通过一个已有的algorithm提取了word ‘sat’ 在句子‘The cat sat on the mat.’中的PoS(Part
of Speech)特征。例如以下:
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... # in a real application one would extract many such dictionaries
... ]
上面的PoS特征就能够vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for
normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
3、feature hashing
The class FeatureHasher is
a high-speed, low-memory vectorizer that uses a technique known as feature
hashing, or the “hashing trick”.
因为hash。所以仅仅保存feature的interger index。而不保存原来feature的string名字。所以没有inverse_transform方法。
FeatureHasher 接收dict对,即 (feature, value) 对,或者strings,由构造函数的參数input_type决定.结果是scipy.sparse matrix。假设是strings,则value默认取1,比如 ['feat1', 'feat2', 'feat2'] 被解释为[('feat1', 1), ('feat2', 2)].
4、text feature extraction
由于内容太多,分开写了。參考着篇博客:http://blog.csdn.net/mmc2015/article/details/46997379
5、image feature extraction
提取部分图片(Patch extraction):
The extract_patches_2d function从图片中提取小块,存储成two-dimensional
array, or three-dimensional with color information along the third axis. 使用reconstruct_from_patches_2d.
可以将全部的小块重构成原图:
>>> import numpy as np
>>> from sklearn.feature_extraction import image >>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0, 3, 6, 9],
[12, 15, 18, 21],
[24, 27, 30, 33],
[36, 39, 42, 45]]) >>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0, 3],
[12, 15]], [[15, 18],
[27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
重构方式例如以下:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
The PatchExtractor class和 extract_patches_2d,一样,仅仅只是能够同一时候接受多个图片作为输入:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
图片像素的连接(Connectivity graph of an image):
主要是依据像素的区别来推断图片的每两个像素点是否连接。
。。
。
。
The function img_to_graph returns
such a matrix from a 2D or 3D image. Similarly, grid_to_graph build
a connectivity matrix for images given the shape of these image.
头疼。。。。
碎觉。
。。
scikit-learn:4.2. Feature extraction(特征提取,不是特征选择)的更多相关文章
- Feature extraction - sklearn文本特征提取
http://blog.csdn.net/pipisorry/article/details/41957763 文本特征提取 词袋(Bag of Words)表征 文本分析是机器学习算法的主要应用领域 ...
- ufldl学习笔记和编程作业:Feature Extraction Using Convolution,Pooling(卷积和汇集特征提取)
ufldl学习笔记与编程作业:Feature Extraction Using Convolution,Pooling(卷积和池化抽取特征) ufldl出了新教程,感觉比之前的好,从基础讲起.系统清晰 ...
- (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探
一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...
- (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探
目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...
- Scikit Learn: 在python中机器学习
转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...
- A Survey of Shape Feature Extraction Techniques中文翻译
Yang, Mingqiang, Kidiyo Kpalma, and Joseph Ronsin. "A survey of shape feature extraction techni ...
- 《Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks》论文笔记
论文题目<Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Ne ...
- Software: MPEG-7 Feature Extraction Library
Software MPEG-7 Feature Extraction Library : This library is adapted from MPEG-7 XM Reference Softwa ...
- Feature Engineering versus Feature Extraction: Game On!
Feature Engineering versus Feature Extraction: Game On! "Feature engineering" is a fancy t ...
随机推荐
- offsetWidth clientWidth scrollWidth 三者之间的区别和联系
scrollWidth:对象的实际内容的宽度,不包边线宽度,会随对象中内容超过可视区后而变大. clientWidth:对象内容的可视区的宽度,不包滚动条等边线,会随对象显示大小的变化而改变. off ...
- 用Vue创建一个新的项目
vue的安装 Vue.js不支持IE8及以下版本.因为Vue.js使用了ECMAScript5特性,IE8显然不能模拟.Vue.js支持所有兼容ECMAScript5的浏览器. 在用Vue.js构建大 ...
- Linux 虚拟内存和物理内存的理解【转】
转自:http://www.cnblogs.com/dyllove98/archive/2013/06/12/3132940.html 首先,让我们看下虚拟内存: 第一层理解 1. 每 ...
- CodeForces 424D: ...(二分)
题意:给出一个n*m的矩阵,内有一些数字.当你从一个方格走到另一个方格时,按这两个方格数字的大小,有(升,平,降)三种费用.你需要在矩阵中找到边长大于2的一个矩形,使得按这个矩形顺时针行走一圈的费用, ...
- HDU 4405: Aeroplane chess
类型:概率DP 题意:一条直线下飞行棋,色子六个面等概率.同时存在一些飞机航线,到了某个点可以直接飞到后面的另一个点,可以连飞,保证一个点至多一条航线.求到达或者超过终点 所需要 掷色子的期望次数. ...
- WCF的学习之旅
一.WCF的简单介绍 Windows Communication Foundation(WCF)是由微软发展的一组数据通信的应用程序开发接口,可以翻译为Windows通讯接口,它是MS为SOA (S ...
- cookies-判断用户是否是第一次进入页面
在郑州美莱的活动项目中客户当时有提到该需求.尽管最后去掉了该需求,但是还是花了我不少时间研究,因为之前的项目我前端没有用到过cookies,吓死宝宝了 $(document).ready(functi ...
- Jmeter(四十九)_常用的性能测试监听器
概述 jmeter中提供了很多性能数据的监听器,我们通过监听器可以来分析性能瓶颈 本文以500线程的阶梯加压测试结果来描述图表. 常用监听器 1:Transactions per Second 监听动 ...
- 2017-11-07-noip模拟题
T1 数学老师的报复 矩阵快速幂模板,类似于菲波那切数列的矩阵 [1,1]*[A,1 B,0] #include <cstdio> #define LL long long inline ...
- CF623
AIM Tech Round (Div. 1) <br > 这真是一套极好的题目啊.....虽然我不会做 <br > 代码戳这里 <br > A.Graph and ...