scikit-learn：4.2.3. Text feature extraction

http://scikit-learn.org/stable/modules/feature_extraction.html

Section 4.2 covers a lot of ground, so text feature extraction gets a post of its own.

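1. The bag of words representation

To turn raw data into fixed-length numerical feature vectors, scikit-learn provides utilities for three steps:

tokenizing: give each token (a character or a word; the granularity is up to you) an integer index id

counting: count the occurrences of each token in each document

normalizing: normalize and weight the importance of each token according to how often it occurs across samples/documents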

Once again, what counts as a feature and what counts as a sample:

each individual token occurrence frequency (normalized or not) is treated as a feature.
the vector of all the token frequencies for a given document is considered a multivariate sample.

Bag of Words or “Bag of n-grams” representation:

The general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors, while completely ignoring the relative position information of the words in the document.
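The tokenizing and counting steps can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the two documents and the helper `to_vector` are made up for the example.

```python
import re
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Tokenizing: split each document into lowercase word tokens.
tokenized = [re.findall(r"\w+", d.lower()) for d in docs]

# Give every distinct token an integer id (sorted so the ids are reproducible).
all_tokens = sorted({tok for doc in tokenized for tok in doc})
vocab = {tok: i for i, tok in enumerate(all_tokens)}

def to_vector(tokens):
    """Counting: one fixed-length row, one column per vocabulary token."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

vectors = [to_vector(tokens) for tokens in tokenized]

# Word order is discarded: a shuffled document yields the same vector.
shuffled = to_vector("mat the on sat cat the".split())
```

Both documents end up as vectors of the same length (the vocabulary size), and `shuffled` equals the first vector because only counts are kept.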

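2. Sparsity

Each document uses only a very small fraction of all the words in the corpus, so the feature vectors are sparse (most values are 0). To keep storage and computation manageable, scikit-learn uses Python's scipy.sparse package.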
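A small check of this, as a sketch: the four toy documents below are invented so that no word is shared, which makes the sparsity easy to see.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "cherry date", "elderberry fig", "grape honeydew"]
X = CountVectorizer().fit_transform(docs)

# fit_transform returns a SciPy sparse matrix that stores only non-zero counts.
is_sparse = sparse.issparse(X)

# Fraction of cells that are actually non-zero.
density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, X.nnz, density)
```

Here each row has 2 non-zero entries out of 8 columns; on a real corpus with tens of thousands of vocabulary terms the density is far lower.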

3. Common vectorizer usage

CountVectorizer implements both tokenizing and counting.

It has many parameters, but the defaults are reasonable and suit most cases. For details see: http://blog.csdn.net/mmc2015/article/details/46866537

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

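The example here shows how it is used:

http://blog.csdn.net/mmc2015/article/details/46857887

It covers fit_transform, transform, get_feature_names() (named get_feature_names_out() in scikit-learn 1.0 and later), ngram_range=(min, max), vocabulary_.get() and more.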
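A minimal sketch of those calls, using a two-document corpus made up for illustration. It reads the vocabulary through `vocabulary_` rather than the feature-name method so that it does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # learn the vocabulary, then count

# vocabulary_ maps each token to its column index.
document_col = vectorizer.vocabulary_.get("document")

# Tokens that were not seen during fit are silently ignored by transform.
unseen = vectorizer.transform(["Something completely new."]).toarray()

# ngram_range=(1, 2) keeps the unigrams and adds word bigrams.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
has_bigram = "second second" in bigram_vectorizer.vocabulary_

print(X.toarray())
```

Columns follow the alphabetical order of the tokens (document, first, is, second, the, this), so the repeated "second" shows up as a 2 in the second row.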

4. Tf-idf term weighting

This addresses the problem that some words (e.g. “the”, “a”, “is” in English) occur very often without being the words we care about.

The text.TfidfTransformer class implements this normalization:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
>>> transformer.idf_  # idf_ holds the weights learned by fit
array([ 1. ...,  2.25...,  1.84...])
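The numbers above can be reproduced by hand. The sketch below assumes the transformer's default settings (smoothed idf, `smooth_idf=True`, and L2 row normalization, `norm='l2'`); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1], [2, 0, 0], [3, 0, 0],
                   [4, 0, 0], [3, 2, 0], [3, 0, 2]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf

# Weight the raw counts by idf, then scale every row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))
```

The first term occurs in all six documents, so its idf is ln(7/7) + 1 = 1; the rarer second and third terms get ln(7/2) + 1 and ln(7/3) + 1.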

Another class, TfidfVectorizer, combines all the options of CountVectorizer and TfidfTransformer in a single model.
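As a quick sanity check of that equivalence, the sketch below compares the one-step and two-step routes on a small corpus invented for the example, with all parameters left at their defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: tokenize, count and apply tf-idf weighting.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: count first, then reweight the counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```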

For binary occurrence features, it works better to set the binary parameter of CountVectorizer.

Bernoulli Naive Bayes is then also a better fit as the estimator.
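A toy sketch of that combination follows. The four training documents and their spam/ham labels are invented for illustration; only `binary=True` and `BernoulliNB` come from the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

train_docs = [
    "cheap pills cheap offer",
    "meeting agenda attached",
    "cheap offer now",
    "agenda for the meeting",
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy labels)

# binary=True records presence/absence instead of occurrence counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(train_docs)
max_value = X.max()

clf = BernoulliNB().fit(X, train_labels)
pred = clf.predict(vectorizer.transform(["cheap cheap cheap offer"]))
print(max_value, pred)
```

Even though "cheap" appears twice in the first document, the largest stored value is 1, which matches the boolean variables that Bernoulli Naive Bayes models.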

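5. Decoding text files

Text is made of characters, but files are made of bytes, so for scikit-learn to work you first have to tell it the encoding of the file; CountVectorizer will then decode it automatically. The default encoding is UTF-8. The decoded character set is called Unicode. If the file you load is not encoded as UTF-8 and you did not set the encoding parameter, you will get a UnicodeDecodeError.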
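The failure and the fix can be reproduced with a single byte string. This is a sketch: the sample word is encoded as latin-1 on purpose so that it is not valid UTF-8.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "Gerüche" encoded as latin-1; the byte 0xfc is not valid UTF-8.
raw = ["Ger\xfcche".encode("latin-1")]

try:
    CountVectorizer().fit(raw)          # default encoding='utf-8'
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the real encoding lets the vectorizer decode the bytes itself.
vocab = CountVectorizer(encoding="latin-1").fit(raw).vocabulary_
print(failed, vocab)
```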

If decoding fails, try the following:

Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.

You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.

You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.

Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.

If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.

For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> # Decode each byte string with the encoding chardet guesses for it
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)

(Depending on the version of chardet, it might get the first one wrong.)

6. Applications and examples

The third example is especially worth a look.

In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:

Classification of text documents using sparse features

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:

Clustering text documents using k-means

Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):

Topics extraction with Non-Negative Matrix Factorization

7. Limitations of the bag of words representation

A collection of unigrams cannot capture misspellings (word, wprd, wrod), word derivations (word/words, arrive/arriving), or the order of and dependencies between words.

Use n-grams rather than unigrams alone.
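A short sketch of why this helps with word order: the two invented sentences below contain the same words, so unigram counts cannot tell them apart, while adding word bigrams can.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs).toarray()
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()

# Identical rows mean the representation cannot distinguish the documents.
same_with_unigrams = bool((unigrams[0] == unigrams[1]).all())
same_with_bigrams = bool((bigrams[0] == bigrams[1]).all())
print(same_with_unigrams, same_with_bigrams)
```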

In addition, the stemming approach described at http://blog.csdn.net/mmc2015/article/details/46730289 can also be used.

Here is an example, using char_wb:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> # In scikit-learn 1.0 and later this method is named get_feature_names_out()
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])

The next three sections were written as time allowed.

8. Vectorizing a large text corpus with the hashing trick

The vectorization scheme above is simple, but it relies on an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute). This causes several problems on large datasets, such as high memory use and slow access.

Combining the hashing trick of the sklearn.feature_extraction.FeatureHasher class with CountVectorizer solves these problems.

The result of that combination is HashingVectorizer.

HashingVectorizer is stateless, meaning that you don’t have to call fit on it (calling transform directly is enough):

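>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> # corpus is assumed to be the four-document example corpus from the scikit-learn documentation
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>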

The default n_features is 2**20 (roughly one million features). If memory is a concern, a somewhat smaller value such as 2**18 can be used without causing too many collisions.

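HashingVectorizer has two drawbacks to keep in mind:

1) It does not provide IDF weighting, because it is stateless. If needed, a TfidfTransformer can be appended to it in a pipeline.

2) It does not provide an inverse_transform method, because hashing is one-way. That is, the original string features are no longer accessible, only their integer indices.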
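The workaround for the first drawback might look like the sketch below. The corpus is made up, and the choices `norm=None` and `alternate_sign=False` are assumptions made here so that the hashed values stay plain non-negative counts before the tf-idf step; they are not prescribed by the text.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

hashing_tfidf = make_pipeline(
    # Stateless step: hash tokens straight into 2**18 columns.
    HashingVectorizer(n_features=2 ** 18, norm=None, alternate_sign=False),
    # Stateful step: learns idf weights from the hashed counts.
    TfidfTransformer(),
)
X = hashing_tfidf.fit_transform(corpus)
print(X.shape)
```

Note that appending TfidfTransformer makes the pipeline as a whole stateful again, since the idf weights have to be fitted.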

9. Performing out-of-core scaling with HashingVectorizer

HashingVectorizer also has an advantage: it enables out-of-core learning, which is very useful for datasets that do not fit in memory.

The strategy is to fit on mini-batches: each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator always has the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch.

An example for reference: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py
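In miniature, the mini-batch strategy could be sketched as follows. The generator `iter_minibatches` is a hypothetical stand-in for code that streams batches from disk, the toy texts and labels are invented, and SGDClassifier is just one estimator that supports `partial_fit`.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def iter_minibatches():
    """Hypothetical reader: yields (texts, labels) one small batch at a time."""
    yield ["cheap pills offer", "meeting agenda notes"], [1, 0]
    yield ["cheap offer today", "agenda for the meeting"], [1, 0]

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])   # partial_fit must know every class up front

for texts, labels in iter_minibatches():
    X_batch = vectorizer.transform(texts)   # stateless, so no fit is needed
    clf.partial_fit(X_batch, labels, classes=all_classes)

n_inputs = clf.coef_.shape[1]
prediction = clf.predict(vectorizer.transform(["cheap pills"]))
print(n_inputs, prediction.shape)
```

Only one batch is held in memory at a time, and every batch maps to the same 2**18 columns, so the classifier's input dimensionality never changes.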

10. Customizing the vectorizer classes

Customizing a vectorizer mostly comes down to how tokens are extracted:

>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True

The following is quoted from the scikit-learn documentation:

In particular we name:

preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.

tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.

analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
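As a sketch of the preprocessor hook, the hypothetical `strip_tags` function below removes anything that looks like an HTML tag. It also lowercases the text itself, because a custom preprocessor replaces the built-in preprocessing step.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_tags(doc):
    """Hypothetical preprocessor: drop HTML-like tags, then lowercase."""
    return re.sub(r"<[^>]+>", " ", doc).lower()

vectorizer = CountVectorizer(preprocessor=strip_tags)

# build_analyzer() chains the preprocessor with the default tokenizer.
tokens = vectorizer.build_analyzer()("<p>Hello <b>World</b></p>")
print(tokens)
```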

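To make the preprocessor, tokenizer and analyzer work with the vectorizer's own parameters, it is better to subclass the vectorizer and override the build_preprocessor, build_tokenizer and build_analyzer factory methods than to simply pass in custom functions. Some tips: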

If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split
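A minimal sketch of this tip, with two invented documents in which an external tool is imagined to have already joined "New_York" into one token:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tokens are already separated by whitespace; nothing else should be split.
pre_tokenized = ["New_York is big", "I love New_York"]

vectorizer = CountVectorizer(analyzer=str.split)
X = vectorizer.fit_transform(pre_tokenized)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Because the analyzer is replaced wholesale, no lowercasing or token-pattern filtering happens: the one-letter token "I" and the capitalized "New_York" are kept exactly as they were written.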

Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:

>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         # Tokenize with NLTK, then reduce every token to its lemma
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())

Because Chinese is not delimited by spaces, a custom vectorizer is essential for Chinese text.
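The sketch below shows the problem and one dependency-free workaround, character n-grams. The two sentences are invented, and character n-grams are a swapped-in substitute here: a proper word segmenter passed as the tokenizer would be the more usual choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["我爱自然语言处理", "自然语言很有趣"]

# The default word analyzer finds no separators, so a sentence is one token.
default_tokens = CountVectorizer().build_analyzer()(docs[0])

# Character unigrams and bigrams need no word segmentation at all.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
has_bigram = "自然" in vectorizer.vocabulary_
print(default_tokens, X.shape, has_bigram)
```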

That completes text feature extraction.
