scikit-learn: what CountVectorizer actually does when it extracts tf

from: https://blog.csdn.net/mmc2015/article/details/46866537

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

class sklearn.feature_extraction.text.CountVectorizer(
    input=u'content',
    encoding=u'utf-8',
    decode_error=u'strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    stop_words=None,
    token_pattern=u'(?u)\b\w\w+\b',
    ngram_range=(1, 1),
    analyzer=u'word',
    max_df=1.0,
    min_df=1,
    max_features=None,
    vocabulary=None,
    binary=False,
    dtype=<type 'numpy.int64'>)

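Purpose: Convert a collection of text documents to a matrix of token counts (that is, count how often each term occurs in each document, the tf). The result is stored as a sparse matrix; the documentation for this version names scipy.sparse.coo_matrix, while the documentation of recent releases names scipy.sparse.csr_matrix.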
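As a minimal sketch of what that means in practice, the snippet below fits a default CountVectorizer on a two-document toy corpus (the corpus is made up for illustration) and prints the learned columns and the count matrix. Reading the column order from `vocabulary_` is a choice made here so the example does not depend on a particular scikit-learn release.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for this illustration.
docs = ["the cat sat", "the cat sat on the mat"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort the terms by that index.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)             # column order of the matrix
print(sparse.issparse(X))   # the result is a SciPy sparse matrix
print(X.toarray())          # dense view, only sensible for tiny examples
```

Each row is a document and each column a term, so the second row counts "the" twice.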

A look at the parameters shows what CountVectorizer does when it extracts tf:

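strip_accents : {‘ascii’, ‘unicode’, None}: whether to remove accents (diacritics, e.g. turning "é" into "e") during preprocessing. ‘ascii’ is fast but only handles characters with a direct ASCII mapping; ‘unicode’ is slower and works on any character; None (the default) leaves accents alone. Not sure what an accent is? See: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==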
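To make the effect of strip_accents concrete, here is a small sketch comparing the default with strip_accents='ascii' on two made-up one-word documents. The accented character is written as an escape so the example does not depend on how the file is encoded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "caf\u00e9" is "cafe" with an acute accent on the last letter.
docs = ["caf\u00e9", "cafe"]

vec_plain = CountVectorizer()                       # strip_accents=None
vec_ascii = CountVectorizer(strip_accents="ascii")  # fold accents to ASCII

X_plain = vec_plain.fit_transform(docs)
X_ascii = vec_ascii.fit_transform(docs)

print(sorted(vec_plain.vocabulary_))  # two distinct terms
print(sorted(vec_ascii.vocabulary_))  # both spellings collapse into one term
print(X_ascii.toarray())
```

Without accent stripping the two spellings end up in separate columns; with it they share one.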

lowercase : boolean, True by default: convert all characters to lowercase before tokenizing. This is usually left at True.

preprocessor : callable or None (default): overrides the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own function here.

tokenizer : callable or None (default): overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. You can supply your own function here as well.
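A rough sketch of plugging in both hooks follows. The two helper functions, `strip_digits` and `whitespace_tokenizer`, are hypothetical names invented for this example; token_pattern=None is passed only because the pattern is unused once a tokenizer is supplied.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def strip_digits(text):
    """Hypothetical preprocessor: lowercase, then replace digit runs by a space."""
    return re.sub(r"\d+", " ", text.lower())


def whitespace_tokenizer(text):
    """Hypothetical tokenizer: split on whitespace, keeping one-letter tokens."""
    return text.split()


docs = ["A B2B deal", "a b c"]  # made-up corpus

vec = CountVectorizer(
    preprocessor=strip_digits,       # replaces the built-in lowercase/accent step
    tokenizer=whitespace_tokenizer,  # replaces the token_pattern regex
    token_pattern=None,              # unused when a tokenizer is given
)
X = vec.fit_transform(docs)

features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)
print(X.toarray())
```

Because the custom preprocessor replaces the built-in one, it has to do the lowercasing itself; "B2B" becomes two separate "b" tokens here.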

stop_words : string {‘english’}, list, or None (default): if ‘english’, a built-in stop word list for English is used. If a list, every stop word in that list is removed from the resulting tokens. If None, no stop words are removed; in that case max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on the intra corpus document frequency (df) of terms. Adjust this parameter to your own needs.
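The two explicit stop_words options can be sketched like this, on a made-up corpus. The custom list ["fox", "dog"] is arbitrary and chosen only to show that any word can be treated as a stop word.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog"]  # made-up corpus

# Built-in English list: common function words such as "the" are dropped.
vec_en = CountVectorizer(stop_words="english")
vec_en.fit(docs)
print(sorted(vec_en.vocabulary_))

# Custom list: only the listed words are dropped, "the" stays.
vec_custom = CountVectorizer(stop_words=["fox", "dog"])
vec_custom.fit(docs)
print(sorted(vec_custom.vocabulary_))
```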

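token_pattern : string: a regular expression that defines what counts as a token. The default selects tokens of 2 or more alphanumeric characters, so single-character tokens are dropped. Only used when analyzer is set to ‘word’.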
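A quick way to see the two-character minimum is to compare the default pattern with one that also accepts single characters. The alternative pattern below is an assumption made for this sketch, not something the original post recommends.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am a C programmer"]  # made-up sentence

# Default pattern: tokens need at least two word characters.
vec_default = CountVectorizer()
vec_default.fit(docs)
print(sorted(vec_default.vocabulary_))

# Relaxed pattern (an illustrative choice): one word character is enough.
vec_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vec_single.fit(docs)
print(sorted(vec_single.vocabulary_))
```

With the default, "I", "a" and "C" silently disappear, which matters for corpora where one-letter tokens carry meaning.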

ngram_range : tuple (min_n, max_n): the lower and upper bounds on n. The default is ngram_range=(1, 1). Every n-gram feature with min_n <= n <= max_n is extracted. Adjust this parameter to your own needs.
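For instance, ngram_range=(1, 2) yields both unigrams and bigrams. The sketch below uses a single made-up document to show which features come out.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["new york city"]  # made-up document

# Extract every word 1-gram and 2-gram.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)
print(X.toarray())
```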

analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable: whether the features are word n-grams or character n-grams. If a callable is passed, it is your own function for extracting the sequence of features from the raw, unprocessed input.
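The difference between ‘char’ and ‘char_wb’ is easiest to see on a tiny input. This sketch extracts character bigrams from one made-up document with both analyzers; ‘char_wb’ builds n-grams only inside word boundaries and pads each word with a space.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ab cd"]  # made-up document

# 'char': slide a window over the whole string, spaces included.
vec_char = CountVectorizer(analyzer="char", ngram_range=(2, 2))
vec_char.fit(docs)
char_features = set(vec_char.vocabulary_)

# 'char_wb': slide the window over each space-padded word separately.
vec_wb = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
vec_wb.fit(docs)
wb_features = set(vec_wb.vocabulary_)

print(sorted(char_features))
print(sorted(wb_features))
```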

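max_df : float in range [0.0, 1.0] or int, default=1.0: ignore terms whose document frequency (df) is strictly higher than this threshold.

min_df : float in range [0.0, 1.0] or int, default=1: ignore terms whose df is strictly lower than this threshold. For both parameters a float is read as a proportion of documents and an int as an absolute document count. Both only take effect when the vocabulary parameter is None.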
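A small sketch of the two document-frequency cut-offs, on a made-up three-document corpus in which "apple" appears everywhere and "cherry" and "date" appear once each. The threshold values are arbitrary picks for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
]  # df: apple=3, banana=2, cherry=1, date=1

# min_df=2 (absolute count): drop terms seen in fewer than 2 documents.
vec_min = CountVectorizer(min_df=2)
vec_min.fit(docs)
print(sorted(vec_min.vocabulary_))

# max_df=0.9 (proportion): drop terms seen in more than 90% of documents.
vec_max = CountVectorizer(max_df=0.9)
vec_max.fit(docs)
print(sorted(vec_max.vocabulary_))

# Both at once keep only the mid-frequency term.
vec_both = CountVectorizer(min_df=2, max_df=0.9)
vec_both.fit(docs)
print(sorted(vec_both.vocabulary_))
```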

max_features : int or None, default=None: keep only the top max_features terms, ordered by term frequency across the corpus. Only takes effect when the vocabulary parameter is None.
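As an illustration, the sketch below keeps the two most frequent terms of a made-up corpus. The counts are chosen so that there are no ties, since tie-breaking is not something the text above specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Corpus-wide counts: apple=3, banana=2, cherry=1 (made up, no ties).
docs = ["apple apple banana", "apple banana cherry"]

vec = CountVectorizer(max_features=2)
X = vec.fit_transform(docs)

features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)     # the least frequent term is gone
print(X.toarray())
```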

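vocabulary : Mapping or iterable, optional: a user-defined set of word tokens to use as features. If it is not None, tf is computed only for the terms in vocabulary. Leaving it at None is usually the safer choice.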
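A fixed vocabulary can be sketched as follows. The term list and documents are made up; terms outside the list are simply not counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat cat bird", "dog bird"]  # made-up corpus

# Only these two terms become columns, in the order given.
vec = CountVectorizer(vocabulary=["cat", "dog"])
X = vec.fit_transform(docs)

print(vec.vocabulary_)  # term -> column index
print(X.toarray())      # "bird" is ignored
```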

binary : boolean, default=False: if True, all non-zero counts are set to 1, so each value only records whether the term occurs or not. This is useful for discrete probabilistic models that model binary events rather than integer counts.
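The effect of binary=True is easy to check on a made-up document in which one word repeats:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam spam spam eggs"]  # made-up document

X_counts = CountVectorizer().fit_transform(docs)
X_binary = CountVectorizer(binary=True).fit_transform(docs)

# Columns are ordered alphabetically: eggs, spam.
print(X_counts.toarray())  # raw counts
print(X_binary.toarray())  # presence/absence only
```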

dtype : type, optional: type of the matrix returned by fit_transform() or transform().

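Conclusion:

Depending on its parameters, CountVectorizer does the following when extracting tf: strips accents, converts to lowercase, removes stop words, extracts every n-gram within ngram_range on a word basis (or a character basis, if you choose that analyzer), and drops the features filtered out by max_df, min_df and max_features. Accent stripping and stop word removal are off by default. You can also choose to make the tf values binary.

With that, you can judge whether CountVectorizer's output is really what you wanted. Haha.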

Finally, a look at two methods:

`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return term-document matrix.

fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters: raw_documents : iterable. An iterable which yields either str, unicode or file objects.
Returns: self

fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters: raw_documents : iterable. An iterable which yields either str, unicode or file objects.
Returns: X : array, [n_samples, n_features]. Document-term matrix.
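The relationship between the two methods can be sketched on a made-up corpus: fit followed by transform gives the same matrix as fit_transform, and a fitted vectorizer can then be applied to new text, where words it has never seen are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["red green", "green blue"]  # made-up training corpus

vec = CountVectorizer()
fitted = vec.fit(train)          # learns the vocabulary, returns the vectorizer
X_train = vec.transform(train)   # counts against the learned vocabulary

# One-step equivalent on a fresh vectorizer.
X_both = CountVectorizer().fit_transform(train)

# New text: "yellow" was not in the training vocabulary, so it is dropped.
X_new = vec.transform(["green yellow green"])

print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
print(X_train.toarray())
print(X_new.toarray())
```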
