sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程)

https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/investigating-nltk-tutorial/

算法测试后发现许多不准确,即偏见,负面评价更多。

In this tutorial, we discuss a few issues. The most major issue is that we have a fairly biased algorithm. You can test this yourself by commenting-out the shuffling of the documents, then training against the first 1900, and leaving the last 100 (all positive) reviews. Test, and you will find you have very poor accuracy.

Conversely, you can test against the first 100 data sets, all negative, and train against the following 1900. You will find very high accuracy here. This is a bad sign. It could mean a lot of things, and there are many options for us to fix it.

我们需要用新的数据集来建模

That said, the project I have in mind for us suggests we go ahead and use a different data set anyways, so we will do that. In the end, we will find this new data set still contains some bias, and that is that it picks up negative things more often. The reason for this is that negative reviews tend to be "more negative" than positive reviews are positive. Handling this can be done with some simple weighting, but it can also get complex fast. Maybe a tutorial for another day. For now, we're going to just grab a new dataset, which we'll be discussing in the next tutorial.

不同数据集需要不同分类器,没有统一万能的分类器,为了Twitter建模情感分析,我们需要Twitter的训练数据。Twitter数据特点是文字更短。

So now it is time to train on a new data set. Our goal is to do Twitter sentiment, so we're hoping for a data set that is a bit shorter per positive and negative statement. It just so happens that I have a data set of 5300+ positive and 5300+ negative movie reviews, which are much shorter. These should give us a bit more accuracy from the larger training set, as well as be more fitting for tweets from Twitter.

下载文件的链接downloads for the short reviews

I have hosted both files here, you can find them by going to the downloads for the short reviews. Save these files as positive.txt and negative.txt.

Now, we can build our new data set in a very similar way as before. What needs to change?

We need a new methodology for creating our "documents" variable, and then we also need a new way to create the "all_words" variable. No problem, really, here's how I did it:

python风控评分卡建模和风控常识(博客主亲自录制视频教程)

nltk30_Investigating bias with NLTK的更多相关文章

  1. 【NLP】干货!Python NLTK结合stanford NLP工具包进行文本处理

    干货!详述Python NLTK下如何使用stanford NLP工具包 作者:白宁超 2016年11月6日19:28:43 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的 ...

  2. 【NLP】Python NLTK处理原始文本

    Python NLTK 处理原始文本 作者:白宁超 2016年11月8日22:45:44 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公开 ...

  3. 【NLP】Python NLTK获取文本语料和词汇资源

    Python NLTK 获取文本语料和词汇资源 作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...

  4. 【NLP】Python NLTK 走进大秦帝国

    Python NLTK 走进大秦帝国 作者:白宁超 2016年10月17日18:54:10 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公 ...

  5. 运行nltk示例 Resource u'tokenizers punkt english.pickle' not found解决

    nltk安装完毕后,编写如下示例程序并运行,报Resource u'tokenizers/punkt/english.pickle' not found错误 import nltk sentence ...

  6. Python文本处理nltk基础

    自然语言处理 -->计算机数据 ,计算机可以处理vector,matrix 向量矩阵. NLTK 自然语言处理库,自带语料,词性分析,分类,分词等功能. 简单版的wrapper,比如textbl ...

  7. 自然语言27_Converting words to Features with NLTK

    https://www.pythonprogramming.net/words-as-features-nltk-tutorial/ Converting words to Features with ...

  8. python 安装nltk,使用(英文分词处理,词干化等)(Green VPN)

    安装pip命令之后: sudo pip install -U pyyaml nltk import nltk nltk.download() 等待ing 目前访问不了,故使用Green VPN htt ...

  9. Error=Bias+Variance

    首先 Error = Bias + Variance Error反映的是整个模型的准确度,Bias反映的是模型在样本上的输出与真实值之间的误差,即模型本身的精准度,Variance反映的是模型每一次输 ...

随机推荐

  1. Beta周王者荣耀交流协会第五次Scrum会议

    1. 立会照片 成员王超,高远博,冉华,王磊,王玉玲,任思佳,袁玥全部到齐. master:王磊 2. 时间跨度 2017年11月14日 19:00 — 19:50 ,总计50分钟. 3. 地点 一食 ...

  2. Javascript实现大整数加法

    记得之前面试还被问到过用两个字符串实现两个大整数相加,当时还特别好奇好好的整数相加,为什么要用字符串去执行.哈哈,感觉当时自己还是很无知的,面试官肯定特别的无奈.今天在刷算法的时候,无意中看到了为什么 ...

  3. vs2013+python+ cocos2d-x-3.3rc0环境搭建

    1.vs2013安装一路next,安装即可,时间1~2个小时 2.解压cocos2d-x-3.3rc0   build文件夹里会有名为  cocos2d-win32.vc2012的sln文件  打开  ...

  4. binding(转)

    1,Data Binding在WPF中的地位 程序的本质是数据+算法.数据会在存储.逻辑和界面三层之间流通,所以站在数据的角度上来看,这三层都很重要.但算法在3层中的分布是不均匀的,对于一个3层结构的 ...

  5. c++ imooc自学计划

    一.视频学习相关的课程列表: C++远征之起航篇http://www.imooc.com/learn/342: C++远征之离港篇http://www.imooc.com/learn/381: C++ ...

  6. 关于javascript异步编程的理解

    在开发手机app的时候,要使用ajax想向后台发送数据.然后遇到了一个bug,通过这个bug,理解了ajax异步请求的工作原理.下面是登录页面的源代码. <!DOCTYPE html> & ...

  7. 利用CNN进行多分类的文档分类

    # coding: utf-8 import tensorflow as tf class TCNNConfig(object): """CNN配置参数"&qu ...

  8. js作用域相关笔记

    1.js引擎.编译器.作用域. 引擎:负责JS全过程的编译和执行: 编译器:负责语法分析和代码生成: 作用域:负责收集并维护声明组成的查询,以及当前执行代码对这些变量的访问权限(简言之,作用域就是用于 ...

  9. sed ,awk , cut三剑客的区别

    sed: sed只能截取文件中以行的来截取数据,,(grep命令可以过滤到某一行) 例如: [root@localhost ~]# sed  -n  '2,3p'  /etc/passwd       ...

  10. 停止ipv6

    在Centos5.5默认的状态下,ipv6是被启用的.因为我们不使用ipv6,所以,可以停止ipv6,以最大限度地保证安全和快速.首先确认一下ipv6是不是处于被启动的状态.[root@sample ...