nltk30_Investigating bias with NLTK
sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程)
https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/investigating-nltk-tutorial/
算法测试后发现许多不准确,即偏见,负面评价更多。
In this tutorial, we discuss a few issues. The most major issue is that we have a fairly biased algorithm. You can test this yourself by commenting-out the shuffling of the documents, then training against the first 1900, and leaving the last 100 (all positive) reviews. Test, and you will find you have very poor accuracy.
Conversely, you can test against the first 100 data sets, all negative, and train against the following 1900. You will find very high accuracy here. This is a bad sign. It could mean a lot of things, and there are many options for us to fix it.
我们需要用新的数据集来建模
That said, the project I have in mind for us suggests we go ahead and use a different data set anyways, so we will do that. In the end, we will find this new data set still contains some bias, and that is that it picks up negative things more often. The reason for this is that negative reviews tend to be "more negative" than positive reviews are positive. Handling this can be done with some simple weighting, but it can also get complex fast. Maybe a tutorial for another day. For now, we're going to just grab a new dataset, which we'll be discussing in the next tutorial.
不同数据集需要不同分类器,没有统一万能的分类器,为了Twitter建模情感分析,我们需要Twitter的训练数据。Twitter数据特点是文字更短。
So now it is time to train on a new data set. Our goal is to do Twitter sentiment, so we're hoping for a data set that is a bit shorter per positive and negative statement. It just so happens that I have a data set of 5300+ positive and 5300+ negative movie reviews, which are much shorter. These should give us a bit more accuracy from the larger training set, as well as be more fitting for tweets from Twitter.
下载文件的链接downloads for the short reviews
I have hosted both files here, you can find them by going to the downloads for the short reviews. Save these files as positive.txt and negative.txt.
Now, we can build our new data set in a very similar way as before. What needs to change?
We need a new methodology for creating our "documents" variable, and then we also need a new way to create the "all_words" variable. No problem, really, here's how I did it:
python风控评分卡建模和风控常识(博客主亲自录制视频教程)
nltk30_Investigating bias with NLTK的更多相关文章
- 【NLP】干货!Python NLTK结合stanford NLP工具包进行文本处理
干货!详述Python NLTK下如何使用stanford NLP工具包 作者:白宁超 2016年11月6日19:28:43 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的 ...
- 【NLP】Python NLTK处理原始文本
Python NLTK 处理原始文本 作者:白宁超 2016年11月8日22:45:44 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公开 ...
- 【NLP】Python NLTK获取文本语料和词汇资源
Python NLTK 获取文本语料和词汇资源 作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...
- 【NLP】Python NLTK 走进大秦帝国
Python NLTK 走进大秦帝国 作者:白宁超 2016年10月17日18:54:10 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公 ...
- 运行nltk示例 Resource u'tokenizers punkt english.pickle' not found解决
nltk安装完毕后,编写如下示例程序并运行,报Resource u'tokenizers/punkt/english.pickle' not found错误 import nltk sentence ...
- Python文本处理nltk基础
自然语言处理 -->计算机数据 ,计算机可以处理vector,matrix 向量矩阵. NLTK 自然语言处理库,自带语料,词性分析,分类,分词等功能. 简单版的wrapper,比如textbl ...
- 自然语言27_Converting words to Features with NLTK
https://www.pythonprogramming.net/words-as-features-nltk-tutorial/ Converting words to Features with ...
- python 安装nltk,使用(英文分词处理,词干化等)(Green VPN)
安装pip命令之后: sudo pip install -U pyyaml nltk import nltk nltk.download() 等待ing 目前访问不了,故使用Green VPN htt ...
- Error=Bias+Variance
首先 Error = Bias + Variance Error反映的是整个模型的准确度,Bias反映的是模型在样本上的输出与真实值之间的误差,即模型本身的精准度,Variance反映的是模型每一次输 ...
随机推荐
- a标签的href为空的问题
在表格里写一个a标签链接刷新表格的时候,没注意,把a标签的href设置为""空字符串,导致每次刷新表格之后会再刷新一次整体页面,找了很久都没发现问题出在哪里,最后无意之间,鼠标在一 ...
- Scrum Meeting 10.29
成员 今日活动 明日计划 用时 徐越 配置tomcat+eclipse 将上届后端代码迁移到服务器 4h 赵庶宏 与数据库连接的java代码学习及编写,测试代码 进行数据库的建立并学习数据库方面的知识 ...
- Fifteen scrum meeting 2015-11-21
最近几日因为其他作业着实拖延了很久更新工程进度. 闫昊: 完成:学习讨论区开发 即将进行:讨论区代码开发 唐彬: 完成:学习学习进度部分开发 即将进行:学习进度功能开发 史烨轩: 完成:学习下载功能设 ...
- Failed to execute request because the App-Domain could not be created. Error: 0x8007000e 存储空间不足,无法完成此操作。
今天业务发现,服务器上的网站无法访问了. 页面报错: Server Application UnavailableThe web application you are attempting to a ...
- jquery-numberformatter插件
项目地址:https://code.google.com/p/jquery-numberformatter/ 非jquery版:https://github.com/andrewgp/jsNumber ...
- T4模板_入门
T4模板作为VS自带的一套代码生成器,功能有多强大我也不知道,最近查找了一些资料学习一下,做个笔记 更详细的资料参见: MSDN: http://msdn.microsoft.com/zh-cn/li ...
- 软工网络15团队作业4——敏捷冲刺日志的集合贴(Alpha阶段)
Alpha阶段 第 1 篇 Scrum 冲刺博客 第 2 篇 Scrum 冲刺博客 第 3 篇 Scrum 冲刺博客 第 4 篇 Scrum 冲刺博客 第 5 篇 Scrum 冲刺博客 第 6 篇 S ...
- Delphi CreateMutex 防止程序多次运行
windows是个多用户多任务的操作系统,支持多个程序同时运行,如果你的程序不想让用户同时运行一个以上, 那应该怎样做呢? 本文将介绍避免用户同时运行多个程序的例子. 需要用到的函数CreateMut ...
- SQL Server 中几个有用的特殊函数
在SQL Server 的使用过程中,发现几个很有用,但不太常用(或细节不太清楚)的函数(存储过程): isnumeric,isdate,patindex,newid,collate,sp_execu ...
- Cannot open the disk 'D:\win7-ie8\Windows 7 x64.vmdk' or one of the snapshot
使用机子过程中断电,开机后使用虚拟机提示[Cannot open the disk 'D:\win7-ie8\Windows 7 x64.vmdk' or one of the snapshot],找 ...
