Importing RSS feeds in the naive Bayes classifier example from Machine Learning in Action (《机器学习实战》)
I got stuck at this point while following the book's code; since other readers will probably run into the same problems, I'm writing the fixes down to share.
A quick rant first: I suspect the main reason most readers get stuck here is the great GFW, so anyone who can already climb over the wall, whether by software or in person, will likely never see this post. If you don't feel like reading on, feel free to exercise your wall-climbing skills instead.
How do I install feedparser?
Installing feedparser from the URL given in the book fails with an error saying setuptools is missing. The official advice is that on Windows setuptools is best installed via ez_setup.py, but I could not download ez_setup.py from the official site. This post provides a workaround: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/
Copy the file straight into the C:\python27 folder and run: python ez_setup.py install
Then change into the folder holding the feedparser installation files and run: python setup.py install
A quick way to verify the install is sketched below.
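If the install succeeded, importing the module should just work. A minimal check from the Python 2 interpreter (the exact version string you see depends on which feedparser release you installed):

>>> import feedparser                # raises ImportError if the install failed
>>> print feedparser.__version__     # e.g. '5.1.3'; your version may differ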
What if the feed the author uses, http://newyork.craigslist.org/stp/index.rss, is unreachable?
In the book, articles from the feed http://newyork.craigslist.org/stp/index.rss serve as the class-1 documents and articles from http://sfbay.craigslist.org/stp/index.rss as the class-0 documents.
To get the example code running, any two working RSS feeds can stand in for them.
I used these two:
NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss
Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml
In other words, if the algorithm runs correctly, every article from NASA will be classified as 1 and every Houston Rockets article from Yahoo Sports as 0. A quick sanity check of the two feeds is sketched below.
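A minimal sketch for confirming that the replacement feeds parse and actually contain entries before training on them. Feed contents change over time, so the counts and text you see will differ; the code assumes each entry carries a 'summary' field, which is what localWords() reads later:

import feedparser

ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')   # class 1
sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')      # class 0
print 'class 1 feed entries:', len(ny['entries'])
print 'class 0 feed entries:', len(sf['entries'])
print ny['entries'][0]['summary'][:80]   # peek at the text the classifier will see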
With self-chosen RSS feeds, the program throws an error when it reaches trainNB0(array(trainMat),array(trainClasses)). What now?
Judging from the book's example, the author's feeds carry a lot of articles — len(ny['entries']) is 100 — whereas the feeds I found hold only around 10-20 entries each.
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100
Because each of the author's feeds has 100 articles, he can afford to strip the 30 highest-frequency words as "stop words" and still randomly hold out 20 articles as a test set. With a replacement feed of only 10 articles, taking 20 of them for testing obviously fails. Simply shrink the test set and the code will run; and if the documents contain few words, removing fewer of the high-frequency "stop words" improves the algorithm's accuracy. A sketch of a holdout that adapts to the feed size follows.
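One way to avoid the hard-coded holdout of 20 (or 5) documents is to derive the test-set size from however many documents the feeds actually provide. This is a minimal sketch under my own assumptions, not code from the book; the function name and the 20% ratio are mine:

from numpy import random

def splitTrainTest(numDocs, holdoutRatio=0.2):
    """Hold out a fraction of the documents for testing, but always at least one."""
    trainingSet = range(numDocs)      # Python 2: range() returns a plain list
    testSet = []
    numTest = max(1, int(holdoutRatio * numDocs))
    for i in range(numTest):
        # pick a random remaining index and move it to the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    return trainingSet, testSet

With the two feeds above, minLen is around 10, so 2*minLen = 20 documents are available and this holds out 4 of them instead of the book's fixed 20.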
How do I remove "stop words" without dropping the 30 most frequent words?
You can keep the stop words in a txt file and read them in at runtime (in place of the code that removes the high-frequency words). For a reference list of which words to stop, see http://www.ranks.nl/stopwords
For the code below to run, the stop words must be saved to stopword.txt.
My file holds the following words, and the results were decent:
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
The complete modified code follows; save it as bayes.py (it is Python 2, like the book's code):

'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                           # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)      # union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)   # change to ones() for smoothing
    p0Denom = 2.0; p1Denom = 2.0                     # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)    # change to log() to avoid underflow
    p0Vect = log(p0Num/p0Denom)    # change to log()
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts, listClasses = loadDataSet()
    print 'listOPost:\n', listOPosts
    print 'listClasses:\n', listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:', trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    print 'p0V:\n', p0V
    print 'p1V:\n', p1V
    print 'pAb:\n', pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W+', bigString)   # \W+ avoids the empty tokens that \W* produces
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)     # create vocabulary
    trainingSet = range(50); testSet = []    # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:             # train the classifier (get probs) via trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                 # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)
    #return vocabList, fullText

def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()    # word list from http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W+', wordList)
    listOfTokens = [tok.lower() for tok in listOfTokens if len(tok) > 0]
    print 'read stop words from \'stopword.txt\':', listOfTokens
    return listOfTokens

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[', i, ']\'s summary - ', 'parse text:\n', wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)      # feed1 (NASA here) is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[', i, ']\'s summary - ', 'parse text:\n', wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)     # create vocabulary
    print '\nVocabList is ', vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ', stopWord
    ## the book's top-30 high-frequency removal, replaced by the stop-word list above:
    ## top30Words = calcMostFreq(vocabList, fullText)
    ## print '\nTop 30 words: ', top30Words
    ## for pairW in top30Words:
    ##     if pairW[0] in vocabList:
    ##         vocabList.remove(pairW[0])
    ##         print '\nRemoved: ', pairW[0]
    trainingSet = range(2*minLen); testSet = []    # create test set
    print '\n\nBegin to create a test set: \ntrainingSet:', trainingSet, '\ntestSet', testSet
    for i in range(5):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:', trainingSet, '\ntestSet', testSet
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) via trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:', len(trainMat)
    print '\ntrainClasses', trainClasses
    print '\n\ntrainNB0:'
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    #print '\np0V:', p0V, '\np1V', p1V, '\npSpam', pSpam
    errorCount = 0
    for docIndex in testSet:        # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector), p0V, p1V, pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n', docList[docIndex], '\nis classified as: ', classifiedClass, ', while the original class is: ', originalClass, '. --', not result
    print '\nthe error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V

def testRSS():
    import feedparser
    ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList, pSF, pNY = localWords(ny, sf)

def testTopWords():
    import feedparser
    ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny, sf)

def getTopWords(ny, sf):    # variable names NY/SF kept from the book's craigslist example
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts, listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    print 'p0V:\n', p0V
    print 'p1V:\n', p1V
    print 'pAb:\n', pAb
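With bayes.py on the path, the RSS example can then be run from the interpreter like this (a sketch; the printed output depends on whatever the live feeds contain at the time):

>>> import bayes
>>> bayes.testRSS()        # train on the two feeds and print the error rate
>>> bayes.testTopWords()   # print the most characteristic words for each feed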