from numpy import ones, log

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # prior probability of class 1
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # start counts at 1 (ones()) so that no word ends up with probability 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    # start denominators at 2.0 to match the smoothed counts
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take log() so that many small probabilities can be summed instead of multiplied
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # element-wise multiplication; see the note after the code
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
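
The three "change to" comments in trainNB0 mark where the counts are smoothed and the probabilities are moved into log space. A minimal sketch of why this matters, using made-up numbers (the probability 0.01, the document length 200, and the count of 10 are illustrative assumptions, not values from the training data):

```python
import math

# Hypothetical per-word probabilities for one long document (illustrative values).
probs = [0.01] * 200

# Multiplying many small probabilities underflows to 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Summing logarithms carries the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

# Without smoothing, a word never seen in a class gets probability 0,
# which zeroes the whole product; starting counts at 1 avoids that.
unsmoothed = 0 / 10
smoothed = (0 + 1) / (10 + 2)

print(product)               # 0.0 because of underflow
print(log_sum)               # a finite negative number
print(unsmoothed, smoothed)
```

Comparing log_sum values between classes gives the same ranking as comparing the products would, had they been representable.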

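*Note 1

p(Ci|w) = p(w|Ci) p(Ci) / p(w). The denominator p(w) is the same for every class, so only the numerator needs to be compared. Taking the natural logarithm of that product gives ln(p(w|Ci) p(Ci)) = ln(p(w|Ci)) + ln(p(Ci)).

In the example below every class makes up the same share of the training samples, so leaving out the log(p(Ci)) term does not change the classification result.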
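
A small sketch of that claim: with equal priors, adding ln(p(Ci)) shifts every class score by the same constant, so the winning class stays the same; with unequal priors it can change. The log-likelihood values and the skewed priors are made up for illustration, and argmax_class is a hypothetical helper, not part of the original code.

```python
import math

def argmax_class(log_likelihoods, priors=None):
    """Return the index of the class with the highest log score."""
    scores = []
    for k, ll in enumerate(log_likelihoods):
        score = ll
        if priors is not None:
            score += math.log(priors[k])
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values of ln p(w|Ci) for three classes (illustrative only).
log_likelihoods = [-12.3, -9.8, -15.1]

equal_priors = [1 / 3, 1 / 3, 1 / 3]
skewed_priors = [0.98, 0.01, 0.01]

without_prior = argmax_class(log_likelihoods)
with_equal = argmax_class(log_likelihoods, equal_priors)
with_skewed = argmax_class(log_likelihoods, skewed_priors)

print(without_prior, with_equal, with_skewed)  # prints 1 1 0
```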

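Here is a quick example in C# that classifies articles by type. Randomly chosen words work less well than targeted ones, so every word in the vocabulary below was picked from articles in the three categories.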

1. Build the vocabulary (word vector): 中超/亚冠/国足/足协/英超/西甲/欧冠/意甲/德甲/篮球/NBA/CBA/高尔夫/乒乓/排球/网球/羽毛球/跑步/赛车/棋牌/台球/游泳/马术/拳击/田径/功夫/扑克/体育/球队/球员/训练/国家队/联赛/俱乐部/场地/翻盘/绝杀/热身/队友/冠军/亚军/季军/犯规/赛季/加时/反超/半场/争夺/战术/阵容/比赛/德比/恢复/进球/失球/奥斯卡/娱乐/影迷/电影/电视/音乐/戏剧/视频/演员/导演/明星/经纪人/歌手/连续剧/展映/粉丝/写真/演技/作秀/节目/艺人/超模/女星/模特/男星/性感/主创/院线/影业/拍摄/编剧/情节/影像/剧情/主演/上映/票房/开机/剧集/表演/收视/预告片/主持人/艾美奖/角色/剧院/乐迷/影迷/演出/专辑/乐坛/剧场/文艺/芭蕾/戏曲/舞蹈/军事/军队/军机/炸弹/军方/坦克/军舰/炸死/军演/战备/部队/军区/国防/士兵/舰船/潜艇/飞机/直升机/舰队/保卫/演习/武器/反击/打击/阅兵/对抗/防卫/海军/空军/陆军/武装/战略/空袭/冲突/装甲/步兵/作战/导弹/边防/侦察/战斗机/雷达/轰炸/防御/据点/火力/航空母舰/进攻/弹药/军营/包围/攻占/俘虏/參战/战友/战斗/入侵
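
A minimal Python sketch of how a text can be turned into a 0/1 vector over such a vocabulary. It uses plain substring matching, which is also what the C# code's Contains call does, so no word segmentation is involved. The helper name text_to_vector and the sample sentence are made up for illustration, and only a small slice of the vocabulary is used.

```python
def text_to_vector(vocab, text):
    """Set-of-words model: 1 if the vocabulary term occurs in the text, else 0."""
    return [1 if term in text else 0 for term in vocab]

# A small slice of the vocabulary above; the sample sentence is made up.
vocab = "中超/亚冠/国足/篮球/NBA/电影/导演/军队/导弹".split("/")
sample = "国足在热身赛中获胜,NBA季后赛同日开打"

vector = text_to_vector(vocab, sample)
print("".join(map(str, vector)))  # prints 001010000
```

Each training article becomes one such row, which is the form of the document matrix.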

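2. Download 10 articles from Sohu for each of the three categories to form the training set, compute the document vector for each article, and label each article with its category.

Sample file name format: number_categoryLabel.txt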
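
A short sketch of reading the category label back out of a file name in that format. The function name label_from_filename and the example file names are hypothetical; the C# code does the equivalent by taking the single character after the last underscore.

```python
from pathlib import Path

def label_from_filename(name):
    """Extract the category label from a name shaped like '<number>_<label>.txt'."""
    stem = Path(name).stem              # drop the '.txt' extension
    _, _, label = stem.rpartition("_")  # keep what follows the last underscore
    return label

# Hypothetical file names following the format above.
names = ["01_1.txt", "17_3.txt", "08_2.txt"]
labels = "".join(label_from_filename(n) for n in names)
print(labels)  # prints 132
```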

Document matrix:

000000000000000000100000000000000000001100010001001010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000010000000000000100000000000000011110001010000000000000000011000000110000000000100000000000000000010000000000000000000000000000000000000000000000

000000000000000000000000000011000000000000000000000001001000001001000000001000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000001001000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000010000000000000000000000000000010010000100000000000000010010000001000000000100000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000010000000010000010000010100000000111111111110000000100000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000110000000000000011010000001000010000000000001100001110000000000000000000000000000000000000000000000000000000000000000000000000000

000000000100000000000000000000000000000000000000000000001010000110000000000000000100000001101000000100000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000010000010000000000000001000000001100000100000000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000001010000110000000000000000000001011000010000110000000000000000000000000000000000000000000000000000000000000000000

000000010000000000000000000011100000001000010110001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000

000000000000000000000000000000000000000000000000000000001001000100000000000000000000000010000100000100000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000110000000111111111111100000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000100000000000000000000000000000000000

000000000000000000000000000000000000000000000001000000000001000000000000000000000000100000000000000000000000000100100000010010000000000000000100000000000100000000000010

000000000000000000000000000000100000000000000000000000000001000000000000000000000000000000000000000000000000000100010000010000000000000000000100000100000000000000000000

000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000110010000000001001010000000010000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000010100000000000100000000010000000000000001000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000010000000000100000000100010000000000001000000000

000000010000000000000000000111001100000000010000001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000100000000000000100000000110000010000000000000

110000000000000000000000000100001000100000010000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000

110000000000000000000000000111001100100100010001111011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000

000000000001000000000000000001000000101000100110001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000

000000000000010000000000000000000000001101000001001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

000000000010000000000000000010000000000000010010001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

000000000001000000000000000111100000101000110100001000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000

000000000000000000000000000100000000000110010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Category label vector:

122222222212333333333131111111
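
The class priors p(Ci) can be read straight off this label vector. A minimal sketch that counts the labels and checks whether the classes are balanced, which is the condition under which the log(p(Ci)) term can be dropped (the variable names are illustrative):

```python
from collections import Counter

labels = "122222222212333333333131111111"  # label vector from the text

counts = Counter(labels)
priors = {label: n / len(labels) for label, n in counts.items()}

for label in sorted(priors):
    print(label, counts[label], priors[label])

# True when every class has the same number of samples.
balanced = len(set(counts.values())) == 1
print(balanced)
```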

using System;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace NaiveBayes
{
    public partial class Form1 : Form
    {
        private string[] vocabArray;
        // After training these hold log p(word | class) for classes 1, 2 and 3
        private double[] p0Num, p1Num, p2Num;

        public Form1()
        {
            InitializeComponent();
            label2.Text = "Sports = 1, Entertainment = 2, Military = 3\r\n10 training samples per category\r\nAll articles come from Sohu News\r\nVocabulary obtained by segmenting articles from each category";
            // Load the slash-separated vocabulary
            string line, all = "";
            using (StreamReader sr = new StreamReader("vocabList.txt", Encoding.Default))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    all += line;
                }
            }
            vocabArray = all.Split(new string[] { "/" }, StringSplitOptions.RemoveEmptyEntries);
        }

        private void Form1_Resize(object sender, EventArgs e)
        {
            this.Width = 800;
            this.Height = 600;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Build the document matrix and the category label vector
            DirectoryInfo di = new DirectoryInfo("train");
            FileInfo[] fi = di.GetFiles("*.txt");
            string[] trainMatrix = new string[fi.Length];
            p0Num = new double[vocabArray.Length];
            p1Num = new double[vocabArray.Length];
            p2Num = new double[vocabArray.Length];
            // Smoothing: counts start at 1 and denominators at 2
            double p0Denom = 2.0;
            double p1Denom = 2.0;
            double p2Denom = 2.0;
            for (int i = 0; i < vocabArray.Length; i++)
            {
                p0Num[i] = p1Num[i] = p2Num[i] = 1.0;
            }
            string trainCategory = "";
            int m = 0;
            foreach (FileInfo i in fi)
            {
                string line, all = "";
                using (StreamReader sr = new StreamReader(i.FullName, Encoding.Default))
                {
                    while ((line = sr.ReadLine()) != null)
                    {
                        all += line;
                    }
                }
                // One character per vocabulary word: 1 if it occurs in the article, else 0
                string strVec = "";
                foreach (string j in vocabArray)
                {
                    if (all.Contains(j))
                        strVec += "1";
                    else
                        strVec += "0";
                }
                trainMatrix[m] = strVec;
                m++;
                // The label is the single character after the last underscore in the file name
                trainCategory += i.Name.Substring(i.Name.LastIndexOf("_") + 1, 1);
            }
            // Both files are opened in append mode
            StreamWriter sw = new StreamWriter(".\\trainV\\trainMatrix.txt", true);
            foreach (string i in trainMatrix)
            {
                sw.WriteLine(i);
                sw.Flush();
            }
            sw.Close();
            sw = new StreamWriter(".\\trainV\\trainCategory.txt", true);
            sw.WriteLine(trainCategory);
            sw.Close();
            // Accumulate per-class word counts
            for (int i = 0; i < trainMatrix.Length; i++)
            {
                if (trainCategory.Substring(i, 1) == "1")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p0Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p0Denom += tmp;
                }
                else if (trainCategory.Substring(i, 1) == "2")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p1Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p1Denom += tmp;
                }
                else if (trainCategory.Substring(i, 1) == "3")
                {
                    double tmp = 0;
                    for (int j = 0; j < vocabArray.Length; j++)
                    {
                        p2Num[j] += double.Parse(trainMatrix[i].Substring(j, 1));
                        tmp += double.Parse(trainMatrix[i].Substring(j, 1));
                    }
                    p2Denom += tmp;
                }
                else
                {
                    // Unrecognized label: ignore this sample
                }
            }
            // Convert the counts to log probabilities
            for (int j = 0; j < vocabArray.Length; j++)
            {
                p0Num[j] = Math.Log(p0Num[j] / p0Denom);
                p1Num[j] = Math.Log(p1Num[j] / p1Denom);
                p2Num[j] = Math.Log(p2Num[j] / p2Denom);
            }
            label4.Text = "Finished processing the sample data";
        }

        private void button2_Click(object sender, EventArgs e)
        {
            if (textBox1.Text.Trim() != "")
            {
                // Vectorize the input text against the vocabulary
                string strVec = "";
                foreach (string i in vocabArray)
                {
                    if (textBox1.Text.Contains(i))
                        strVec += "1";
                    else
                        strVec += "0";
                }
                // The class priors are equal, so log p(Ci) is left out of the scores
                double p0 = 0;
                double p1 = 0;
                double p2 = 0;
                for (int j = 0; j < vocabArray.Length; j++)
                {
                    p0 += p0Num[j] * double.Parse(strVec.Substring(j, 1));
                    p1 += p1Num[j] * double.Parse(strVec.Substring(j, 1));
                    p2 += p2Num[j] * double.Parse(strVec.Substring(j, 1));
                }
                string catelog = "";
                if (p0 > p1 && p0 > p2)
                    catelog = "Sports";
                else if (p1 > p0 && p1 > p2)
                    catelog = "Entertainment";
                else if (p2 > p0 && p2 > p1)
                    catelog = "Military";
                else
                    catelog = "Cannot be determined";
                label3.Text = "Sports: " + p0.ToString() + "\r\nEntertainment: " + p1.ToString() + "\r\nMilitary: " + p2.ToString();
                label1.Text = "Category: " + catelog;
            }
        }
    }
}
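
For comparison, the same three-class training and scoring can be sketched compactly in Python. This is an illustrative rewrite under the same assumptions as the C# code (counts start at 1, denominators at 2.0, equal class priors so log p(Ci) is omitted), not the author's implementation. The function names train_nb and classify_nb and the six-word toy data are made up for the example.

```python
import math

def train_nb(train_matrix, train_labels):
    """Return {label: [log p(word_j | label)]} with counts starting at 1."""
    num_words = len(train_matrix[0])
    numer = {}  # per-class word counts, initialised to 1
    denom = {}  # per-class totals, initialised to 2.0
    for vec, label in zip(train_matrix, train_labels):
        if label not in numer:
            numer[label] = [1.0] * num_words
            denom[label] = 2.0
        for j, x in enumerate(vec):
            numer[label][j] += x
        denom[label] += sum(vec)
    return {c: [math.log(n / denom[c]) for n in numer[c]] for c in numer}

def classify_nb(vec, log_probs):
    """Pick the class with the highest summed log-likelihood (equal priors assumed)."""
    scores = {c: sum(x * lp for x, lp in zip(vec, lps))
              for c, lps in log_probs.items()}
    return max(scores, key=scores.get), scores

# Toy data (made up): columns stand for [国足, NBA, 电影, 导演, 军队, 导弹].
train_matrix = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
]
train_labels = ["1", "1", "2", "2", "3", "3"]

model = train_nb(train_matrix, train_labels)
label, scores = classify_nb([0, 0, 1, 1, 0, 0], model)
print(label)  # prints 2
```

Unlike the C# version, this sketch does not report ties separately: when two classes score the same, max simply returns the first one it encounters.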
