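Machine Learning Basics - Bayes: building an LDA topic model on a bag-of-words corpus 1. gensim.corpora.Dictionary (build the mapping dictionary) 2. dictionary.doc2bow (apply the mapping) 3. gensim.models.ldamodel.LdaModel (build the topic model) 4. lda.print_topics (print the topics)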

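1. dictionary = gensim.corpora.Dictionary(clean_content) # builds a token-to-integer-id mapping from the input, which is a list of tokenized documents

2. corpus = [dictionary.doc2bow(cl_content) for cl_content in clean_content] # converts each element of clean_content into a list of (token id, count) pairs using dictionary

3. lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20) # corpus is the list of mapped documents, id2word is the dictionary used to turn ids back into words, num_topics is the number of topics

4. lda.print_topic(1, topn=5) # returns the 5 highest-weighted words of the topic with id 1 (topic ids start at 0); lda.print_topics(num_topics=20, num_words=5) does the same for every topic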
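To make the dictionary and mapping steps concrete, here is a plain-Python sketch of the idea behind them. It is not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, the toy documents are made up, and ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy corpus: a list of tokenized documents (same shape as clean_content)
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)  # {'apple': 0, 'banana': 1, 'cherry': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a sparse vector of counts, and word order is discarded. That is the bag-of-words representation LDA is trained on.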

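Step 1: Load the corpus data

Step 2: Tokenize the text

Step 3: Load the stop-word list and remove the stop words from the corpus

Step 4:

Build the token-to-id dictionary

Map every document through the dictionary

Build the LDA topic model

Print a topic from the model together with its top 5 words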
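As a rough illustration of steps 2 and 3, the sketch below tokenizes two toy sentences and filters out stop words. Everything in it is an assumption made for the example: the whitespace tokenizer only stands in for a real Chinese segmenter such as jieba, and the documents and stop-word list are invented.

```python
def tokenize(text):
    # Stand-in tokenizer: split on whitespace.
    # Real Chinese text needs a proper word segmenter instead.
    return text.split()

def remove_stopwords(tokenized_docs, stopwords):
    # A set gives constant-time membership tests per token
    stopword_set = set(stopwords)
    return [[tok for tok in doc if tok not in stopword_set]
            for doc in tokenized_docs]

raw_docs = ["the cat sat on the mat", "a dog and a cat"]
stopwords = ["the", "a", "on", "and"]

tokenized = [tokenize(doc) for doc in raw_docs]
clean = remove_stopwords(tokenized, stopwords)
print(clean)  # [['cat', 'sat', 'mat'], ['dog', 'cat']]
```

The result is a list of token lists, one per document. This is the shape the dictionary-building step expects as input.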

import pandas as pd
import numpy as np
import jieba

# 1. Load the news corpus
df_data = pd.read_table('data/val.txt', names=['category', 'theme', 'URL', 'content'], encoding='utf-8')

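# 2. Tokenize every document in the corpus
df_contents = df_data.content.values.tolist()

# Jie_content is a list of lists: one token list per document
Jie_content = []
for df_content in df_contents:
    split_content = jieba.lcut(df_content)
    # Keep only documents that are not blank and that yield more than one token
    if len(split_content) > 1 and df_content.strip():
        Jie_content.append(split_content)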

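# 3. Load the stop-word list.
# sep='\t' sets the delimiter, quoting=3 (csv.QUOTE_NONE) disables special handling of quote characters,
# names sets the column name, index_col=False keeps the first column from being used as the row index,
# and encoding sets the file encoding
stopwords = pd.read_csv('stopwords.txt', sep='\t', quoting=3, names=['stopwords'], index_col=False, encoding='utf-8')
print(stopwords.head())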

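# Remove the stop words from the tokenized text
def drop_stops(Jie_content, stopwords):
    stopword_set = set(stopwords)  # set membership tests are much faster than list lookups
    clean_content = []  # one filtered token list per document
    all_words = []      # every kept token, flattened across documents
    for j_content in Jie_content:
        line_clean = []
        for word in j_content:
            if word in stopword_set:
                continue
            line_clean.append(word)
            all_words.append(word)
        clean_content.append(line_clean)
    return clean_content, all_words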

# Convert the stopwords column of the DataFrame to a plain list
stopwords = stopwords.stopwords.values.tolist()
clean_content, all_words = drop_stops(Jie_content, stopwords)
print(clean_content[0])

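# 4. Build the LDA topic model
import gensim
from gensim import corpora

# Build the token -> integer id dictionary with gensim.corpora.Dictionary (a bag-of-words vocabulary, not word2vec)
dictionary = corpora.Dictionary(clean_content)
print(len(dictionary))  # number of distinct tokens

# Map each document in clean_content to a bag-of-words vector of (token id, count) pairs
corpus = [dictionary.doc2bow(clean_c) for clean_c in clean_content]

# Train LDA with 20 topics
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)

# Print the 5 highest-weighted words of the topic with id 1
print(lda.print_topic(1, topn=5))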
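print_topic returns the topic as a single formatted string, which is awkward to process further. The sketch below parses such a string into (word, weight) pairs. It assumes the format weight*"word" joined by " + ", which is what recent gensim versions produce. The helper name `parse_topic` and the example string are made up for illustration and are not output from the model above.

```python
def parse_topic(topic_str):
    """Split a string like '0.012*"market" + 0.009*"company"' into (word, weight) pairs."""
    pairs = []
    for term in topic_str.split(" + "):
        weight, word = term.split("*", 1)
        # Strip surrounding whitespace and the optional double quotes around the word
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# Hypothetical topic string, not real model output
example = '0.012*"market" + 0.009*"company" + 0.007*"price"'
print(parse_topic(example))
# [('market', 0.012), ('company', 0.009), ('price', 0.007)]
```

If you only need the pairs, lda.show_topic(1, topn=5) should return them directly, so parsing is mainly useful when all you have is the printed string.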
