机器学习入门-贝叶斯构造LDA主题模型，构造word2vec 1.gensim.corpora.Dictionary(构造映射字典) 2.dictionary.doc2vec(做映射) 3.gensim.model.ldamodel.LdaModel(构建主题模型)4lda.print

1.dictionary = gensim.corpora.Dictionary(clean_content) 对输入的列表做一个数字映射字典，

2. corpus = [dictionary,doc2vec(cl_content) for cl_content in clean_content] # 输出clean_content每一个元素根据dictionary做数字映射后的结果

3.lda = gensim.model.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20) # corpus表示映射后的文本列表， id2word表示根据哪个数字映射字典张开， num_topics表示主题的个数

4. lda.print_topics(1, topn=5) # 打印第一个主题，前5个词

第一步：载入语料库数据

第二步：进行分词操作

第三步：载入停用词表，去除语料库中的停用词

第四步：

构建数字映射字典

对文本做逐个映射

构建LDA主题模型

打印主题模型的主题和前5个主题词

import pandas as pd

import numpy as np

import jieba

# 1.导入数据语料的新闻数据

df_data = pd.read_table('data/val.txt', names=['category', 'theme', 'URL', 'content'], encoding='utf-8')

# 2.对语料库进行分词操作

df_contents = df_data.content.values.tolist()

# list of list 结构

Jie_content = []

for df_content in df_contents:

    split_content = jieba.lcut(df_content)

    if len(split_content) > 1 and split_content != '\t\n':

        Jie_content.append(split_content)

# 3. 导入停止词的语料库, sep='\t'表示分隔符， quoting控制引号的常量， names=列名， index_col=False，不用第一列做为行的列名， encoding

stopwords = pd.read_csv('stopwords.txt', sep='\t', quoting=3, names=['stopwords'], index_col=False, encoding='utf-8')

print(stopwords.head())

# 对文本进行停止词的去除

def drop_stops(Jie_content, stopwords):

    clean_content = []

    all_words = []

    for j_content in Jie_content:

        line_clean = []

        for line in j_content:

            if line in stopwords:

                continue

            line_clean.append(line)

            all_words.append(line)

        clean_content.append(line_clean)

    return clean_content, all_words

# 将DateFrame的stopwords数据转换为list形式

stopwords = stopwords.stopwords.values.tolist()

clean_content, all_words = drop_stops(Jie_content, stopwords)

print(clean_content[0])

# 4. 进行LDA主题模型

import gensim

from gensim import corpora

# 使用gensim.dictionary 生成word2vec

dictionary = corpora.Dictionary(clean_content)

print(np.shape(dictionary))

# 对clean_content 根据dictionary映射构造向量

corpus = [dictionary.doc2bow(clean_c) for clean_c in clean_content]

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)

print(lda.print_topic(1, topn=5))

机器学习入门-贝叶斯构造LDA主题模型，构造word2vec 1.gensim.corpora.Dictionary(构造映射字典) 2.dictionary.doc2vec(做映射) 3.gensim.model.ldamodel.LdaModel(构建主题模型)4lda.print_topics(打印主题).的更多相关文章

机器学习入门-贝叶斯中文新闻分类任务 1. .map(做标签数字替换) 2.CountVectorizer(词频向量映射) 3.TfidfVectorizer(TFDIF向量映射) 4.MultinomialNB()贝叶斯模型构建
1.map做一个标签的数字替换 2.vec = CountVectorizer(lowercase=False, max_features=4000) # 从sklean.extract_featu ...
吴裕雄 python 机器学习——多项式贝叶斯分类器MultinomialNB模型
import numpy as np import matplotlib.pyplot as plt from sklearn import datasets,naive_bayes from skl ...
机器学习朴素贝叶斯 SVC对新闻文本进行分类
朴素贝叶斯分类器模型(Naive Bayles) Model basic introduction: 朴素贝叶斯分类器是通过数学家贝叶斯的贝叶斯理论构造的,下面先简单介绍贝叶斯的几个公式: 先验概率: ...
吴裕雄 python 机器学习——高斯贝叶斯分类器GaussianNB
import matplotlib.pyplot as plt from sklearn import datasets,naive_bayes from sklearn.model_selectio ...
Python之机器学习-朴素贝叶斯(垃圾邮件分类)
目录朴素贝叶斯(垃圾邮件分类) 邮箱训练集下载地址模块导入文本预处理遍历邮件训练模型测试模型朴素贝叶斯(垃圾邮件分类) 邮箱训练集下载地址邮箱训练集可以加我微信:nickchen121 ...
机器学习---朴素贝叶斯与逻辑回归的区别（Machine Learning Naive Bayes Logistic Regression Difference）
朴素贝叶斯与逻辑回归的区别: 朴素贝叶斯逻辑回归生成模型(Generative model) 判别模型(Discriminative model) 对特征x和目标y的联合分布P(x,y)建模,使用 ...
spark 机器学习朴素贝叶斯实现(二)
已知10月份10-22日网球场地,会员打球情况通过朴素贝叶斯算法,预测23,24号是否适合打网球.结果,日期,天气温度风速结果(0否,1是)天气(0晴天,1阴天,2下雨)温度(0热,1舒适,2冷) ...
spark 机器学习朴素贝叶斯原理(一)
朴素贝叶斯算法仍然是流行的挖掘算法之一,该算法是有监督的学习算法,解决的是分类问题,如客户是否流失.是否值得投资.信用等级评定等多分类问题.该算法的优点在于简单易懂.学习效率高.在某些领域的分类问题中 ...
机器学习入门-文本特征-使用LDA主题模型构造标签 1.LatentDirichletAllocation(LDA用于构建主题模型) 2.LDA.components(输出各个词向量的权重值)
函数说明 1.LDA(n_topics, max_iters, random_state) 用于构建LDA主题模型,将文本分成不同的主题参数说明:n_topics 表示分为多少个主题, max_i ...

随机推荐

Thrift 个人实战--Thrift 网络服务模型（转）
前言: Thrift作为Facebook开源的RPC框架, 通过IDL中间语言, 并借助代码生成引擎生成各种主流语言的rpc框架服务端/客户端代码. 不过Thrift的实现, 简单使用离实际生产环境还 ...
oracle 日期时间函数
ORACLE日期时间函数大全 TO_DATE格式(以时间:2007-11-02 13:45:25为例) Year: yy two digits 两位年 ...
FastAdmin CMS 插件下载
FastAdmin CMS 插件下载 CMS内容管理系统插件(含小程序) 自定义内容模型.自定义单页.自定义表单.自定义会员发布.付费阅读.小程序等提供全部前后端源代码和小程序源代码功能特性基于 ...
WPF Demo4
<Window x:Class="Demo4.MainWindow" xmlns="http://schemas.microsoft.com/winfx/2006/ ...
Neutron 理解 (2): 使用 Open vSwitch + VLAN 组网 [Neutron Open vSwitch + VLAN Virtual Network]
学习 Neutron 系列文章: (1)Neutron 所实现的虚拟化网络 (2)Neutron OpenvSwitch + VLAN 虚拟网络 (3)Neutron OpenvSwitch + GR ...
廖雪峰Java1-2程序基础-5浮点数运算
1.浮点数运算的特点很多浮点数无法精确表示计算有误差整型可以自动提升到浮点型如0.1用二进制表示会是一个无限循环的小数.计算机不可能在有限内存中表示一个无限小数.因此浮点数不能精确表示.也造成 ...
python selenium 问题汇总
FAQ 1.python+selenium+Safari浏览器,定位元素 selenium.common.exceptions.ElementNotVisibleException: Message: ...
java单机操作redis3.2.10和集群操作增删改查
先直接附上单机版的连接和增删改查,7000-7005是端口号 package com.yilian.util; import java.util.HashMap; import java.util.I ...
STL容器能力一览表和各个容器操作函数异常保证
STL容器能力一览表 Vector Deque List Set Multiset map Multimap 典型内部结构 dynamic array Array of arrays Doubly ...
Mybatis 接口绑定
MyBatis的接口绑定: 参考链接:http://blog.csdn.net/chris_mao/article/details/48836039 接口映射就是在IBatis中任意定义接口,然后把接 ...

机器学习入门-贝叶斯构造LDA主题模型，构造word2vec 1.gensim.corpora.Dictionary(构造映射字典) 2.dictionary.doc2vec(做映射) 3.gensim.model.ldamodel.LdaModel(构建主题模型)4lda.print_topics(打印主题).

机器学习入门-贝叶斯构造LDA主题模型，构造word2vec 1.gensim.corpora.Dictionary(构造映射字典) 2.dictionary.doc2vec(做映射) 3.gensim.model.ldamodel.LdaModel(构建主题模型)4lda.print_topics(打印主题).的更多相关文章

随机推荐

热门专题