Python常用功能函数系列总结（三）

本节目录

常用函数一：词频统计
常用函数二：word2vec
常用函数三：doc2vec
常用函数四：LDA主题分析

常用函数一：词频统计

# -*- coding: utf-8 -*-

"""

Datetime: 2020/06/25

Author: Zhang Yafei

Description: 统计词频

输入 文件名 列名 分割符

输出 词频统计结果-文件

"""

from collections import Counter

import pandas as pd

def count_word_freq(file_path, col_name, to_file, sep='; ', multi_table=False):

    """

    统计词频

    :param file_path: 读取文件路径

    :param col_name: 统计词频所在列名

    :param to_file: 保存文件路径

    :param sep: 词语分割符

    :param multi_table: 是否读取多张表

    :return:

    """

    if multi_table:

        datas = pd.read_excel(file_path, header=None, sheet_name=None)

        with pd.ExcelWriter(path=to_file) as writer:

            for sheet_name in datas:

                df = datas[sheet_name]

                keywords = (word for word_list in df.loc[df[col_name].notna(), col_name].str.split(sep) for word in word_list if word)

                words_freq = Counter(keywords)

                words = [word for word in words_freq]

                freqs = [words_freq[word] for word in words]

                words_df = pd.DataFrame(data={'word': words, 'freq': freqs})

                words_df.sort_values('freq', ascending=False, inplace=True)

                words_df.to_excel(excel_writer=writer, sheet_name=sheet_name, index=False)

            writer.save()

    else:

        df = pd.read_excel(file_path)

        keywords = (word for word_list in df.loc[df[col_name].notna(), col_name].str.split(sep) for word in word_list if word)

        words_freq = Counter(keywords)

        words = [word for word in words_freq]

        freqs = [words_freq[word] for word in words]

        words_df = pd.DataFrame(data={'word': words, 'freq': freqs})

        words_df.sort_values('freq', ascending=False, inplace=True)

        words_df.to_excel(to_file, index=False)

if __name__ == '__main__':

    # 对data.xlsx所有表中的keyword列统计词频，以默认'; '为分割符切割词语，统计该列分词后的词频，结果保存至res.xlsx中

    count_word_freq(file_path='data.xlsx', col_name='keyword', to_file='res.xlsx', multi_table=True)

经验分享：注意输入格式为excel文件，这也是我学习生活中常用的处理方式，直接拿去用，非常方便

另外，在我之前的一篇博客中，我介绍了Python统计词频常用的几种方式，不同的场景可以满足你各自的需求。博客传送门：

https://www.cnblogs.com/zhangyafei/p/10653977.html

常用函数二：word2vec

word2vec是一种词向量技术，核心思想是把单词转换成向量，意思相近的单词向量间的距离越近，反之越远。实际使用的体验也是非常好。

# -*- coding: utf-8 -*-

"""

Datetime: 2019/7/25

Author: Zhang Yafei

Description: word2vec

data.txt

    word1 word2 word3 ...

    word1 word2 word3 ...

    word1 word2 word3 ...

    ...   ...   ...   ...

"""

import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.word2vec import LineSentence

from gensim.models import Word2Vec

def word2vec_model_train(file, model_path, ):

    model = Word2Vec(LineSentence(file), size=100, window=5, iter=10, min_count=5)

    model.save(model_path)

def word2vec_load(self, model_path):

    model = Word2Vec.load(model_path)

    print(model.similarity('生育意愿', '主观幸福感'))

    for key in model.wv.similar_by_word('新生代农民工', topn=50):

        print(key)

if __name__ == "__main__":

    word2vec_model_train(file='data.txt', model_path='word2vec_keywords.model')

    # word2vec_load(model_path='word2vec_keywords.model')

常用函数三：doc2vec

doc2vec和word2vec类似, word2vec是词向量技术，那么doc2vec见名知意就是文档向量技术，可以将一篇文档转换成一个向量。理论上讲，意思相近的句子向量间的距离越近。

# -*- coding: utf-8 -*-

"""

Datetime: 2019/7/14

Author: Zhang Yafei

Description: doc2vec

docs format

    TaggedDocument([word1, word2, ...], [doc tag])

    TaggedDocument([word1, word2, ...], [doc tag])

    TaggedDocument([word1, word2, ...], [doc tag])

    ...

"""

import os

import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from matplotlib import pyplot as plt

from sklearn import metrics

from sklearn.cluster import KMeans

import pandas as pd

import numpy as np

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

output_dir = 'res'

model_dir = 'model'

if not os.path.exists(model_dir):

    os.mkdir(model_dir)

if not os.path.exists(output_dir):

    os.mkdir(output_dir)

def data_preparetion():

    """

    数据预处理 准备文档词矩阵

    :return [TaggedDocument(words=['contribut', 'antarctica', 'past', 'futur', 'sea-level', 'rise'], tags=[0]),

             TaggedDocument(words=['evid', 'limit', 'human', 'lifespan'], tags=[1]),

             ...]

    """

    print('开始准备文档语料')

    df = pd.read_excel('data/data.xlsx')

    documents = iter(df.text)

    for index, doc in enumerate(documents):

        doc_word_list = doc.split()

        yield TaggedDocument(doc_word_list, [index])

def get_datasest():

    df = pd.read_excel('data/data.xlsx')

    documents = iter(df.text)

    datasets = []

    for index, doc in enumerate(documents):

        doc_word_list = doc.split()

        datasets.append(TaggedDocument(doc_word_list, [index]))

    return datasets

class Doc2VecModel(object):

    """

    Doc2Vec模型

    """

    def __init__(self, vector_size=100, dm=0, window=10, epochs=30, iter_num=10):

        self.model = Doc2Vec(vector_size=vector_size,

                             dm=dm,

                             window=window,

                             epochs=epochs,

                             iter=iter_num,

                             )

    def run(self, documents, model_path, epochs=30):

        """

        训练模型及结果的保存

        :param documents: iterable [[doc1], [doc2], [doc3], ...]

        :param model_path: str

        :param max_epochs: int

        :param epochs: int

        :return:

        """

        # 根据文档词矩阵构建词汇表

        print('开始构建词汇表')

        self.model.build_vocab(documents)

        print('开始训练')

        self.model.train(documents, total_examples=self.model.corpus_count, epochs=epochs)

        # 模型保存

        self.model.save(f'{model_dir}/{model_path}')

        print(f'{model_path}\t保存成功')

    @staticmethod

    def simlarity_cal(vector1, vector2):

        vector1_mod = np.sqrt(vector1.dot(vector1))

        vector2_mod = np.sqrt(vector2.dot(vector2))

        if vector2_mod != 0 and vector1_mod != 0:

            simlarity = (vector1.dot(vector2)) / (vector1_mod * vector2_mod)

        else:

            simlarity = 0

        return simlarity

    def model_test(self):

        doc2vec_model = Doc2Vec.load(f'{model_dir}/doc2vec.model')

        vectors_docs = doc2vec_model.docvecs.vectors_docs

        datasets = get_datasest()

        sentence1 = '老年人 生活满意度 影响 全国 老年人口 健康状况 调查数据 以往 社会经济因素 健康 因素 人口因素 老年人 生活满意度 影响 基础 引入 变量 模型 分析 老年人 生活满意度 自评 影响 统计 控制 影响因素 基础 老年人 性格 情绪 孤独感 焦虑 程度 生活满意度 自评 影响 影响 原有 模型 变量 变化 生活满意度 老年人'

        inferred_vector = doc2vec_model.infer_vector(sentence1)

        sims = doc2vec_model.docvecs.most_similar([inferred_vector], topn=10)

        for count, sim in sims:

            sentence = datasets[count]

            words = ''

            for word in sentence[0]:

                words = words + word + ' '

            print(words, sim, len(sentence[0]))

    def get_topic_num(self, min_topic_num, max_topic_num):

        doc2vec_model = Doc2Vec.load(f'{model_dir}/doc2vec.model')

        vectors_docs = doc2vec_model.docvecs.vectors_docs

        silhouette_score_dict = {}

        ch_score_dict = {}

        inertia_score = {}

        for n in range(min_topic_num, max_topic_num + 1):

            km = KMeans(n_clusters=n)

            km.fit(X=vectors_docs)

            pre_labels = km.labels_

            inertia = km.inertia_

            sil_score = metrics.silhouette_score(X=vectors_docs, labels=pre_labels)

            ch_score = metrics.calinski_harabaz_score(X=vectors_docs, labels=pre_labels)

            print(f'{n} inertia score: {inertia} silhouette_score: {sil_score} ch score: {ch_score}')

            inertia_score[n] = inertia

            silhouette_score_dict[n] = sil_score

            ch_score_dict[n] = ch_score

        self.plot_image(data=silhouette_score_dict, xticks=range(min_topic_num, max_topic_num + 1),

                        title='不同聚类个数下silhouette_score对比', xlabel='cluster_num',

                        ylabel='silhouette_score')

        self.plot_image(data=ch_score_dict, xticks=range(min_topic_num, max_topic_num + 1),

                        title='不同聚类个数下calinski_harabaz_score对比', xlabel='cluster_num',

                        ylabel='calinski_harabaz_score')

        self.plot_image(data=inertia_score, xticks=range(min_topic_num, max_topic_num + 1),

                        title='不同聚类个数下inertia score对比',

                        xlabel='cluster_num', ylabel='inertia_score')

    @staticmethod

    def plot_image(data, title, xticks, xlabel, ylabel):

        """ 画图 """

        plt.rcParams['font.sans-serif'] = ['SimHei']

        plt.figure(figsize=(8, 4), dpi=500)

        plt.plot(data.keys(), data.values(), '#007A99')

        plt.xticks(xticks)

        plt.xlabel(xlabel)

        plt.ylabel(ylabel)

        plt.title(title)

        plt.savefig(f'{output_dir}/{title}.png',

                    bbox_inches='tight', pad_inches=0.1)

        plt.show()

if __name__ == '__main__':

    docs = data_preparetion()

    model = Doc2VecModel(vector_size=100, epochs=30, window=10, dm=0, iter_num=20)

    model.run(documents=docs, model_path=f'doc2vec.model')

    # model.model_test()

    # model.get_topic_num(min_topic_num=5, max_topic_num=40)

常用函数四：LDA主题分析

LDA(Latent dirichlet allocation)是文档主题生成模型中最有代表性的一种。LDA于2003年由David Blei等人提出，由于其应用简单且有效，在学术界被广泛应用在主题聚类、热点识别、演化分析等领域。

# -*- coding: utf-8 -*-

"""

Datetime: 2019/7/14

Author: Zhang Yafei

Description: LDA主题模型

安装依赖环境

pip install pandas numpy matplotlib sklearn

使用说明：

1. 数据准备

index, docs = data_preparetion(path='data/数据.xlsx', doc_col='摘要')

数据格式为excel, path是数据文件路径, doc_col是列名, 需修改数据文件路径和文档列名

2. LDA模型参数设定

LDA模型指定主题个数范围

def main(index=index, docs=docs, test_topic_num=True, tfidf=False, max_iter=50, min_topic=5, max_topic=30,

         topic_word_num=20)

    :param index: 索引

    :param docs: 文档

    :param n_topics: 指定主题个数

    :param tfidf: 是否对文档采用tfidf编码

    :param max_iter: 最大迭代次数

    :param min_topic: 最小主题个数 前提为test_topic_num=True

    :param max_topic: 最大主题个数 前提为test_topic_num=True

    :param learning_offset: 学习率

    :param random_state: 随机状态值

    :param test_topic_num: 测试主题个数

    :param topic_word_num: 主题词矩阵词的个数

"""

import json

import os

import time

from functools import wraps

import numpy as np

import pandas as pd

import scipy

from matplotlib import pyplot as plt

from sklearn.decomposition import LatentDirichletAllocation

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.max_columns', None)

output_dir = 'res'

if not os.path.exists(output_dir):

    os.mkdir(output_dir)

def timeit(func):

    """ 时间装饰器 """

    @wraps(func)

    def inner(*args, **kwargs):

        start_time = time.time()

        ret = func(*args, **kwargs)

        end_time = time.time() - start_time

        if end_time < 60:

            print(f'共花费时间：', round(end_time, 2), '秒')

        else:

            minute, sec = divmod(end_time, 60)

            print(f'花费时间\t{round(minute)}分\t{round(sec, 2)}秒')

        return ret

    return inner

class Articles(object):

    def __init__(self, data, stopwords=False):

        self.data = data

        if stopwords:

            self.stopwords = set([line.strip() for line in open('data/stopwords.txt')])

        else:

            self.stopwords = None

    def __iter__(self):

        if self.stopwords:

            for word_list in self.data:

                yield ' '.join(self.pro_words_with_stopwords(word_list))

        else:

            for word_list in self.data:

                yield ' '.join(self.pro_words(word_list))

    @staticmethod

    def word_replace(word):

        return word.replace(' & ', '_____').replace('/', '___').replace(', ', '__'). \

            replace(',', '__').replace(' ', '_').replace('-', '____'). \

            replace('(', '______').replace(')', '______')

    def pro_words_with_stopwords(self, word_list):

        return (self.word_replace(word) for word in word_list if word.lower() not in self.stopwords)

    def pro_words(self, word_list):

        return (self.word_replace(word) for word in word_list)

class SklearnLDA(object):

    def __init__(self, corpus, n_topics, tf_idf=True, max_iter=10, learning_method='online', learning_offset=50.,

                 random_state=0, res_dir='res', english_words_fixed=False):

        self.tfidf = tf_idf

        self.lda_model = LatentDirichletAllocation(n_components=n_topics, max_iter=max_iter,

                                                   doc_topic_prior=0.001, topic_word_prior=0.02,

                                                   learning_method=learning_method,

                                                   learning_offset=learning_offset,

                                                   random_state=random_state)  # 定义lda模型

        print('正在将语料转化为向量------------')

        self.vectorizer = TfidfVectorizer() if tf_idf else CountVectorizer()

        self.bow_corpus = self.vectorizer.fit_transform(corpus)  # 将语料生成词袋向量

        if english_words_fixed:

            self.vocab = self.fixed_vocab()

        else:

            self.vocab = self.vectorizer.get_feature_names()  # 词汇表

        self.res_dir = res_dir

    def fixed_vocab(self):

        return [

            vocab.replace('_____', ' & ').replace('____', '-').replace('___', '/').replace('__', ',').replace('_', ' ')

            for vocab in self.vectorizer.get_feature_names()]

    def get_topic_num(self, index, max_iter=10, min_topic=5, max_topic=30, learning_offset=50., random_state=0,

                      topic_word_num=30):

        """ 确定LDA主题个数 """

        print('开始训练模型, 计算困惑度')

        perplexity_dict = {}

        kld_list = {}

        jsd_list = {}

        cos_sim_list = {}

        w_score_dict = {}

        x_ticks = list(range(min_topic, max_topic + 1))

        for n_topics in x_ticks:

            result_dir = f'{self.res_dir}/{n_topics}'

            if not os.path.exists(result_dir):

                os.mkdir(result_dir)

            if os.path.exists(f'{result_dir}/topic-word-{topic_word_num}.csv'):

                doc_topic_matrix = np.loadtxt(f'{result_dir}/doc_topic_matrix.txt')

                topic_word_matrix = np.loadtxt(f'{result_dir}/topic_word_matrix.txt')

            else:

                lda = LatentDirichletAllocation(n_components=n_topics, max_iter=max_iter, learning_method='online',

                                                doc_topic_prior=0.001, topic_word_prior=0.02,

                                                learning_offset=learning_offset,

                                                random_state=random_state)  # 定义lda模型

                doc_topic_matrix = lda.fit_transform(self.bow_corpus)

                topic_word_matrix = lda.components_

                # 计算困惑度

                perplexity = lda.perplexity(self.bow_corpus)

                perplexity_dict[n_topics] = perplexity

                print(f'topic: {n_topics}\tsklearn preplexity: {perplexity:.3f}')

                # 保存数据

                np.savetxt(f'{result_dir}/doc_topic_matrix.txt', doc_topic_matrix)

                np.savetxt(f'{result_dir}/topic_word_matrix.txt', topic_word_matrix)

                doc_topic_columns = [f'topic{num}' for num in range(

                    1, n_topics + 1)]

                topic_word_columns = [

                    f'word{num}' for num in range(1, topic_word_num + 1)]

                doc_topic_index = index

                topic_word_index = pd.Index(data=doc_topic_columns, name='topic')

                doc_topic_data = np.argsort(-doc_topic_matrix, axis=1)

                topic_word_data = np.array(self.vocab)[np.argsort(-topic_word_matrix, axis=1)[:, :topic_word_num]]

                self.save_data(file_path=f'{result_dir}/doc-topic.csv', data=doc_topic_data,

                               columns=doc_topic_columns, index=doc_topic_index)

                self.save_data(file_path=f"{result_dir}/topic-word-{topic_word_num}.csv", data=topic_word_data,

                               columns=topic_word_columns, index=topic_word_index)

            # 计算文本–主题最大平均分布概率和主题–词语平均相似度概率的加权数值的方法

            w_score = self.weight_score(doc_topic_matrix, topic_word_matrix)

            w_score_dict[n_topics] = w_score

            # 计算KL距离和JS距离

            kld_sum = 0

            jsd_sum = 0

            for topic_vec1 in topic_word_matrix:

                for topic_vec2 in topic_word_matrix:

                    kld_sum += self.kl_divergence(topic_vec1, topic_vec2)

                    jsd_sum += self.js_divergence(topic_vec1, topic_vec2)

            avg_kld = kld_sum / (n_topics ** 2)

            kld_list[n_topics] = avg_kld

            avg_jsd = jsd_sum / (n_topics ** 2)

            jsd_list[n_topics] = avg_jsd

            # 计算余弦相似度

            cos_sim_matrix = cosine_similarity(X=topic_word_matrix)

            cos_sim = cos_sim_matrix.sum() / (n_topics * (n_topics - 1))

            cos_sim_list[n_topics] = cos_sim

            # 计算JS散度

            for topic_vec1 in topic_word_matrix:

                for topic_vec2 in topic_word_matrix:

                    jsd_sum += self.js_divergence(topic_vec1, topic_vec2)

            # 打印

            print(f'topic: {n_topics}\tavg KLD: {avg_kld:.3f}')

            print(f'topic: {n_topics}\tavg JSD: {avg_jsd:.3f}')

            print(f'topic: {n_topics}\tcosine_similarity: {cos_sim:.3f}')

            print(f'topic: {n_topics}\tweight_score: {w_score:.3f}')

        # 画图

        if perplexity_dict:

            self.plot_image(data=perplexity_dict, x_ticks=list(perplexity_dict.keys()), title='lda_topic_perplexity',

                            xlabel='topic num', ylabel='perplexity')

        self.plot_image(data=kld_list, x_ticks=x_ticks, title='lda_topic_KLD',

                        xlabel='topic num', ylabel='KLD')

        self.plot_image(data=jsd_list, title='lda_topic_JSD', x_ticks=x_ticks,

                        xlabel='topic num', ylabel='JSD')

        self.plot_image(data=cos_sim_list, title='lda_topic_cosine_simlarity', x_ticks=x_ticks,

                        xlabel='topic num', ylabel='cosine_simlarity')

        self.plot_image(data=w_score_dict, title='lda_topic_weight_score', x_ticks=x_ticks,

                        xlabel='topic num', ylabel='weight_score')

    def train(self, index, topic_word_num=10, save_matrix=True, save_data=True, print_doc_topic=False,

              print_topic_word=True, save_vocab=True):

        """ 训练LDA模型 """

        print('正在训练模型')

        doc_topic_matrix = self.lda_model.fit_transform(self.bow_corpus)

        topic_word_matrix = self.lda_model.components_

        if save_vocab:

            with open('res/vocab.txt', 'w') as f:

                json.dump(self.vocab, f)

        if save_matrix:

            print('正在保存矩阵')

            if self.tfidf:

                np.savetxt(f'{output_dir}/doc_topic_tfidf_matrix.txt', doc_topic_matrix)

                np.savetxt(f'{output_dir}/topic_word_tfidf_matrix.txt', topic_word_matrix)

            else:

                np.savetxt(f'{output_dir}/doc_topic_matrix.txt', doc_topic_matrix)

                np.savetxt(f'{output_dir}/topic_word_matrix.txt', topic_word_matrix)

        if save_data:

            print('正在保存数据')

            doc_topic_columns = [f'topic{num}' for num in range(

                1, self.lda_model.n_components + 1)]

            topic_word_columns = [

                f'word{num}' for num in range(1, topic_word_num + 1)]

            doc_topic_index = index

            topic_word_index = pd.Index(data=doc_topic_columns, name='topic')

            doc_topic_data = np.argsort(-doc_topic_matrix, axis=1)

            topic_word_data = np.array(

                self.vocab)[np.argsort(-topic_word_matrix, axis=1)[:, :topic_word_num]]

            if self.tfidf:

                self.save_data(file_path=f'{output_dir}/doc-topic_tfidf.csv', data=doc_topic_data,

                               columns=doc_topic_columns, index=doc_topic_index)

                self.save_data(file_path=f"{output_dir}/topic-word-tfidf_{topic_word_num}.csv", data=topic_word_data,

                               columns=topic_word_columns, index=topic_word_index)

            else:

                self.save_data(file_path=f'{output_dir}/doc-topic.csv', data=doc_topic_data,

                               columns=doc_topic_columns, index=doc_topic_index)

                self.save_data(file_path=f"{output_dir}/topic-word-{topic_word_num}.csv", data=topic_word_data,

                               columns=topic_word_columns, index=topic_word_index)

        if print_doc_topic:

            print('正在输出文档-主题')

            for doc_num, doc_topic_index in zip(index, np.argsort(-doc_topic_matrix, axis=1)):

                print(f'{doc_num}:\t{doc_topic_index[:5]}')

        if print_topic_word:

            print('正在输出主题-词')

            for topic_num, topic_word_index in enumerate(np.argsort(-topic_word_matrix, axis=1)):

                words_list = np.array(

                    self.vocab)[topic_word_index][: 10]

                print(f'主题{topic_num}:\t{words_list}')

    @staticmethod

    def save_data(file_path, data, columns, index):

        """ 保存数据 """

        df = pd.DataFrame(data=data, columns=columns, index=index)

        df.to_csv(file_path, encoding='utf_8_sig')

        print(f'{file_path}\t保存成功')

    @staticmethod

    def kl_divergence(p, q):

        """

        有时也称为相对熵，KL距离。对于两个概率分布P、Q，二者越相似，KL散度越小。

        KL散度满足非负性

        KL散度是不对称的，交换P、Q的位置将得到不同结果。

        :param p:

        :param q:

        :return:

        """

        return scipy.stats.entropy(p, q)

    @staticmethod

    def js_divergence(p, q):

        """

        JS散度基于KL散度，同样是二者越相似，JS散度越小。

            JS散度的取值范围在0-1之间，完全相同时为0

            JS散度是对称的

        :param p:

        :param q:

        :return:

        """

        M = (p + q) / 2

        return 0.5 * scipy.stats.entropy(p, M) + 0.5 * scipy.stats.entropy(q, M)

    @staticmethod

    def weight_score(doc_topic_matrix, topic_word_matrix):

        # doc_topic_matrix = np.loadtxt('res/doc_topic_matrix.txt')

        # topic_word_matrix = np.loadtxt('res/topic_word_matrix.txt')

        # 计算最大平均主题分布概率

        max_mean_topic_prob = np.mean(np.max(doc_topic_matrix, axis=1))

        # 计算平均主题相似度

        topic_cos_sim_matrix = cosine_similarity(X=topic_word_matrix)

        topic_num = topic_cos_sim_matrix.shape[0]

        mean_topic_sim = np.sum(np.where(topic_cos_sim_matrix > 0.99, 0, topic_cos_sim_matrix)) / (

                topic_num * (topic_num - 1))

        # 加权得分

        weight_score = max_mean_topic_prob / mean_topic_sim

        # print(f'加权得分：{weight_score}')

        return weight_score

    def plot_image(self, data, title, x_ticks, xlabel, ylabel):

        """ 画图 """

        plt.figure(figsize=(12, 6), dpi=180)

        plt.plot(list(data.keys()), list(data.values()), '#007A99')

        plt.xticks(x_ticks)

        plt.xlabel(xlabel)

        plt.ylabel(ylabel)

        plt.title(title)

        plt.savefig(f'{self.res_dir}/{title}.png',

                    bbox_inches='tight', pad_inches=0.1)

        plt.show()

def data_preparetion(path, doc_col, index_col=None, sep=None, english_words_fixed=False, stopwords=False):

    """

    数据准备

    :param path: 数据路径

    :param doc_col: 文档列

    :param index_col: 索引列

    :return:

    """

    df = pd.read_excel(path)

    df.dropna(subset=[doc_col], inplace=True)

    if sep:

        docs = iter(df[doc_col].str.split(sep))

    else:

        docs = iter(df[doc_col])

    if english_words_fixed:

        documents = Articles(data=docs, stopwords=stopwords)

    else:

        documents = docs

    index_list = df[index_col] if index_col else df.index

    return index_list, documents

@timeit

def main(index, docs, n_topics=10, tfidf=False, max_iter=5, min_topic=5, max_topic=30, learning_offset=50.,

         random_state=0,

         test_topic_num=False, topic_word_num=30, res_dir='res', english_words_fixed=False):

    """

    主函数

    :param index: 索引

    :param docs: 文档

    :param n_topics: 指定主题个数

    :param tfidf: 是否对文档采用tfidf编码

    :param max_iter: 最大迭代次数

    :param min_topic: 最小主题个数 前提为test_topic_num=True

    :param max_topic: 最大主题个数 前提为test_topic_num=True

    :param learning_offset: 学习率

    :param random_state: 随机状态值

    :param test_topic_num: 测试主题个数

    :param topic_word_num: 主题词矩阵词的个数

    :param res_dir: 结果文件夹

    :return:

    """

    if not os.path.exists(res_dir):

        os.mkdir(res_dir)

    lda = SklearnLDA(corpus=docs, n_topics=n_topics, max_iter=max_iter, tf_idf=tfidf, learning_offset=learning_offset,

                     random_state=random_state, res_dir=res_dir, english_words_fixed=english_words_fixed)

    if test_topic_num:

        lda.get_topic_num(index=index, max_iter=max_iter, min_topic=min_topic, max_topic=max_topic,

                          learning_offset=learning_offset, random_state=random_state, topic_word_num=topic_word_num)

    else:

        lda.train(index=index, save_matrix=True, save_data=True,

                  print_doc_topic=False, print_topic_word=True, topic_word_num=topic_word_num)

if __name__ == '__main__':

    # 数据准备

    # index, docs = data_preparetion(path='data/山西政策3.xlsx', doc_col='标题分词')

    index, docs = data_preparetion(path='data/COVID-19-2020.xlsx', doc_col='keywords', index_col='PMID', sep='; ', english_words_fixed=True, stopwords=False)

    # LDA模型指定主题个数范围

    main(index=index, docs=docs, test_topic_num=True, tfidf=False, max_iter=50, min_topic=5, max_topic=10,

         topic_word_num=20, res_dir='res/聚类结果', english_words_fixed=True)

    # LDA模型指定主题个数

    # main(index=index, docs=docs, n_topics=19, tfidf=False, max_iter=50)

topic_evolution.py

# -*- coding: utf-8 -*-

'''

Datetime: 2019/08/16

author: Zhang Yafei

description:

colormap    https://blog.csdn.net/Mr_Cat123/article/details/78638491

'''

import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim.matutils')

import pandas as pd

import numpy as np

import os

from gensim.models import Word2Vec

import seaborn as sns

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']

# plt.figure(figsize=(16, 6), dpi=500)

class TopicEvolution(object):

    def __init__(self, data_path, doc_topic_matrix_path=None, topic_word_csv_path=None):

        self.data_path = data_path

        self.topic_word_csv_path = topic_word_csv_path

        self.doc_topic_matrix_path = doc_topic_matrix_path

    def topic_intensity_evolution(self, start_year, end_year, topic_num, res_dir='res', space=1):

        df = pd.read_excel(self.data_path)

        # print(df['年'])

        doc_topic_matrix = np.loadtxt(self.doc_topic_matrix_path.format(topic_num))

        # # 柱状图

        x = [f'topic{num}' for num in range(1, topic_num + 1)]

        y = doc_topic_matrix.mean(axis=0)

        print(x, np.mean(y))

        self.plot_bar(x=x, y=y, path=f'{res_dir}/{topic_num}/柱状图.png')

        # # # 热图

        doc_topic_df = pd.DataFrame(data=doc_topic_matrix)

        doc_topic_df.index = df['年']

        topic_intensity_df = pd.DataFrame(columns=list(range(start_year, end_year, space)))

        for year in range(start_year, end_year, space):

            topic_intensity_df[year] = doc_topic_df.loc[year, :].mean()

        topic_intensity_df.index = [f'Topic {num}' for num in range(1, topic_num + 1)]

        self.plot_heatmap(data=topic_intensity_df, cmap='Reds', xlabel='年份', ylabel='主题',

                          path=f'{res_dir}/{topic_num}/热力图.png')

        x = [int(year) for year in range(start_year, end_year, space)]

        print(x, topic_intensity_df)

        topic_intensity_df.to_excel('res/topic_intensity.xlsx')

        self.plot(x=x, data_list=topic_intensity_df, path=f'{res_dir}/{topic_num}/折线图.png')

    @staticmethod

    def plot(x, data_list, path=None):

        for index in data_list.index.unique():

            y = [num for num in data_list.loc[index, :]]

            # plt.plot(x, y)

            plt.plot(x, y, "x-", label=f'主题{index}')

        plt.savefig(path)

        # plt.legend(loc='best', labels=[f'主题{num}' for num in range(1, len(data_list.index.unique()+1))])

        plt.show()

    @staticmethod

    def plot_bar(x, y, path=None):

        plt.bar(x, y, width=0.5)

        plt.xticks(range(len(x)), x, rotation=45)

        plt.axhline(y=np.mean(y), xmin=.05, xmax=.95, ls='--', color='black')

        plt.savefig(path)

        plt.show()

    @staticmethod

    def plot_heatmap(data, cmap, xlabel, ylabel, path=None):

        if cmap:

            sns.heatmap(data, cmap=cmap)

        else:

            sns.heatmap(data)

        plt.xticks(rotation=45)

        plt.xlabel(xlabel)

        plt.ylabel(ylabel)

        # plt.title(name)

        # 保存图片

        plt.savefig(path)

        # 显示图片

        plt.show()

    def extract_keywords_txt(self):

        df = pd.read_excel(self.data_file)

        # data_key = pd.read_csv(f'{data_dir}/data_key.txt', delimiter='\t', encoding='gbk')

        # df['keywords'] = data_key.ID.apply(self.add_keywords)

        # df['keywords'] = df.apply(self.add_keywords, axis=1)

        # df.to_excel(self.data_file)

        # for year in range(2004, 2019):

        #     print(year)

        #     year_df = pd.DataFrame(columns=['ID'])

        #     year_df['ID'] = df.loc[df['年'] == year, 'keywords'].str.strip().str.replace(' ', '; ')

        #     year_df.reset_index(inplace=True, drop=True)

        #     year_df.to_csv(f'{data_dir}/{year}.txt', sep='\t')

        with open(self.keywords_txt, 'w', encoding='utf-8') as f:

            for text in df.keywords:

                f.write(f'{text}\n')

    @staticmethod

    def word_replace(word):

        return word.replace(' & ', '_____').replace('/', '___').replace(', ', '__').replace(',', '__').replace(' ',

                                                                                                               '_').replace(

            '-', '____').replace('(', '______').replace(')', '______')

    def clac_inter_intimate(self, row, model, keywords):

        topic_internal_sim_sum = []

        for word1 in row:

            word1 = self.word_replace(word1)

            if word1 not in keywords:

                continue

            for word2 in row:

                word2 = self.word_replace(word2)

                if (word2 not in keywords) or (word1 == word2):

                    continue

                try:

                    topic_internal_sim_sum.append(model.wv.similarity(word1, word2))

                except KeyError:

                    continue

                # print(word1, word2, model.wv.similarity(word1, word2))

        return np.mean(topic_internal_sim_sum)

    def topic_intimate(self, model, topic_num=None):

        df = pd.read_csv(self.topic_word_csv_path, index_col=0)

        with open('data/vocab.txt', encoding='utf-8') as f:

            keywords = {word.strip() for word in f if word}

        topic_inter_intimate = np.mean(df.apply(self.clac_inter_intimate, axis=1, args=(model, keywords)))

        topic_exter_sim_sum = []

        for row1 in df.values.tolist():

            for row2 in df.values.tolist():

                if row1 == row2:

                    continue

                topic_exter_sim = []

                for word1 in row1:

                    word1 = self.word_replace(word1)

                    if word1 not in keywords:

                        continue

                    for word2 in row2:

                        word2 = self.word_replace(word2)

                        if word2 not in keywords:

                            continue

                        try:

                            topic_exter_sim.append(model.wv.similarity(word1, word2))

                        except KeyError as e:

                            continue

                topic_exter_sim_sum.append(np.mean(topic_exter_sim))

        # 主题间亲密度

        topic_exter_intimate = np.mean(topic_exter_sim_sum)

        # 主题亲密度 = （主题内亲密度 - 主题间亲密度） / 主题内亲密度

        topic_proximity = (topic_inter_intimate - topic_exter_intimate) / topic_inter_intimate

        print(topic_num, topic_inter_intimate, topic_exter_intimate, topic_proximity)

        return topic_num, topic_proximity

def file_rename(dir_path, start, end):

    for num in range(start, end):

        os.rename(f'res/2004-2018/{dir_path}/{num}/文档-主题.csv', f'res/2004-2018/{dir_path}/{num}/doc-topic.csv')

        # os.rename(f'res/2004-2018/{dir_path}/{num}/主题-词-30.csv', f'res/2004-2018/{dir_path}/{num}/topic-word-30.csv')

def plot_image(data, title, x_ticks, xlabel, ylabel, output_dir=None):

    """ 画图 """

    plt.figure(figsize=(12, 6), dpi=180)

    plt.plot(data.keys(), data.values(), '#007A99')

    plt.xticks(x_ticks)

    plt.xlabel(xlabel)

    plt.ylabel(ylabel)

    plt.title(title)

    if output_dir:

        plt.savefig(f'{output_dir}/{title}.png', bbox_inches='tight', pad_inches=0.1)

    plt.show()

def start_plot(start_year, end_year, data_path, doc_topic_matrix_path, res_dir, topic_num=None, min_topics=None,

               max_topics=None, space=1):

    """ 柱状图、折线图、heatmap图 """

    if min_topics and max_topics:

        for n_topics in range(min_topics, max_topics + 1):

            topic = TopicEvolution(data_path=data_path, doc_topic_matrix_path=doc_topic_matrix_path.format(n_topics))

            topic.topic_intensity_evolution(start_year=start_year, end_year=end_year, topic_num=n_topics,

                                            res_dir=res_dir, space=space)

    elif topic_num:

        topic = TopicEvolution(data_path=data_path, doc_topic_matrix_path=doc_topic_matrix_path)

        topic.topic_intensity_evolution(start_year=start_year, end_year=end_year, topic_num=topic_num, res_dir=res_dir,

                                        space=space)

def start_run(model_path, data_path, topic_word_csv_path, min_topics, max_topics, res_dir=None):

    """ 主题亲密度 """

    topic_proximity_dict = {}

    model = Word2Vec.load(model_path)

    for n_topics in range(min_topics, max_topics + 1):

        topic = TopicEvolution(data_path='data/data.xlsx', topic_word_csv_path=topic_word_csv_path.format(n_topics))

        proximity = topic.topic_intimate(topic_num=n_topics, model=model)

        topic_proximity_dict[n_topics] = proximity

    # plot_image(data=topic_proximity_dict, x_ticks=list(range(start, end+1)), title='topic_proximity', xlabel='topic num', ylabel='proximity', output_dir='res/2004-2018')

if __name__ == "__main__":

    topic = TopicEvolution(data_path='data/data.xlsx')

    start_plot(min_topics=5, max_topics=30, start_year=1993, end_year=2018, data_path='GLP1.xlsx',

               doc_topic_matrix_path='res/{}/doc_topic_matrix.txt', res_dir='res', space=5)

    start_run(model_path='model/word2vec.model', data_path='data/GLP1.xlsx',

              topic_word_csv_path='res/{}/topic-word-30.csv', min_topics=5, max_topics=6)

经验分享：我都写好了，直接拿去用吧！

Python常用功能函数系列总结（三）的更多相关文章

Python常用功能函数系列总结（一）
本节目录常用函数一:获取指定文件夹内所有文件常用函数二:文件合并常用函数三:将文件按时间划分常用函数四:数据去重写在前面写代码也有很长时间了,总觉得应该做点什么有价值的事情,写代码初始阶段 ...
Python常用功能函数系列总结（二）
本节目录常用函数一:sel文件转换常用函数二:refwork文件转换常用函数三:xml文档解析常用函数四:文本分词常用函数一:sel文件转换 sel是种特殊的文件格式,具体应用场景的话可以 ...
Python常用功能函数系列总结（六）
本节目录常用函数一:词云图常用函数二:关键词清洗常用函数三:中英文姓名转换常用函数四:去除文本中的HTML标签和文本清洗常用函数一:词云图 wordcloud # -*- coding: ...
Python常用功能函数系列总结（五）
本节目录常用函数一:向量距离和相似度计算常用函数二:pagerank 常用函数三:TF-IDF 常用函数四:关键词提取常用函数一:向量距离和相似度计算 KL距离.JS距离.余弦距离 # -*- ...
Python常用功能函数系列总结（四）之数据库操作
本节目录常用函数一:redis操作常用函数二:mongodb操作常用函数三:数据库连接池操作常用函数四:pandas连接数据库常用函数五:异步连接数据库常用函数一:redis操作 # -* ...
Python常用功能函数系列总结（七）
本节目录常用函数一:批量文件重命名常用函数一:批量文件重命名 # -*- coding: utf-8 -*- """ DateTime : 2021/02/08 10 ...
Python常用功能函数总结系列
Python常用功能函数系列总结(一) 常用函数一:获取指定文件夹内所有文件常用函数二:文件合并常用函数三:将文件按时间划分常用函数四:数据去重 Python常用功能函数系列总结(二) 常用函数 ...
Python常用功能函数
Python常用功能函数汇总 1.按行写字符串到文件中 import sys, os, time, json def saveContext(filename,*name): format = '^' ...
深入理解javascript函数系列第三篇——属性和方法
× 目录 [1]属性 [2]方法前面的话函数是javascript中的特殊的对象,可以拥有属性和方法,就像普通的对象拥有属性和方法一样.甚至可以用Function()构造函数来创建新的函数对象.本 ...

随机推荐

C/C++ Qt 数据库与Chart历史数据展示
在前面的博文中具体介绍了QChart组件是如何绘制各种通用的二维图形的,本章内容将继续延申一个新的知识点,通过数据库存储某一段时间节点数据的走向,当用户通过编辑框提交查询记录时,程序自动过滤出该时间节 ...
培训班输出的大量学员，会对IT行业产生哪些影响？
先说下会有哪些影响呢? 1 可能也就是些大城市的,规模比较大的,口碑比较好的培训学校输出的码农才能入行,而且能做长久.一些线上的所谓培训机构,或者小城市的培训学校,输出的能入行的码农,其实规模很有 ...
Redis版本历史
目录 Redis4.0 Redis3.2 Redis3.0 Redis2.8 Redis2.6 Redis4.0 可能出乎很多人的意料,Redis3.2之后的版本是4.0,而不是3.4.3.6.3.8 ...
[BUUCTF]REVERSE——[WUSTCTF2020]level3
[WUSTCTF2020]level3 附件步骤: 例行检查,64位程序,无壳 64位ida载入,找到关键函数看样子是个base64加密,但又感觉没那么简单,再翻翻左边的函数,找到了base64加 ...
Windows 任务计划部署 .Net 控制台程序
Windows 搜索:任务计划程序创建任务添加任务名称设置触发器:这里设置每10分钟执行一次保存之后显示此任务会从每天的 0:10:00 执行第一次后一直循环下去. 在操作选项卡下,选择启动 ...
Linux（centos7）设置docker服务开机自启动以及容器自启动
docker服务开机自启动 systemctl enable docker 设置容器自启动可以在运行的时候通过设置--restart 参数 docker run --restart always - ...
IDEA 修改之前保存的git地址的账号和密码
1.打开控制面板快捷键 win+R,然后输入control,打开控制面板 2.用户账户 3.管理windows凭据 4.点击里面的git就可以修改了
再谈多线程模型之生产者消费者（多生产者和单一消费者）（c++11实现）
0.关于为缩短篇幅,本系列记录如下: 再谈多线程模型之生产者消费者(基础概念)(c++11实现) 再谈多线程模型之生产者消费者(单一生产者和单一消费者)(c++11实现) 再谈多线程模型之生产者消费 ...
【LeetCode】611. Valid Triangle Number 解题报告（Python）
作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题目地址:https://leetcode.com/problems/valid-tri ...
[炼丹术]YOLOv5训练自定义数据集
YOLOv5训练自定义数据一.开始之前的准备工作克隆 repo 并在Python>=3.6.0环境中安装requirements.txt,包括PyTorch>=1.7.模型和数据集会从 ...