NLP (20): Visualizing High-Dimensional Words in Two-Dimensional Space with Word Vectors

Original link: http://www.one2know.cn/nlp20/

Preparation

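The Alice in Wonderland dataset can be used for word extraction, and a dense network can then be used to visualize its words. The approach resembles an encoder-decoder architecture.

Code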

from __future__ import print_function

import random
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# The NLTK tokenizer, stopword list, POS tagger and WordNet data must be
# installed beforehand via nltk.download (resource names vary by NLTK version).

def preprocessing(text):
    # Replace punctuation with spaces and collapse repeated whitespace
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())
    # Tokenize sentence by sentence, then lowercase
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    # Drop English stopwords and tokens shorter than three characters
    stopwds = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stopwds]
    tokens = [word for word in tokens if len(word)>=3]
    # Stem the tokens, then POS-tag the stems to choose noun or verb lemmatization
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    tagged_corpus = pos_tag(tokens)
    Noun_tags = ['NN','NNP','NNPS','NNS']
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()

    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            # Any other tag falls back to noun lemmatization
            return lemmatizer.lemmatize(token,'n')

    pre_proc_text = " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
    return pre_proc_text

lines = []
# Read the book line by line; strip each line first so that blank lines are actually skipped
with open("alice_in_wonderland.txt", "r") as fin:
    for line in fin:
        line = line.strip()
        if len(line) == 0:
            continue
        lines.append(preprocessing(line))

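import collections

# Count word frequencies over the preprocessed corpus
counter = collections.Counter()
for line in lines:
    for word in nltk.word_tokenize(line):
        counter[word.lower()]+=1

# Map words to integer indices starting at 1 (most frequent first), and back
word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
idx2word = {v:k for k,v in word2idx.items()}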
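
To make the index mapping above concrete, here is a minimal sketch on a toy corpus. The names `toy_lines`, `toy_counter`, `toy_word2idx` and `toy_idx2word` are hypothetical and not part of the original script; the toy lines stand in for already-preprocessed text.

```python
import collections

# Toy, already-preprocessed lines (hypothetical stand-ins for the book text)
toy_lines = ["alice rabbit hole", "rabbit watch alice", "alice fall hole"]

toy_counter = collections.Counter()
for line in toy_lines:
    for word in line.split():
        toy_counter[word] += 1

# Index 0 is left unused, so the most frequent word gets index 1
toy_word2idx = {w: i + 1 for i, (w, _) in enumerate(toy_counter.most_common())}
toy_idx2word = {v: k for k, v in toy_word2idx.items()}

print(toy_word2idx)
```

Because indices start at 1, the script later sets `vocab_size = len(word2idx)+1` so that the one-hot vectors have a slot for every index from 0 upward.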

# Build skip-gram style training pairs: each center word is paired with its left and right neighbor
xs = []
ys = []
for line in lines:
    embedding = [word2idx[w.lower()] for w in nltk.word_tokenize(line)]
    triples = list(nltk.trigrams(embedding))
    w_lefts = [x[0] for x in triples]
    w_centers = [x[1] for x in triples]
    w_rights = [x[2] for x in triples]
    xs.extend(w_centers)
    ys.extend(w_lefts)
    xs.extend(w_centers)
    ys.extend(w_rights)
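
The pairing done in the loop above can be sketched without NLTK. The helper `skipgram_pairs` and the list `toy_indices` are hypothetical names used only for illustration; the sketch emits the same (center, neighbor) pairs as the script, although in a different order.

```python
def skipgram_pairs(indices):
    """Return (center, context) pairs from consecutive triples of indices."""
    pairs = []
    # Zipping three shifted views of the list yields consecutive triples
    for left, center, right in zip(indices, indices[1:], indices[2:]):
        pairs.append((center, left))
        pairs.append((center, right))
    return pairs

toy_indices = [4, 7, 2, 9]
toy_pairs = skipgram_pairs(toy_indices)
print(toy_pairs)  # [(7, 4), (7, 2), (2, 7), (2, 9)]
```

Each center word therefore appears twice as an input, once per neighbor, which is why `xs` is extended with `w_centers` two times.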

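print (len(word2idx))
vocab_size = len(word2idx)+1

# One-hot encode inputs and targets over the full index range 0..vocab_size-1.
# The original call OneHotEncoder(n_values=vocab_size) relies on a parameter that
# was deprecated and later removed from scikit-learn; `categories` replaces it.
ohe = OneHotEncoder(categories=[np.arange(vocab_size)])
X = ohe.fit_transform(np.array(xs).reshape(-1, 1)).toarray()
Y = ohe.fit_transform(np.array(ys).reshape(-1, 1)).toarray()

# xs is split alongside X and Y so that test rows can be mapped back to words
Xtrain, Xtest, Ytrain, Ytest,xstr,xsts = train_test_split(X, Y,xs, test_size=0.3,random_state=42)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)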
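
For readers unfamiliar with one-hot encoding, the following is a small NumPy-only sketch of what the encoder produces. It is an illustration, not the script's method: the script uses scikit-learn's `OneHotEncoder`, and `toy_vocab_size`, `toy_xs` and `toy_X` are hypothetical names.

```python
import numpy as np

toy_vocab_size = 5                 # hypothetical: 4 words plus the unused index 0
toy_xs = np.array([1, 3, 4, 1])    # word indices, as collected in xs

# Row i of the identity matrix is the one-hot vector for index i
toy_X = np.eye(toy_vocab_size, dtype=np.float32)[toy_xs]
print(toy_X.shape)  # (4, 5)
```

Every row contains exactly one 1, located at the word's index, so the network input is as wide as the vocabulary.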

from keras.layers import Input,Dense,Dropout
from keras.models import Model

np.random.seed(1)
BATCH_SIZE = 128
NUM_EPOCHS = 1

# Encoder half: one-hot input -> 300 units -> 2-unit bottleneck ("second")
input_layer = Input(shape = (Xtrain.shape[1],),name="input")
first_layer = Dense(300,activation='relu',name = "first")(input_layer)
first_dropout = Dropout(0.5,name="firstdout")(first_layer)
second_layer = Dense(2,activation='relu',name="second")(first_dropout)

# Decoder half: 2-unit bottleneck -> 300 units -> softmax over the vocabulary
third_layer = Dense(300,activation='relu',name="third")(second_layer)
third_dropout = Dropout(0.5,name="thirdout")(third_layer)
fourth_layer = Dense(Ytrain.shape[1],activation='softmax',name = "fourth")(third_dropout)

# Despite its name, `history` holds the model itself, not a training history
history = Model(input_layer,fourth_layer)
history.compile(optimizer = "rmsprop",loss="categorical_crossentropy",metrics=["accuracy"])
history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,epochs=NUM_EPOCHS, verbose=1,validation_split = 0.2)

# Extracting Encoder section of the Model for prediction of latent variables
encoder = Model(history.input,history.get_layer("second").output)

# Predicting latent variables with extracted Encoder model
reduced_X = encoder.predict(Xtest)
final_pdframe = pd.DataFrame(reduced_X)
final_pdframe.columns = ["xaxis","yaxis"]
final_pdframe["word_indx"] = xsts
final_pdframe["word"] = final_pdframe["word_indx"].map(idx2word)

# Sample 100 test rows to keep the plot readable
rows = random.sample(list(final_pdframe.index), 100)
vis_df = final_pdframe.loc[rows]
labels = list(vis_df["word"])
xvals = list(vis_df["xaxis"])
yvals = list(vis_df["yaxis"])

plt.figure(figsize=(10, 10))
for i, label in enumerate(labels):
    x = xvals[i]
    y = yvals[i]
    plt.scatter(x, y)
    plt.annotate(label,xy=(x, y),xytext=(5, 2),textcoords='offset points',ha='right',va='bottom')
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()

Output: the plot does not look two-dimensional. Why? I have looked at it for two days and still do not understand.
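
One possible cause, offered as a guess and not a confirmed diagnosis: the 2-unit layer named "second" uses `activation='relu'`, which clamps every negative value to zero. Any word for which one of the two units goes negative then lands exactly on an axis, and a word for which both go negative lands on the origin, so the scatter plot can look like one or two straight lines. Training for a single epoch (`NUM_EPOCHS = 1`) may add to the effect. The sketch below uses random numbers in place of the real layer values (`pre_activation` is a hypothetical stand-in) to show how strongly ReLU alone collapses a 2-D cloud.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activation values of the 2-unit layer for 1000 inputs
pre_activation = rng.normal(size=(1000, 2))

# ReLU clamps negative values to zero
relu_output = np.maximum(pre_activation, 0.0)

# A point with at least one zero coordinate lies on an axis or at the origin
on_axis = (relu_output == 0.0).any(axis=1)
print("fraction on an axis or at the origin:", on_axis.mean())
```

If this is what is happening, switching the "second" layer to a linear activation or training for more epochs might spread the points out. That remains an untested assumption here.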
