NLP (20): Using word vectors to visualize high-dimensional words in two-dimensional space

Original link: http://www.one2know.cn/nlp20/

Preparation

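The Alice in Wonderland dataset is used as the source of words. Each word is fed to a dense network that is trained to predict its neighboring words; the 2-unit middle layer of that network then gives every word a position in the plane, much like the bottleneck of an encoder-decoder architecture.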
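As a rough illustration of the word-extraction step, the sketch below builds (center word, neighbor) training pairs from every window of three consecutive tokens. The helper name `center_context_pairs` and the sample tokens are hypothetical and only serve the example; the full script does the same thing with `nltk.trigrams` on word indices, and emits the pairs in a different order.

```python
def center_context_pairs(tokens):
    """Return (center, neighbor) pairs from each window of three tokens."""
    pairs = []
    for left, center, right in zip(tokens, tokens[1:], tokens[2:]):
        pairs.append((center, left))    # center word predicts its left neighbor
        pairs.append((center, right))   # center word predicts its right neighbor
    return pairs

# Hypothetical, already preprocessed tokens.
sample = ["alice", "follow", "white", "rabbit"]
pairs = center_context_pairs(sample)
print(pairs)
```

The first element of each pair becomes the network input and the second becomes the target, so words that share neighbors are pushed toward similar positions in the 2-unit layer.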
Code

from __future__ import print_function

import random
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

def preprocessing(text):
    # Replace punctuation with spaces and collapse repeated whitespace.
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())
    # Tokenize sentence by sentence, then lowercase every token.
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    # Drop English stop words and tokens shorter than three characters.
    stopwds = stopwords.words('english')
    tokens = [token for token in tokens if token not in stopwds]
    tokens = [word for word in tokens if len(word)>=3]
    # Stem, then POS-tag the stems so the lemmatizer knows noun vs. verb.
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    tagged_corpus = pos_tag(tokens)
    Noun_tags = ['NN','NNP','NNPS','NNS']
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()

    def prat_lemmatize(token,tag):
        # Verbs are lemmatized as verbs; everything else is treated as a noun.
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')

    pre_proc_text =  " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
    return pre_proc_text

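lines = []
# Read the corpus line by line; text mode is enough on Python 3,
# so the old "rb" + decode/encode step is not needed.
with open("alice_in_wonderland.txt", "r") as fin:
    for line in fin:
        # Each raw line still carries its trailing newline, so strip it
        # first; otherwise the empty-line check below never fires.
        line = line.strip()
        if len(line) == 0:
            continue
        lines.append(preprocessing(line))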

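import collections

# Count how often each preprocessed word occurs.
counter = collections.Counter()
for line in lines:
    for word in nltk.word_tokenize(line):
        counter[word.lower()]+=1

# Map words to indices starting at 1 (most frequent first); index 0 stays unused.
word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
# Reverse lookup, used later to label the plotted points.
idx2word = {v:k for k,v in word2idx.items()}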

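# xs holds center-word indices (inputs), ys holds neighbor indices (targets).
xs = []
ys = []
for line in lines:
    # Despite its name, `embedding` is just the line as a list of word indices.
    embedding = [word2idx[w.lower()] for w in nltk.word_tokenize(line)]
    # Slide a window of three words over the line.
    triples = list(nltk.trigrams(embedding))
    w_lefts = [x[0] for x in triples]
    w_centers = [x[1] for x in triples]
    w_rights = [x[2] for x in triples]
    # Each center word is paired once with its left and once with its right neighbor.
    xs.extend(w_centers)
    ys.extend(w_lefts)
    xs.extend(w_centers)
    ys.extend(w_rights)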

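print(len(word2idx))

# Index 0 is unused, so the one-hot width is the vocabulary size plus one.
vocab_size = len(word2idx)+1

# `n_values` was removed from OneHotEncoder in scikit-learn 0.22; passing the
# categories explicitly keeps the same fixed one-hot width for X and Y.
ohe = OneHotEncoder(categories=[np.arange(vocab_size)])
# toarray() returns a plain ndarray instead of the np.matrix that todense() gives.
X = ohe.fit_transform(np.array(xs).reshape(-1, 1)).toarray()
Y = ohe.fit_transform(np.array(ys).reshape(-1, 1)).toarray()

# xs is split alongside X and Y so the test rows keep their word indices (xsts).
Xtrain, Xtest, Ytrain, Ytest,xstr,xsts = train_test_split(X, Y,xs, test_size=0.3,random_state=42)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)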

from keras.layers import Input,Dense,Dropout
from keras.models import Model

np.random.seed(1)
BATCH_SIZE = 128
NUM_EPOCHS = 1

# Encoder half: one-hot input -> 300 units -> 2-unit bottleneck.
input_layer = Input(shape = (Xtrain.shape[1],),name="input")
first_layer = Dense(300,activation='relu',name = "first")(input_layer)
first_dropout = Dropout(0.5,name="firstdout")(first_layer)
# The two activations of this layer are the 2D coordinates plotted later.
second_layer = Dense(2,activation='relu',name="second")(first_dropout)
# Decoder half: 2 units -> 300 units -> softmax over the vocabulary.
third_layer = Dense(300,activation='relu',name="third")(second_layer)
third_dropout = Dropout(0.5,name="thirdout")(third_layer)
fourth_layer = Dense(Ytrain.shape[1],activation='softmax',name = "fourth")(third_dropout)

# Note: `history` is the model itself here, not the object returned by fit().
history = Model(input_layer,fourth_layer)
history.compile(optimizer = "rmsprop",loss="categorical_crossentropy",metrics=["accuracy"])
history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,epochs=NUM_EPOCHS, verbose=1,validation_split = 0.2)

# Extracting Encoder section of the Model for prediction of latent variables
encoder = Model(history.input,history.get_layer("second").output)

# Predicting latent variables with extracted Encoder model
reduced_X = encoder.predict(Xtest)

# One row per test sample: its 2D coordinates, word index, and word.
final_pdframe = pd.DataFrame(reduced_X)
final_pdframe.columns = ["xaxis","yaxis"]
final_pdframe["word_indx"] = xsts
final_pdframe["word"] = final_pdframe["word_indx"].map(idx2word)

# Plot a random sample of 100 rows so the labels stay readable.
rows = random.sample(list(final_pdframe.index), 100)
vis_df = final_pdframe.loc[rows]
labels = list(vis_df["word"])
xvals = list(vis_df["xaxis"])
yvals = list(vis_df["yaxis"])

plt.figure(figsize=(10, 10))
for i, label in enumerate(labels):
    x = xvals[i]
    y = yvals[i]
    plt.scatter(x, y)
    plt.annotate(label,xy=(x, y),xytext=(5, 2),textcoords='offset points',ha='right',va='bottom')
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()

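Output: the plot does not come out two-dimensional. Why?! I have been staring at this for two days and still do not understand it.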
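One possible explanation, offered as an assumption rather than something verified against this run: the 2-unit `second` layer uses `relu`, which clamps every negative pre-activation to exactly zero. After a single epoch, one of the two units may well be negative for most inputs, in which case most points share the coordinate 0 on that axis and the scatter collapses onto a line. The NumPy sketch below uses random, hypothetical pre-activations (not values from the trained model) to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations of a 2-unit layer for 1000 inputs,
# shifted so that the second unit is negative for almost every input.
pre = rng.normal(loc=[0.5, -3.0], scale=1.0, size=(1000, 2))

relu_out = np.maximum(pre, 0.0)   # what a relu bottleneck would output
linear_out = pre                  # what a linear bottleneck would output

# Fraction of points whose second coordinate is exactly zero.
frac_flat_relu = float(np.mean(relu_out[:, 1] == 0.0))
frac_flat_linear = float(np.mean(linear_out[:, 1] == 0.0))
print(frac_flat_relu, frac_flat_linear)
```

If this is what is happening, two things worth trying are `activation='linear'` on the `second` layer and a larger `NUM_EPOCHS`; whether that fully fixes the plot would need to be checked by rerunning the script.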
