1. Corpus download: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (Chinese Wikipedia dump)

2. Corpus preprocessing

(1)Extract text from the dump

The downloaded dump cannot be used directly; the plain text must first be extracted from it.

Install the required Python libraries:

pip install numpy
pip install scipy
pip install gensim
Python code (process_wiki.py):
'''
Description: Extract plain text from the Chinese Wikipedia dump
Author: zhangyh
Date: 2024-05-09 21:31:22
LastEditTime: 2024-05-09 22:10:16
LastEditors: zhangyh
'''
import logging
import os.path
import sys
import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Usage: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki_zh.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = " "
    i = 0
    output = open(outp, 'w', encoding='utf-8')
    # dictionary={} skips building a gensim Dictionary, which is not needed here
    wiki = WikiCorpus(inp, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished; saved " + str(i) + " articles")

Run the script to extract the text:

PS C:\Users\zhang\Desktop\nlp 自然语言处理\data> python .\process_wiki.py .\zhwiki-latest-pages-articles.xml.bz2 wiki_zh.text
2024-05-09 21:43:10,036: INFO: running .\process_wiki.py .\zhwiki-latest-pages-articles.xml.bz2 wiki_zh.text
2024-05-09 21:44:02,944: INFO: Saved 10000 articles
2024-05-09 21:44:51,875: INFO: Saved 20000 articles
...
2024-05-09 22:22:34,244: INFO: Saved 460000 articles
2024-05-09 22:23:33,323: INFO: Saved 470000 articles

The extracted text still contains traditional Chinese characters, so the next step converts it to simplified Chinese.

(2)Convert traditional Chinese to simplified Chinese

opencc -i wiki_zh.text -o wiki_sample_chinese.text -c "C:\Program Files\OpenCC\build\share\opencc\t2s.json"
  • After conversion, the text is entirely simplified Chinese; an equivalent Python version of the conversion is sketched below.
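The same conversion can also be done from Python instead of the OpenCC command line. The sketch below is only an alternative, and assumes the opencc Python package (for example opencc-python-reimplemented), whose OpenCC('t2s') object mirrors the t2s.json configuration used above:

# Hedged sketch: traditional-to-simplified conversion with the opencc Python package
from opencc import OpenCC

cc = OpenCC('t2s')  # traditional Chinese -> simplified Chinese

with open('wiki_zh.text', 'r', encoding='utf-8') as fin, \
     open('wiki_sample_chinese.text', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))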

(3)Word segmentation (with jieba)

  • Segmentation code:
'''
Description: Segment the simplified-Chinese corpus with jieba
Author: zhangyh
Date: 2024-05-10 22:48:45
LastEditTime: 2024-05-10 23:02:57
LastEditors: zhangyh
'''
# segment every article into space-separated words
import codecs
import os
import sys

import jieba

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

f = codecs.open('data\\wiki_sample_chinese.text', 'r', encoding="utf8")
target = codecs.open("data\\wiki_word_cutted_result.text", 'w', encoding="utf8")

line_num = 1
line = f.readline()
while line:
    print('---- processing', line_num, 'article----------------')
    line_seg = " ".join(jieba.cut(line))
    target.writelines(line_seg)
    line_num = line_num + 1
    line = f.readline()

f.close()
target.close()
  • The segmented result is one line of space-separated words per article; a quick way to check the output format is sketched below.
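A minimal sanity check of what jieba produces on a single sentence (default dictionary, no custom user words; the sample sentence is arbitrary):

# Hedged sketch: inspect jieba's output format on one sample sentence
import jieba

sample = "数学是研究数量结构以及空间等概念的一门学科"
print(" ".join(jieba.cut(sample)))
# expected form: space-separated tokens such as "数学 是 研究 数量 结构 ..."
# (the exact segmentation depends on the jieba version and dictionary)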

3. Model training

(1)Skip-gram model

'''
Description: Hand-written skip-gram model trained on the segmented corpus
Author: zhangyh
Date: 2024-05-12 21:51:03
LastEditTime: 2024-05-16 11:08:59
LastEditors: zhangyh
'''
import os
import pickle
import sys

import numpy as np
from tqdm import tqdm

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


def load_stop_words(file="作业-skipgram\\stopwords.txt"):
    with open(file, "r", encoding="utf-8") as f:
        return f.read().split("\n")


def load_cutted_data(num_lines: int):
    # read the first num_lines segmented articles and drop stop words
    stop_words = load_stop_words()
    data = []
    with open('作业-skipgram\\wiki_word_cutted_result.text', mode='r', encoding='utf-8') as file:
        for line in tqdm(file.readlines()[:num_lines]):
            words_list = line.split()
            words_list = [word for word in words_list if word not in stop_words]
            data += words_list
    data = list(set(data))
    return data


def get_dict(data):
    # build word <-> index mappings and one-hot vectors
    index_2_word = []
    word_2_index = {}
    for word in tqdm(data):
        if word not in word_2_index:
            index = len(index_2_word)
            word_2_index[word] = index
            index_2_word.append(word)
    word_2_onehot = {}
    word_size = len(word_2_index)
    for word, index in tqdm(word_2_index.items()):
        one_hot = np.zeros((1, word_size))
        one_hot[0, index] = 1
        word_2_onehot[word] = one_hot
    return word_2_index, index_2_word, word_2_onehot


def softmax(x):
    ex = np.exp(x)
    return ex / np.sum(ex, axis=1, keepdims=True)


if __name__ == "__main__":
    batch_size = 562      # batch size
    data = load_cutted_data(5)
    word_2_index, index_2_word, word_2_onehot = get_dict(data)
    word_size = len(word_2_index)
    embedding_num = 100
    lr = 0.01
    epochs = 200
    n_gram = 3            # context window radius

    batches = [data[j:j + batch_size] for j in range(0, len(data), batch_size)]

    w1 = np.random.normal(-1, 1, size=(word_size, embedding_num))
    w2 = np.random.normal(-1, 1, size=(embedding_num, word_size))

    for epoch in range(epochs):
        print(f'-------- epoch {epoch + 1} --------')
        for batch in tqdm(batches):
            for i in tqdm(range(len(batch))):
                now_word = batch[i]
                now_word_onehot = word_2_onehot[now_word]
                other_words = batch[max(0, i - n_gram): i] + batch[i + 1: min(len(batch), i + n_gram + 1)]
                for other_word in other_words:
                    other_word_onehot = word_2_onehot[other_word]
                    # forward pass
                    hidden = now_word_onehot @ w1
                    p = hidden @ w2
                    pre = softmax(p)
                    # backward pass: for A @ B = C with upstream gradient G,
                    # delta_A = G @ B.T and delta_B = A.T @ G
                    G2 = pre - other_word_onehot
                    delta_w2 = hidden.T @ G2
                    G1 = G2 @ w2.T
                    delta_w1 = now_word_onehot.T @ G1
                    w1 -= lr * delta_w1
                    w2 -= lr * delta_w2

    with open("作业-skipgram\\word2vec_skipgram.pkl", "wb") as f:
        pickle.dump([w1, word_2_index, index_2_word, w2], f)
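The backward-pass comments above compress a short derivation. With a one-hot input row vector x, hidden layer h = x W1, logits p = h W2, prediction ŷ = softmax(p) and one-hot target y, the gradient of the cross-entropy loss with respect to the logits is simply ŷ - y (G2 in the code), and the matrix-product rule quoted in the comments then gives both weight gradients:

\[
L = -\sum_k y_k \log \hat{y}_k,\qquad \hat{y} = \mathrm{softmax}(p),\qquad \frac{\partial L}{\partial p} = \hat{y} - y = G_2
\]
\[
\frac{\partial L}{\partial W_2} = h^{\top} G_2,\qquad G_1 = G_2 W_2^{\top},\qquad \frac{\partial L}{\partial W_1} = x^{\top} G_1
\]

These correspond exactly to delta_w2 and delta_w1 in the training loop.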

  

(2)CBOW model

'''
Description: Hand-written CBOW model trained on the segmented corpus
Author: zhangyh
Date: 2024-05-13 20:47:57
LastEditTime: 2024-05-16 09:21:40
LastEditors: zhangyh
'''
import os
import pickle
import sys

import numpy as np
from tqdm import tqdm

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


def load_stop_words(file="stopwords.txt"):
    with open(file, "r", encoding="utf-8") as f:
        return f.read().split("\n")


def load_cutted_data(num_lines: int):
    stop_words = load_stop_words()
    data = []
    with open('wiki_word_cutted_result.text', mode='r', encoding='utf-8') as file:
        for line in tqdm(file.readlines()[:num_lines]):
            words_list = line.split()
            words_list = [word for word in words_list if word not in stop_words]
            data += words_list
    data = list(set(data))
    return data


def get_dict(data):
    index_2_word = []
    word_2_index = {}
    for word in tqdm(data):
        if word not in word_2_index:
            index = len(index_2_word)
            word_2_index[word] = index
            index_2_word.append(word)
    word_2_onehot = {}
    word_size = len(word_2_index)
    for word, index in tqdm(word_2_index.items()):
        one_hot = np.zeros((1, word_size))
        one_hot[0, index] = 1
        word_2_onehot[word] = one_hot
    return word_2_index, index_2_word, word_2_onehot


def softmax(x):
    ex = np.exp(x)
    return ex / np.sum(ex, axis=1, keepdims=True)


if __name__ == "__main__":
    batch_size = 562
    data = load_cutted_data(5)
    word_2_index, index_2_word, word_2_onehot = get_dict(data)
    word_size = len(word_2_index)
    embedding_num = 100
    lr = 0.01
    epochs = 200
    context_window = 3

    batches = [data[j:j + batch_size] for j in range(0, len(data), batch_size)]

    w1 = np.random.normal(-1, 1, size=(word_size, embedding_num))
    w2 = np.random.normal(-1, 1, size=(embedding_num, word_size))

    for epoch in range(epochs):
        print(f'-------- epoch {epoch + 1} --------')
        for batch in tqdm(batches):
            for i in tqdm(range(len(batch))):
                target_word = batch[i]
                context_words = batch[max(0, i - context_window): i] + batch[i + 1: min(len(batch), i + context_window + 1)]
                # the averaged one-hot vectors of the context words are the input
                context_vectors = np.mean([word_2_onehot[word] for word in context_words], axis=0)
                # forward pass
                hidden = context_vectors @ w1
                p = hidden @ w2
                pre = softmax(p)
                # cross-entropy loss (not accumulated here):
                # loss = -np.log(pre[0, word_2_index[target_word]])
                # backward pass
                G2 = pre - word_2_onehot[target_word]
                delta_w2 = hidden.T @ G2
                G1 = G2 @ w2.T
                delta_w1 = context_vectors.T @ G1
                w1 -= lr * delta_w1
                w2 -= lr * delta_w2

    with open("word2vec_cbow.pkl", "wb") as f:
        pickle.dump([w1, word_2_index, index_2_word, w2], f)
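As a sanity check on both hand-written models, the same segmented file can be fed to gensim's Word2Vec, which implements skip-gram and CBOW with negative sampling. The sketch below is only an illustrative baseline, not part of the assignment code: it assumes gensim 4.x (where the embedding-size parameter is vector_size; gensim 3.x calls it size) and the file name produced in step 2(3).

# Hedged sketch: gensim baseline on the same segmented corpus (assumes gensim 4.x)
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('wiki_word_cutted_result.text')  # one space-separated article per line
model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality, matching embedding_num above
    window=3,          # matching n_gram / context_window above
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling instead of the full softmax used above
    min_count=5,
    workers=4,
)
print(model.wv.most_similar('学院', topn=10))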

  

4. Training results

(1)Cosine-similarity queries

'''
Description: Query the nearest neighbours of a word by cosine similarity
Author: zhangyh
Date: 2024-05-13 20:12:56
LastEditTime: 2024-05-16 21:16:19
LastEditors: zhangyh
'''
import pickle

import numpy as np

# each row of w1 is a learned word vector
w1, voc_index, index_voc, w2 = pickle.load(open('作业-CBOW\\word2vec_cbow.pkl', 'rb'))


def word_voc(word):
    return w1[voc_index[word]]


def voc_sim(word, top_n):
    # rank every vocabulary word by cosine similarity to the query word
    v_w1 = word_voc(word)
    word_sim = {}
    for i in range(len(voc_index)):
        v_w2 = w1[i]
        theta_sum = np.dot(v_w1, v_w2)
        theta_den = np.linalg.norm(v_w1) * np.linalg.norm(v_w2)
        theta = theta_sum / theta_den
        word_sim[index_voc[i]] = theta
    words_sorted = sorted(word_sim.items(), key=lambda kv: kv[1], reverse=True)
    for word, sim in words_sorted[:top_n]:
        print(f'word: {word}, similarity: {sim}')


voc_sim('学院', 20)
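The loop above computes one dot product per vocabulary word in pure Python; the same ranking can be done in a single vectorized step. A minimal sketch, assuming the w1 / voc_index / index_voc objects loaded above:

# Hedged sketch: vectorized top-N cosine similarity over the whole vocabulary
def voc_sim_fast(word, top_n):
    v = w1[voc_index[word]]                              # query vector
    norms = np.linalg.norm(w1, axis=1) * np.linalg.norm(v)
    sims = (w1 @ v) / np.maximum(norms, 1e-12)           # cosine similarity to every word
    best = np.argsort(-sims)[:top_n]                     # indices of the top_n most similar words
    return [(index_voc[i], float(sims[i])) for i in best]

# example: print(voc_sim_fast('学院', 20))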

  

(2)Visualization

'''
Description:
Author: zhangyh
Date: 2024-05-16 21:41:33
LastEditTime: 2024-05-17 23:50:07
LastEditors: zhangyh
'''
import pickle

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

plt.rcParams['font.family'] = ['Microsoft YaHei', 'SimHei', 'sans-serif']

# Load trained word embeddings
with open("word2vec_cbow.pkl", "rb") as f:
    w1, word_2_index, index_2_word, w2 = pickle.load(f)

# Select specific words for visualization
visual_words = ['研究', '电脑', '雅典', '数学', '数学家', '学院', '函数', '定理', '实数', '复数']

# Get the word vectors corresponding to the selected words
subset_vectors = np.array([w1[word_2_index[word]] for word in visual_words])

# Perform PCA for dimensionality reduction
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(subset_vectors)

# Visualization
plt.figure(figsize=(10, 8))
plt.scatter(reduced_vectors[:, 0], reduced_vectors[:, 1], marker='o')
for i, word in enumerate(visual_words):
    plt.annotate(word, xy=(reduced_vectors[i, 0], reduced_vectors[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.title('Word Embeddings Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

(3)Analogy experiment (e.g. 王子 - 男 + 女 = 公主, i.e. prince - man + woman = princess)

'''
Description: Word-analogy experiment with the trained CBOW vectors
Author: zhangyh
Date: 2024-05-16 23:13:21
LastEditTime: 2024-05-19 11:51:53
LastEditors: zhangyh
'''
import pickle

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# load the trained word vectors
with open("word2vec_cbow.pkl", "rb") as f:
    w1, word_2_index, index_2_word, w2 = pickle.load(f)

# build the analogy vector: 王子 - 男 + 女
v_prince = w1[word_2_index["王子"]]
v_man = w1[word_2_index["男"]]
v_woman = w1[word_2_index["女"]]
v_princess = v_prince - v_man + v_woman

# find the closest word vector
similarities = cosine_similarity(v_princess.reshape(1, -1), w1)
most_similar_index = np.argmax(similarities)
most_similar_word = index_2_word[most_similar_index]

print("结果:", most_similar_word)
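One caveat: with a plain argmax over the whole vocabulary, the nearest vector to 王子 - 男 + 女 is frequently 王子 itself, so analogy lookups usually exclude the query words. A minimal sketch of that variant, reusing the objects loaded above (the analogy helper itself is not part of the original script):

# Hedged sketch: rank analogy candidates while excluding the query words themselves
def analogy(a, b, c, top_n=5):
    # returns the top_n words closest to vec(a) - vec(b) + vec(c), skipping a, b and c
    target = w1[word_2_index[a]] - w1[word_2_index[b]] + w1[word_2_index[c]]
    sims = cosine_similarity(target.reshape(1, -1), w1)[0]
    order = np.argsort(-sims)
    results = [(index_2_word[i], float(sims[i]))
               for i in order if index_2_word[i] not in {a, b, c}]
    return results[:top_n]

# example: print(analogy("王子", "男", "女"))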

  
