Preprocessing

Tokenizer

Source code: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L490-L519

some important functions and variables

__init__ # the constructor; this post uses its num_words and oov_token arguments
def fit_on_texts(self, texts) # texts can be a list of strings, a generator of strings, or a list of lists of strings
self.word_index # a dictionary that maps each word to a unique integer index
self.index_word # the reverse of word_index: maps each index back to its word
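The relationship between word_index and index_word can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: the helper name build_word_index is hypothetical, and the cleaning step here simply lowercases the text and replaces all ASCII punctuation with spaces, whereas the real Tokenizer is controlled by its own filtering arguments.

```python
from collections import OrderedDict
import string


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = OrderedDict()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    for text in texts:
        # Lowercase and replace punctuation with spaces (simplified filter).
        for word in text.lower().translate(table).split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Indices start at 1; 0 is left free so it can be used for padding.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


word_index = build_word_index(["i love my dog", "i love my cat", "you love my dog!"])
index_word = {i: w for w, i in word_index.items()}  # reverse mapping

print(word_index)     # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
print(index_word[1])  # love
```

Ranking by frequency means the smallest indices belong to the most common words, which is what makes a cutoff such as num_words meaningful.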

sample

  import tensorflow as tf
  from tensorflow import keras

  # Tokenizer turns text into integer tokens
  from tensorflow.keras.preprocessing.text import Tokenizer

  # Map each word to a number
  sentences = ['i love my dog', 'i love my cat', 'you love my dog!']

  # num_words=100: only the 99 most frequent words are used when encoding texts
  tokenizer = Tokenizer(num_words=100)
  tokenizer.fit_on_texts(sentences)

  word_index = tokenizer.word_index
  print(word_index)
  # result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Serialization

texts_to_sequences(self, texts) # transforms each text in texts into a sequence of integers
tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.) # pads or truncates the sequences so that they all have the same length
- source code: https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/preprocessing/sequence.py#L88-L154

sample

sentences = ['i love my dog', 'i love my cat', 'you love my dog!', 'do you think my dog is amazing']

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

'''
   The result is [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]].
   The last sentence has no encoding for 'do', 'think', 'is', or 'amazing', because those words did not appear in the texts the tokenizer was fitted on.
'''

To solve this problem, we can set an oov_token in the Tokenizer so that any word not seen during fitting is encoded with a dedicated out-of-vocabulary index instead of being dropped.

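tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

'''
    After refitting on the original three sentences and rerunning the code, we get
    [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
    Index 1 is now reserved for "<OOV>", so every other index shifts up by one.
'''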
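The effect of the OOV token can be illustrated without TensorFlow. The sketch below is a simplified stand-in for texts_to_sequences and not the library code: the function name texts_to_sequences_sketch is hypothetical, the cleaning step only handles the "!" used in these examples, and the word_index is written out by hand to be consistent with the OOV result above.

```python
def texts_to_sequences_sketch(texts, word_index, num_words=None, oov_token=None):
    """Hypothetical, simplified stand-in for Tokenizer.texts_to_sequences."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().replace("!", " ").split():
            i = word_index.get(word)
            known = i is not None and (num_words is None or i < num_words)
            if known:
                seq.append(i)
            elif oov_index is not None:
                seq.append(oov_index)  # unseen or too-rare word
            # With no OOV token, the word is silently dropped.
        sequences.append(seq)
    return sequences


# Hand-written mapping consistent with the OOV result above ("<OOV>" is index 1).
word_index = {"<OOV>": 1, "love": 2, "my": 3, "i": 4, "dog": 5, "cat": 6, "you": 7}
sentences = ["i love my dog", "do you think my dog is amazing"]

with_oov = texts_to_sequences_sketch(sentences, word_index, oov_token="<OOV>")
print(with_oov)  # [[4, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]

# Without an OOV token, the same unseen words disappear from the output.
no_oov_index = {w: i - 1 for w, i in word_index.items() if w != "<OOV>"}
without_oov = texts_to_sequences_sketch(sentences, no_oov_index)
print(without_oov)  # [[3, 1, 2, 4], [6, 2, 4]]
```

Keeping a placeholder for unknown words preserves the sentence length and word positions, which a model may rely on.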

However, the sequences have different lengths, which makes it difficult to train a neural network, so we need to make all sequences the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences,
                                 padding='post',     # pad on the right
                                 maxlen=5,           # maximum length of each sequence
                                 truncating='post')  # cut from the right
padded_sequences

'''
Then we get the result below. These indices come from a tokenizer with oov_token="<OOV>" fitted on all four sentences, which is why they differ from the sequences shown earlier.

array([[5, 3, 2, 4, 0],
       [5, 3, 2, 7, 0],
       [6, 3, 2, 4, 0],
       [8, 6, 9, 2, 4]])
'''
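The difference between 'pre' and 'post' for padding and truncating can be seen in a small list-based sketch. It only illustrates the semantics and is not the library implementation: the name pad_sequences_sketch is hypothetical, it returns plain lists rather than a NumPy array, and it requires maxlen to be given explicitly.

```python
def pad_sequences_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical list-based illustration of pad_sequences semantics."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate: 'pre' drops tokens from the front, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Pad: 'pre' puts the fill before the tokens, 'post' after them.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[4, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]

post = pad_sequences_sketch(sequences, maxlen=5, padding="post", truncating="post")
print(post)  # [[4, 2, 3, 5, 0], [1, 7, 1, 3, 5]]

pre = pad_sequences_sketch(sequences, maxlen=5)  # defaults: 'pre' / 'pre'
print(pre)   # [[0, 4, 2, 3, 5], [1, 3, 5, 1, 1]]
```

With 'post' truncation the end of a long sentence is lost, while 'pre' truncation loses its beginning, so the choice depends on which part of the text matters more for the task.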
