Tokenization in BERT

simple_wxl 2024-09-14 08:15:07

Imported directly from my work notes. I work at a foreign company, so everything is written in English.

Chinese and English tokenization results

input: "我爱中国", tokens: ["我","爱","中","国"]

input: "I love china habih", tokens: ["I","love","china","ha","##bi","##h"] (here "##bi" and "##h" are both in the vocabulary)

Implementation

Both Chinese and English text pass through two tokenizers in sequence: basic_tokenizer first, then wordpiece_tokenizer on each token it produces.
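The chaining can be sketched as follows. This is a minimal illustration, not the library code: `full_tokenize`, `stub_basic` and `stub_wordpiece` are hypothetical names, and the two stubs are stand-ins that only reproduce the example above.

```python
from typing import Callable, List


def full_tokenize(
    text: str,
    basic_tokenize: Callable[[str], List[str]],
    wordpiece_tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Run the basic tokenizer, then split every resulting token into word pieces."""
    split_tokens: List[str] = []
    for token in basic_tokenize(text):
        split_tokens.extend(wordpiece_tokenize(token))
    return split_tokens


# Hypothetical stand-ins, only here to make the sketch runnable.
def stub_basic(text: str) -> List[str]:
    return text.split()


def stub_wordpiece(token: str) -> List[str]:
    # Hard-coded split for the one out-of-vocabulary word in the example.
    table = {"habih": ["ha", "##bi", "##h"]}
    return table.get(token, [token])


print(full_tokenize("I love china habih", stub_basic, stub_wordpiece))
```

The point is only the order of the two stages: word pieces are computed per token, never across token boundaries.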

basic_tokenizer

If the input is Chinese, _tokenize_chinese_chars adds whitespace around every Chinese character, and whitespace_tokenize then splits the text on whitespace. So the query "我爱中国" returns ["我","爱","中","国"], and the English query "I love china" returns ["I","love","china"].
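A minimal sketch of these two steps, assuming only the main CJK ideograph ranges matter. The function names are illustrative, and the other things the real basic tokenizer does (text cleaning, optional lowercasing, punctuation splitting) are left out.

```python
from typing import List


def is_cjk_char(ch: str) -> bool:
    """True for code points in a few CJK ideograph blocks (simplified range list)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF
        or 0x3400 <= cp <= 0x4DBF
        or 0x20000 <= cp <= 0x2A6DF
        or 0xF900 <= cp <= 0xFAFF
    )


def pad_cjk_chars(text: str) -> str:
    """Surround every CJK character with spaces so it becomes its own token."""
    return "".join(f" {ch} " if is_cjk_char(ch) else ch for ch in text)


def basic_tokenize(text: str) -> List[str]:
    """Pad CJK characters, then split on whitespace."""
    return pad_cjk_chars(text).split()


print(basic_tokenize("我爱中国"))
print(basic_tokenize("I love china"))
```

English words stay whole at this stage because only CJK characters get padded.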

wordpiece_tokenizer

If the input is Chinese, it iterates over the tokens produced by basic_tokenizer, which are single characters. A character that is in the vocabulary is kept as is; otherwise unk_token is appended.
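For single-character tokens the WordPiece step reduces to a vocabulary lookup. A small sketch with a toy vocabulary (`map_chars_to_vocab` and `toy_vocab` are hypothetical, not part of BERT):

```python
from typing import Iterable, List, Set


def map_chars_to_vocab(
    chars: Iterable[str], vocab: Set[str], unk_token: str = "[UNK]"
) -> List[str]:
    """Keep characters found in the vocabulary, replace the rest with unk_token."""
    return [ch if ch in vocab else unk_token for ch in chars]


# Toy vocabulary for illustration only.
toy_vocab = {"我", "爱", "中", "国"}

print(map_chars_to_vocab(["我", "爱", "中", "国"], toy_vocab))
print(map_chars_to_vocab(["我", "爱", "龘"], toy_vocab))
```

The second call shows a character missing from the toy vocabulary being replaced by "[UNK]".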

If the input is English, each word is processed on its own. Take the word "shabi": the whole word is not in the vocabulary, so the end index moves back one character at a time until a prefix that is in the vocabulary is found, here "sh".

The search then restarts from the character after that prefix. Every piece after the first has "##" prepended before the vocabulary lookup, and each match is appended to the output tokens, so the final result is ["sh","##ab","##i"].
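The greedy longest-match-first search can be sketched like this. It is a simplified version under stated assumptions: `toy_vocab` is made up for the "shabi" example, the maximum word length check is omitted, and a word with any unmatched remainder is returned as a single unknown token.

```python
from typing import List, Optional, Set


def wordpiece_tokenize(
    word: str, vocab: Set[str], unk_token: str = "[UNK]"
) -> List[str]:
    """Split one word into word pieces by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece starting at this position is known: give up on the whole word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"sh", "##ab", "##i", "love"}

print(wordpiece_tokenize("shabi", toy_vocab))
print(wordpiece_tokenize("love", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because the match is greedy, a longer piece always wins over a shorter one that starts at the same position.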

#65 How are out of vocabulary words handled for Chinese?

The top 8000 characters are character-tokenized, other characters are mapped to [UNK]. I should've commented this section better but it's here

Basically that's saying if it tries to apply WordPiece tokenization (the character tokenization happens previously), and it gets to a single character that it can't find, it maps it to unk_token.

#62 Why Chinese vocab contains ##word?

This is the character used to denote WordPieces, it's just an artifact of the WordPiece vocabulary generator that we use, but most of those words were never actually used during training (for Chinese). So you can just ignore those tokens. Note that for the English characters that appear in Chinese text they are actually used.
