How jieba segmentation works: making the user dictionary take priority over the system dictionary

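Goal

Read through the source code of the jieba word segmentation library, work out what each module does, locate the segmentation module, and make it possible to supply a custom dictionary whose priority is higher than that of the built-in dictionary. Terms from the medical domain are used as the example.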

jieba on GitHub: https://github.com/fxsjy/jieba

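jieba's four segmentation modes

Accurate mode tries to cut the sentence as precisely as possible and is suited to text analysis.
- Each stretch of text is segmented exactly once, according to priority.
Full mode scans out every span in the sentence that can form a word. It is very fast but cannot resolve ambiguity.
- For example, 清华大学 is output as four words: 清华/ 清华大学/ 华大/ 大学.
Search engine mode starts from accurate mode and segments long words a second time.
- For example, 中国科学院计算所 is segmented as 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所.
Paddle mode uses the PaddlePaddle framework and is skipped for now.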
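
To make the full-mode behaviour concrete, here is a minimal sketch that lists every dictionary word found at every start position. It is a toy, not jieba's implementation: TOY_DICT, full_mode, and max_len are names made up for illustration, and the dictionary holds only the four words from the example above.

```python
# Toy dictionary (assumption); real jieba loads its entries from dict.txt.
TOY_DICT = {"清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary, max_len=4):
    """Return every multi-character dictionary word in `sentence`, left to right."""
    found = []
    for start in range(len(sentence)):
        # Try every candidate of length 2..max_len that starts at `start`.
        for end in range(start + 2, min(len(sentence), start + max_len) + 1):
            candidate = sentence[start:end]
            if candidate in dictionary:
                found.append(candidate)
    return found


print("/".join(full_mode("清华大学", TOY_DICT)))
```

The overlapping words 清华大学 and 华大 both appear in the output, which is why full mode cannot resolve ambiguity on its own.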

The task only requires each high-priority domain term to appear once in the output, so the rest of this post focuses on accurate mode.

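Approach 1

Add the terms that need to be segmented to a custom dictionary.

Developers can supply their own dictionary to cover words missing from the jieba vocabulary. jieba can recognize new words on its own, but adding them explicitly gives higher accuracy.

Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to the custom dictionary

The dictionary format is the same as dict.txt: one word per line, and each line has three parts separated by spaces, in this fixed order: the word, its frequency (optional), and its part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.

When the frequency is omitted, jieba uses an automatically calculated frequency that is meant to ensure the word can be cut out.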
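
The line format can be illustrated with a small parser. This is only a sketch under stated assumptions: LINE_RE and parse_userdict_line are hypothetical names, and the pattern is one plausible reading of "word, optional frequency, optional tag", not necessarily the exact expression jieba uses.

```python
import re

# Assumed pattern: a word, then an optional integer frequency, then an optional
# lowercase part-of-speech tag, each preceded by a single space.
LINE_RE = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$")


def parse_userdict_line(line):
    """Split one non-empty dictionary line into (word, freq, tag).

    Parts that are missing from the line come back as None.
    """
    word, freq, tag = LINE_RE.match(line.strip()).groups()
    return word, int(freq) if freq else None, tag.strip() if tag else None


for sample in ("畏寒 100 n", "HIV感染", "Q-T间期延长 883635"):
    print(parse_userdict_line(sample))
```

Entries such as 21-羟化酶缺陷症 or Q-T间期延长 contain digits and hyphens but no spaces, so they stay intact as a single word under this pattern.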

The steps are as follows.

Build the custom dictionary in that format. Part of the result:

21-羟化酶缺陷症
CO2潴留
E字征
HIV感染
Howship-Romberg征
Korsakov综合征
Moro反应迟钝
Q-T间期延长
畏寒

Call the method
Test on sample data

# encoding=utf-8
from __future__ import print_function, unicode_literals
import os
import sys
sys.path.append("../")  # make a jieba checkout in the parent directory importable
import jieba

# Load the user dictionary from the current working directory
jieba.load_userdict(os.path.join(os.getcwd(), "userdict.txt"))

test_sent = "无畏寒一过性心尖部收缩期杂音测试"
words = jieba.cut(test_sent)
print('/'.join(words))

Before adding the dictionary: 无畏寒一过性心尖部收缩期杂音测试 => 无畏/寒一/过性/心尖/部/收缩期/杂音/测试
After adding the dictionary: 无畏寒一过性心尖部收缩期杂音测试 => 无畏/寒/一过性心尖部收缩期杂音/测试

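Analysis

After the dictionary is added, jieba segments 一过性心尖部收缩期杂音 correctly, but 无畏寒 still comes out wrong (the intended result is 无/畏寒). A likely cause is that 无畏 has a fairly high frequency in the system dictionary, while the automatically calculated frequency for the user entry 畏寒 is comparatively low, even though it is meant to guarantee that the word can be cut out. The next step is therefore to raise the frequency in the user dictionary or lower it in the system dictionary. If that does not work, the source code needs a closer look.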

Further testing

jieba.add_word('畏寒', freq=10, tag=None) # set the frequency of 畏寒 to 10
Result is still: 无畏/寒/一过性心尖部收缩期杂音/测试
jieba.add_word('畏寒', freq=100, tag=None) # set the frequency of 畏寒 to 100
Result: 无/畏寒/一过性心尖部收缩期杂音/测试
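
One way to see why 10 was too low and 100 was enough is to compare the two candidate paths directly. A path's score is the product of its word frequencies divided by a power of the total, so when both paths contain the same number of words the total cancels and only the products matter. The sketch below computes the smallest frequency that lets the new word win. The frequencies in TOY_FREQ are hypothetical, picked so that the threshold falls between 10 and 100 as in the experiment; they are not the real values from dict.txt, and min_winning_freq is a name made up here.

```python
from math import prod

# Hypothetical frequencies (assumption), not taken from dict.txt.
TOY_FREQ = {"无畏": 300, "寒": 5000, "无": 50000}


def min_winning_freq(rival_words, partner_words, freq):
    """Smallest integer frequency for the new word so that its path wins.

    rival_words:   the competing segmentation, e.g. ["无畏", "寒"]
    partner_words: the words that accompany the new word, e.g. ["无"]
    Assumes both paths contain the same number of words, so `total` cancels.
    """
    rival = prod(freq[w] for w in rival_words)
    partner = prod(freq[w] for w in partner_words)
    # The new word needs partner * f > rival, i.e. f > rival / partner.
    return rival // partner + 1


print(min_winning_freq(["无畏", "寒"], ["无"], TOY_FREQ))
```

With these toy numbers the threshold is 31, so a frequency of 10 loses to 无畏/寒 and a frequency of 100 wins.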

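These results confirm that when a user dictionary entry omits its frequency, the frequency it receives is low relative to the system dictionary and the word is not segmented correctly. Giving the user entry an explicit frequency of 100 fixes this case, but there is no guarantee that 100 outranks every system word, so the next step is to look at the system frequencies. The source tree contains dict.txt, which lists the system frequency of every word; the script below finds the maximum.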

# encoding=utf-8
from __future__ import print_function, unicode_literals
import os

import pandas as pd

# Path to the system dictionary inside the jieba source tree
rpath = os.path.join(os.getcwd(), "jieba", "dict.txt")
# Read the dictionary: one "word frequency tag" entry per line
res = pd.read_csv(rpath, sep=' ', header=None, names=['name', 'frequence', 'type'])
print(res.head())
print(res['frequence'].max())
# Result: 883634, so a user entry with a frequency above 883634 should outrank every system entry

The code that builds the dictionary is as follows.

import os

import pandas as pd

path = os.getcwd()  # current working directory
print(path)
output_file = os.path.join(path, 'userdict.txt')

# Load the raw symptom list
res = pd.read_csv('zhengzhuang.txt', sep=' ', header=None, names=['name', 'type', 'frequence'])
print(res.head())

# Keep only the term itself
res = res.drop(labels=['type', 'frequence'], axis=1)
print(res.head())
# Assign a frequency just above the system maximum
res['frequence'] = 883635
# Write the result as a space-separated text file
res.to_csv(output_file, sep=' ', index=False, header=False)

After adding the symptom entries, a test gives the following.

患者1月前无明显诱因及前驱症状下出现腹泻，起初稀便，后为水样便，无恶心呕吐，每日2-3次，无呕血，无腹痛，无畏寒寒战，无低热盗汗，无心悸心慌，无大汗淋漓，否认里急后重感，否认蛋花样大便，当时未重视，未就诊。

患者/1/月前/无/明显/诱因/及/前驱/症状/下/出现/腹泻/，/起初/稀便/，/后/为/水样便/，/无/恶心/呕吐/，/每日/2/-/3/次/，/无/呕血/，/无/腹痛/，/无/畏寒/ 寒战/，/无/低热/盗汗/，/无/心悸/心慌/，/无/大汗淋漓/，/否认/里急后重/感/，/否认/蛋/花样/大便/，/当时/未/重视/，/未/就诊/。

Approach 2

Read the source code, starting from cut and following the internal calls step by step.

__init__.py
cut = dt.cut # cut is exposed as a module-level function
# Key method
def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        """
        # Check whether paddle is installed
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        # Decode to unicode (tries utf-8 first, then gbk)
        sentence = strdecode(sentence)
        # Paddle mode
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        
        re_han = re_han_default
        re_skip = re_skip_default
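        # Choose the segmentation function for the requested mode
        if cut_all:
            cut_block = self.__cut_all # full mode
        elif HMM:
            cut_block = self.__cut_DAG # accurate mode with HMM (the default)
        else:
            cut_block = self.__cut_DAG_NO_HMM # accurate mode without HMM
        
        blocks = re_han.split(sentence)  # split with the regex into blocks of Chinese/alphanumeric text and everything else
        print(blocks)  # debug output added while reading the code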
        for blk in blocks:
            if not blk:
                continue
            if re_han.match(blk): # block of Chinese/alphanumeric text: segment it
                for word in cut_block(blk):
                    yield word
            else:
                tmp = re_skip.split(blk) # split the remaining text on whitespace and line breaks
                for x in tmp:
                    if re_skip.match(x):
                        yield x
                    elif not cut_all:
                        for xx in x:
                            yield xx
                    else:
                        yield x
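

The block-splitting step at the start of cut can be tried in isolation. The sketch below uses a simplified stand-in for re_han_default, which is an assumption: RE_HAN here matches only CJK characters, ASCII letters, and digits, and split_blocks is a made-up helper name.

```python
import re

# Simplified stand-in for jieba's re_han_default (assumption): one capturing
# group that matches runs of CJK characters, ASCII letters, and digits.
RE_HAN = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9]+)")


def split_blocks(sentence):
    """Split a sentence into alternating text and punctuation blocks.

    Because the pattern has a capturing group, re.split keeps the matched
    runs in the result; empty strings are dropped, as cut skips them too.
    """
    return [blk for blk in RE_HAN.split(sentence) if blk]


print(split_blocks("无畏寒，无腹痛。"))
```

Only the blocks that match the pattern are handed to cut_block; punctuation blocks are yielded as they are.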

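As the code shows, the default segmentation function is __cut_DAG. The input string is first split by a regular expression into blocks (runs of Chinese and alphanumeric characters versus punctuation and whitespace), and cut_block then processes each block. The splitting itself relies on the following two functions.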

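    # Dynamic programming over the DAG: find the path with the maximum probability
    def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        # Score each candidate word by its log frequency
        logtotal = log(self.total) # total normalizes frequencies into probabilities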
        for idx in xrange(N - 1, -1, -1):
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

    # Build a directed acyclic graph (DAG) of every dictionary word that can start at each position
    def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ: # keep extending while the fragment is a key in FREQ
                if self.FREQ[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG
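

The two functions above can be exercised end to end on a tiny dictionary. The following is a standalone sketch, not jieba's code path: it omits the HMM step that __cut_DAG applies to leftover single characters, BASE_FREQ holds hypothetical frequencies rather than values from dict.txt, and get_dag and segment are illustrative names.

```python
from math import log

# Hypothetical frequency table (assumption), not taken from dict.txt.
BASE_FREQ = {"无": 50000, "畏": 100, "寒": 5000, "无畏": 300}


def get_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag]:  # entries with frequency 0 are prefixes, not words
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        dag[k] = ends or [k]
    return dag


def segment(sentence, freq):
    """Pick the maximum-probability path through the DAG, scanning right to left."""
    logtotal = log(sum(freq.values()))
    dag = get_dag(sentence, freq)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - logtotal + route[x + 1][0], x)
            for x in dag[idx]
        )
    # Follow the stored end indices from left to right to recover the words.
    words, idx = [], 0
    while idx < n:
        end = route[idx][1] + 1
        words.append(sentence[idx:end])
        idx = end
    return words


for user_freq in (10, 100):
    freq = {**BASE_FREQ, "畏寒": user_freq}
    print(user_freq, "/".join(segment("无畏寒", freq)))
```

With these toy numbers the output is 无畏/寒 for a frequency of 10 and 无/畏寒 for a frequency of 100, the same pattern as in the add_word experiment from Approach 1.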

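FREQ and total are therefore the key inputs to the computation, so seeing how they are initialized explains what the result is based on.

    self.FREQ, self.total = self.gen_pfdict(self.get_dict_file()) # words and their frequencies
    # The dictionary file comes from get_dict_file
    def get_dict_file(self):
        if self.dictionary == DEFAULT_DICT:
            return get_module_res(DEFAULT_DICT_NAME)
        else:
            return open(self.dictionary, 'rb')
    # The default is dict.txt in the package directory
    DEFAULT_DICT_NAME = "dict.txt"

In other words, the segmentation result is computed from the words and frequencies in dict.txt by building the DAG and running dynamic programming over its paths. Loading a user dictionary, as in Approach 1, simply adds new words and frequencies on top of this.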

    def load_userdict(self, f):
        '''
        Load personalized dict to improve detect rate.

        Parameter:
            - f : A plain text file contains words and their ocurrences.
                  Can be a file-like object, or the path of the dictionary file,
                  whose encoding must be utf-8.

        Structure of dict file:
        word1 freq1 word_type1
        word2 freq2 word_type2
        ...
        Word type may be ignored
        '''
        self.check_initialized()
        if isinstance(f, string_types):
            f_name = f
            f = open(f, 'rb')
        else:
            f_name = resolve_filename(f)
        for lineno, ln in enumerate(f, 1):
            line = ln.strip()
            if not isinstance(line, text_type):
                try:
                    line = line.decode('utf-8').lstrip('\ufeff')
                except UnicodeDecodeError:
                    raise ValueError('dictionary file %s must be utf-8' % f_name)
            if not line:
                continue
            # match won't be None because there's at least one character
            word, freq, tag = re_userdict.match(line).groups() # word, frequency, and tag from the user dictionary line
            if freq is not None:
                freq = freq.strip()
            if tag is not None:
                tag = tag.strip()
            self.add_word(word, freq, tag) # add the entry to the dictionary

    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.

        freq and tag can be omitted, freq defaults to be a calculated value
        that ensures the word can be cut out.
        """
        self.check_initialized()
        word = strdecode(word)
        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
        # Register the word and update the total
        self.FREQ[word] = freq
        self.total += freq
        if tag:
            self.user_word_tag_tab[word] = tag
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in self.FREQ:
                self.FREQ[wfrag] = 0
        if freq == 0:
            finalseg.add_force_split(word)
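

The bookkeeping in add_word can be mimicked on a plain dict to see what one call changes. This is a toy sketch under stated assumptions: add_word_toy is a made-up name, the starting frequencies are hypothetical, and tags and the forced-split case for freq == 0 are left out.

```python
def add_word_toy(freq, total, word, count):
    """Toy version of the update: set the count, grow the total, register prefixes."""
    freq[word] = count
    total += count
    # Proper prefixes get frequency 0 if unknown, so that a left-to-right scan
    # can keep extending a fragment until it reaches the full word.
    for i in range(1, len(word)):
        freq.setdefault(word[:i], 0)
    return total


freq = {"无": 50000, "寒": 5000}  # hypothetical starting table
total = sum(freq.values())
total = add_word_toy(freq, total, "畏寒", 883635)
print(freq)
print(total)
```

The prefix 畏 is stored with frequency 0, so it keeps the scan in get_DAG going without counting as a word itself. Since every added entry also increases total, a large dictionary of very high frequencies lowers the relative probability of all system words, which may be worth keeping in mind when choosing the value.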

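Conclusion

Build a user dictionary and give every entry a frequency greater than 883634, so that user entries outrank every entry in the system dictionary. Most of the work lies in constructing a suitable dictionary; after that, call load_userdict and then cut to get the adjusted segmentation. See Approach 1 for the steps. Topics for further study: 1. how large a user dictionary can be; 2. how dictionary size affects speed; 3. how to distinguish terms that share a prefix or suffix; 4. a comparison with the Baidu segmentation API.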