Python implementation of FMM and BMM

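This post implements FMM (forward maximum matching) and BMM (backward maximum matching) word segmentation. Both algorithms are simple. FMM takes a candidate of the maximum word length from the front of the sentence and looks it up in the dictionary. If the candidate is not found, its last character is dropped and the lookup is repeated, until the candidate is either found in the dictionary or reduced to a single character. The resulting piece is cut off the sentence and the same steps are repeated on the remainder. BMM takes its candidate from the end of the sentence instead and drops the first character on a miss; the rest of the procedure is the same.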
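The two procedures described above can be sketched as a pair of small functions. This is a minimal illustration and not the scripts from this post: the function names `fmm` and `bmm`, the `max_len` parameter and the four-word `toy_dict` are assumptions made up for the example, and unlike the scripts further down it does not skip spaces.

```python
def fmm(sentence, dictionary, max_len):
    """Forward maximum matching: scan left to right, longest candidate first."""
    result = []
    while sentence:
        # Try the longest prefix first, then shrink it from the right.
        for size in range(min(max_len, len(sentence)), 0, -1):
            piece = sentence[:size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                result.append(piece)
                sentence = sentence[size:]
                break
    return result


def bmm(sentence, dictionary, max_len):
    """Backward maximum matching: scan right to left, longest candidate first."""
    result = []
    while sentence:
        # Try the longest suffix first, then shrink it from the left.
        for size in range(min(max_len, len(sentence)), 0, -1):
            piece = sentence[-size:]
            if size == 1 or piece in dictionary:
                result.append(piece)
                sentence = sentence[:-size]
                break
    # Pieces were collected back to front, so restore reading order.
    result.reverse()
    return result


# Hypothetical toy dictionary, chosen so the two directions disagree.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(fmm("研究生命起源", toy_dict, 3))  # ['研究生', '命', '起源']
print(bmm("研究生命起源", toy_dict, 3))  # ['研究', '生命', '起源']
```

With this toy dictionary the forward scan greedily consumes 研究生 and strands 命, while the backward scan recovers 研究/生命/起源, which shows why the two directions can produce different segmentations of the same sentence.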

readCorpus.py

output = {}

with open('语料库.txt', mode='r', encoding='UTF-8') as f:
    for line in f:
        # Strip the trailing newline
        t_line = line.strip('\n')
        # Split the line into tokens on spaces
        words = t_line.split(' ')
        for word in words:
            # Separate the word from its tag at '/'
            t_word = word.split('/')
            # Drop a leading '[' that opens a bracketed compound
            tf_word = t_word[0].split('[')
            if len(tf_word) == 2:
                f_word = tf_word[1]
            else:
                f_word = t_word[0]
            # Skip empty tokens produced by consecutive spaces
            if not f_word:
                continue
            # Increment the count, creating the entry on first sight
            output[f_word] = output.get(f_word, 0) + 1
        # Coarse-grained words: join the words inside each [...] group
        big_word1 = t_line.split('[')
        for i in range(1, len(big_word1)):
            big_word2 = big_word1[i].split(']')[0]
            words = big_word2.split(' ')
            big_word = ""
            for word in words:
                # Separate the word from its tag at '/'
                t_word = word.split('/')
                big_word = big_word + t_word[0]
            # Increment the count, creating the entry on first sight
            output[big_word] = output.get(big_word, 0) + 1

# Write "word: count" lines in ascending order of count
with open('output.txt', mode='w', encoding='UTF-8') as f:
    for name, num in sorted(output.items(), key=lambda item: item[1]):
        f.write(name + ": " + str(num) + "\n")
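To make the parsing step easier to follow, here is a small sketch that counts the words of a single tagged line with `collections.Counter`. The helper name `count_words` and the sample line are hypothetical; the line only assumes the format the script above parses, namely `word/tag` tokens separated by spaces with compounds wrapped in `[...]`.

```python
from collections import Counter


def count_words(line, counts):
    """Add the words and bracketed compounds of one tagged line to counts."""
    line = line.strip()
    # Fine-grained words: drop the tag after '/' and any leading '['.
    for token in line.split():
        word = token.split('/')[0].lstrip('[')
        if word:
            counts[word] += 1
    # Coarse-grained words: join the words inside each [...] group.
    for chunk in line.split('[')[1:]:
        inside = chunk.split(']')[0]
        compound = ''.join(tok.split('/')[0] for tok in inside.split())
        if compound:
            counts[compound] += 1


# Made-up sample line in the assumed corpus format.
counts = Counter()
count_words("[香港/ns 科技/n 大学/n]nt 位于/v 香港/ns", counts)
for word, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{word}: {n}")
```

The compound 香港科技大学 ends up in the counts next to its three parts, which is what later allows the segmenters to match it as one word.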

BMM.py

MAX_WORD = 19
# A set makes each dictionary lookup constant-time instead of a list scan
word_list = set()

with open('output.txt', mode='r', encoding='UTF-8') as f:
    for line in f:
        # Each line is "word: count"; split on the last ': ' to keep the word intact
        word = line.rsplit(': ', 1)
        word_list.add(word[0])

while True:
    ans_word = []
    try:
        origin_sentence = input("Input:\n")
        while len(origin_sentence) != 0:
            # Never try a candidate longer than what is left of the sentence
            len_word = min(MAX_WORD, len(origin_sentence))
            while len_word > 0:
                # Take the last len_word characters; if they form a dictionary
                # word, store it and cut it off the sentence
                if origin_sentence[-len_word:] in word_list:
                    ans_word.append(origin_sentence[-len_word:])
                    origin_sentence = origin_sentence[:-len_word]
                    break
                # Not in the dictionary: shorten the candidate by one character
                else:
                    len_word = len_word - 1
            # No match at any length: store the last character on its own
            if len_word == 0:
                if origin_sentence[-1:] != ' ':
                    ans_word.append(origin_sentence[-1:])
                origin_sentence = origin_sentence[:-1]
        # Words were collected back to front, so print them in reverse
        for j in range(len(ans_word)-1, -1, -1):
            print(ans_word[j] + '/', end='')
        print('\n')
    except (KeyboardInterrupt, EOFError):
        break

FMM.py

MAX_WORD = 19
# A set makes each dictionary lookup constant-time instead of a list scan
word_list = set()

with open('output.txt', mode='r', encoding='UTF-8') as f:
    for line in f:
        # Each line is "word: count"; split on the last ': ' to keep the word intact
        word = line.rsplit(': ', 1)
        word_list.add(word[0])

while True:
    try:
        origin_sentence = input("Input:\n")
        while len(origin_sentence) != 0:
            # Never try a candidate longer than what is left of the sentence
            len_word = min(MAX_WORD, len(origin_sentence))
            while len_word > 0:
                # Take the first len_word characters; if they form a dictionary
                # word, print it and cut it off the sentence
                if origin_sentence[0:len_word] in word_list:
                    print(origin_sentence[0:len_word]+'/', end='')
                    origin_sentence = origin_sentence[len_word:]
                    break
                # Not in the dictionary: shorten the candidate by one character
                else:
                    len_word = len_word - 1
            # No match at any length: print the first character on its own
            if len_word == 0:
                if origin_sentence[0] != ' ':
                    print(origin_sentence[0]+'/', end='')
                origin_sentence = origin_sentence[1:]
        print('\n')
    except (KeyboardInterrupt, EOFError):
        break
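Both scripts hard-code `MAX_WORD = 19` and load the dictionary in the same way. As a possible variation, the loading step could also derive the maximum word length from the dictionary itself. The sketch below is an assumption, not part of the original scripts: `load_dictionary` is a hypothetical helper, and the `io.StringIO` sample stands in for `output.txt` with made-up contents.

```python
import io


def load_dictionary(lines):
    """Return (set of words, length of the longest word) from 'word: count' lines."""
    words = set()
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        # Split on the last ': ' so a word that itself contains ':' survives.
        word = line.rsplit(': ', 1)[0]
        if word:
            words.add(word)
    # Fall back to 1 so an empty dictionary still allows single characters.
    max_len = max((len(w) for w in words), default=1)
    return words, max_len


# Stand-in for output.txt; the contents are made up for illustration.
sample = io.StringIO("香港: 2\n大学: 1\n香港科技大学: 1\n")
words, max_len = load_dictionary(sample)
print(max_len)                  # 6
print("香港科技大学" in words)   # True
```

Deriving the limit this way avoids trying candidate lengths that no dictionary entry can have, at the cost of one extra pass over the words.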

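Results

BMM.py (without coarse-grained words)

BMM.py (with coarse-grained words)

FMM.py (without coarse-grained words)

FMM.py (with coarse-grained words)

With coarse-grained words in the dictionary, semantically rich compounds such as 香港科技大学 and 北京航空航天大学 are kept whole instead of being split apart, which is closer to what word segmentation is expected to produce.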
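The effect can be reproduced on a tiny scale. The sketch below is illustrative only: `forward_match` is a hypothetical condensed version of forward maximum matching, and the two dictionaries are made up, with `coarse` differing from `fine` only by the added compound.

```python
def forward_match(sentence, dictionary, max_len):
    """Condensed forward maximum matching with single-character fallback."""
    result = []
    while sentence:
        for size in range(min(max_len, len(sentence)), 0, -1):
            if size == 1 or sentence[:size] in dictionary:
                result.append(sentence[:size])
                sentence = sentence[size:]
                break
    return result


# Hypothetical dictionaries: the second one also contains the compound.
fine = {"香港", "科技", "大学"}
coarse = fine | {"香港科技大学"}

print('/'.join(forward_match("香港科技大学", fine, 6)))    # 香港/科技/大学
print('/'.join(forward_match("香港科技大学", coarse, 6)))  # 香港科技大学
```

Because maximum matching always prefers the longest dictionary entry, adding the compound is enough to change the result; the matching code itself stays the same.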
