Do not reproduce without permission.

Using the KenLM module, with notes on its C++ source

Command for loading the KenLM module

qy@IAT-QYVPN:~/Documents/kenlm/lm$ ../bin/query -n test.arpa

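Notes on the KenLM C++ source

Main entry point of query: query_main.cc

File that implements the query functions: ngram_query.hh

Note:

By default, the call that actually runs is the one at line 96 of query_main.cc: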

Query<ProbingModel>(file, config, sentence_context, show_words);

This is not the code in lm/wrappers/nplm.hh. That wrapper requires the NPLM module, as the code below shows. I overlooked this at first and lost some time here.

#ifdef WITH_NPLM
    } else if (lm::np::Model::Recognize(file)) {
      lm::np::Model model(file);
      if (show_words) {
        Query<lm::np::Model, lm::ngram::FullPrint>(model, sentence_context);
      } else {
        Query<lm::np::Model, lm::ngram::BasicPrint>(model, sentence_context);
      }
#endif

Inheritance hierarchy of the Model class

Root base class: virtual_interface.hh, lm::base::Model

Intermediate base class: facade.hh, lm::base::ModelFacade : public Model

Derived class: model.hh, lm::ngram::GenericModel : public base::ModelFacade<GenericModel<Search, VocabularyT>, State, VocabularyT>
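The layering can be sketched in Python. This is only an analogy, and it rests on an assumption: that the facade's job is to implement the generic base interface (such as the BaseScore method declared in kenlm.pxd) by forwarding to the typed scoring method of the concrete class. The class bodies and the toy scoring rule are invented, and C++ details such as templates and void* states are not modelled.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Stand-in for lm::base::Model: the generic interface callers see."""

    @abstractmethod
    def base_score(self, in_state, word):
        """Return (log_prob, out_state) using opaque state objects."""


class ModelFacade(Model):
    """Stand-in for lm::base::ModelFacade: forwards to the child class."""

    def base_score(self, in_state, word):
        # The facade supplies the generic entry point once, so each
        # concrete model only has to provide its own typed score().
        return self.score(in_state, word)


class ToyModel(ModelFacade):
    """Stand-in for lm::ngram::GenericModel, with an invented rule."""

    def score(self, in_state, word):
        # Invented rule: repeating the previous word is penalised.
        log_prob = -2.0 if in_state == word else -1.0
        return log_prob, word
```

Code that only knows about Model can call base_score on any concrete model, which is the role the Base* methods play for the Python wrapper.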

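A brief note on Cython

Cython official website

You can download the latest version from the official website. The Cython Wiki and Cython FAQ under the Documentation category are useful background reading.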

cython-cpp-test-sample

Wrapping C++ Classes in Cython

cython wrapping of base and derived class

std::string arguments in cython

Cython and constructors of classes

Cython basics: getting started with Cython

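Wrapping KenLM as a Python module

Now to the main topic. The kenlm source already ships a Python binding, in the kenlm/python folder. So why wrap it again? Because the bundled Python module only implements scoring with <s> and </s> included, and provides no method for scoring a sentence without them. That is the reason for re-wrapping the Python module.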
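To make the difference concrete, here is a toy sketch in plain Python. It is not KenLM: the probability tables, the UNK_LOGPROB fallback and both function names are invented for illustration, and a real model also applies backoff weights, which this sketch leaves out.

```python
# Toy bigram log10 probabilities, invented for illustration only.
BIGRAM_LOGPROB = {
    ("<s>", "hello"): -0.5,
    ("hello", "world"): -0.25,
    ("world", "</s>"): -0.75,
}
# Used when there is no context yet (hypothetical unigram table).
UNIGRAM_LOGPROB = {"hello": -1.0, "world": -1.5, "</s>": -2.0}
UNK_LOGPROB = -5.0  # hypothetical fallback for unseen events


def score_with_boundaries(sentence):
    """Score <s> w1 ... wn </s>: start from <s>, finish with </s>."""
    total = 0.0
    previous = "<s>"
    for word in sentence.split() + ["</s>"]:
        total += BIGRAM_LOGPROB.get((previous, word), UNK_LOGPROB)
        previous = word
    return total


def score_without_boundaries(sentence):
    """Score w1 ... wn only: empty starting context, no </s>."""
    total = 0.0
    previous = None
    for word in sentence.split():
        if previous is None:
            total += UNIGRAM_LOGPROB.get(word, UNK_LOGPROB)
        else:
            total += BIGRAM_LOGPROB.get((previous, word), UNK_LOGPROB)
        previous = word
    return total
```

The second function matters when the text being scored is a fragment rather than a complete sentence, since forcing <s> and </s> onto a fragment scores events that never occurred.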

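Steps needed to use the Python module

Install the kenlm.so module into Python's directory. By default, running setup.py from the kenlm directory is enough: sudo python setup.py install --record log

Once the installation succeeds, run python example.py and check the output.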

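How to extend the KenLM Python module

Now for the extension itself. kenlm.pxd is the Cython declaration file for the C++ classes and objects that are used. kenlm.pyx holds the Cython code you actually write, and it defines the classes and methods that Python will call. The cython command compiles kenlm.pxd and kenlm.pyx into kenlm.cpp, and setup.py then uses the generated kenlm.cpp.

Cython compile command: cython --cplus kenlm.pyx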
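Because kenlm.cpp is a generated file, it has to be regenerated whenever kenlm.pyx or kenlm.pxd changes. A small helper along these lines could decide when to rerun the command. needs_regeneration is a hypothetical name, and comparing modification times is just one possible policy.

```python
import os


def needs_regeneration(sources, target):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


# The command from the text, as an argument list for subprocess.run().
CYTHON_COMMAND = ["cython", "--cplus", "kenlm.pyx"]
```

A build script could call needs_regeneration(["kenlm.pyx", "kenlm.pxd"], "kenlm.cpp") and, if it returns True, run CYTHON_COMMAND with subprocess.run(CYTHON_COMMAND, check=True) before invoking setup.py.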

The extended kenlm.pxd file

from libcpp.string cimport string

cdef extern from "lm/word_index.hh":
    ctypedef unsigned WordIndex

cdef extern from "lm/return.hh" namespace "lm":
    cdef struct FullScoreReturn:
        float prob
        unsigned char ngram_length

cdef extern from "lm/state.hh" namespace "lm::ngram":
    cdef struct State:
        pass
    ctypedef State const_State "const lm::ngram::State"

cdef extern from "lm/virtual_interface.hh" namespace "lm::base":
    cdef cppclass Vocabulary:
        WordIndex Index(char*)
        WordIndex BeginSentence()
        WordIndex EndSentence()
        WordIndex NotFound()
    ctypedef Vocabulary const_Vocabulary "const lm::base::Vocabulary"

cdef extern from "lm/model.hh" namespace "lm::ngram":
    cdef cppclass Model:
        const_Vocabulary& GetVocabulary()
        const_State& NullContextState()
        # "except +" turns a C++ exception thrown while loading into a
        # Python exception, which the try/except in kenlm.pyx relies on.
        Model(char* file) except +
        FullScoreReturn FullScore(const_State& in_state, WordIndex new_word, const_State& out_state)
        void BeginSentenceWrite(void *)
        void NullContextWrite(void *)
        unsigned int Order()
        const_Vocabulary& BaseVocabulary()
        float BaseScore(void *in_state, WordIndex new_word, void *out_state)
        FullScoreReturn BaseFullScore(void *in_state, WordIndex new_word, void *out_state)
        void * NullContextMemory()

The extended kenlm.pyx file

import os

cdef bytes as_str(data):
    # Accept bytes or unicode and always return UTF-8 encoded bytes.
    if isinstance(data, bytes):
        return data
    elif isinstance(data, unicode):
        return data.encode('utf8')
    raise TypeError('Cannot convert %s to string' % type(data))

cdef int as_in(int &Num):
    (&Num)[0] = 1

cdef class LanguageModel:
    cdef Model* model
    cdef public bytes path
    cdef const_Vocabulary* vocab

    def __init__(self, path):
        self.path = os.path.abspath(as_str(path))
        try:
            self.model = new Model(self.path)
        except RuntimeError as exception:
            exception_message = str(exception).replace('\n', ' ')
            raise IOError('Cannot read model \'{}\' ({})'.format(path, exception_message))\
                    from exception
        self.vocab = &self.model.GetVocabulary()

    def __dealloc__(self):
        del self.model

    property order:
        def __get__(self):
            return self.model.Order()

    # Total score of the sentence, starting from <s> and ending with </s>.
    def score(self, sentence):
        cdef list words = as_str(sentence).split()
        cdef State state
        self.model.BeginSentenceWrite(&state)
        cdef State out_state
        cdef float total = 0
        for word in words:
            total += self.model.BaseScore(&state, self.vocab.Index(word), &out_state)
            state = out_state
        total += self.model.BaseScore(&state, self.vocab.EndSentence(), &out_state)
        return total

    # Per-word (prob, ngram_length) pairs, with <s> and </s> included.
    def full_scores(self, sentence):
        cdef list words = as_str(sentence).split()
        cdef State state
        self.model.BeginSentenceWrite(&state)
        cdef State out_state
        cdef FullScoreReturn ret
        cdef float total = 0
        for word in words:
            ret = self.model.BaseFullScore(&state,
                self.vocab.Index(word), &out_state)
            yield (ret.prob, ret.ngram_length)
            state = out_state
        ret = self.model.BaseFullScore(&state,
            self.vocab.EndSentence(), &out_state)
        yield (ret.prob, ret.ngram_length)

    # Per-word (prob, ngram_length) pairs, without <s> and </s>:
    # scoring starts from the null context and stops at the last word.
    def full_scores_n(self, sentence):
        cdef list words = as_str(sentence).split()
        cdef State state
        state = self.model.NullContextState()
        cdef State out_state
        cdef FullScoreReturn ret
        cdef int ovv = 0
        for word in words:
            ret = self.model.FullScore(state,
                self.vocab.Index(word), out_state)
            yield (ret.prob, ret.ngram_length)
            state = out_state

    # Compute the total score when <s> and </s> are not included.
    def score_n(self, sentence):
        cdef list words = as_str(sentence).split()
        cdef State state
        state = self.model.NullContextState()
        cdef State out_state
        cdef FullScoreReturn ret
        cdef float total = 0
        for word in words:
            ret = self.model.FullScore(state,
                self.vocab.Index(word), out_state)
            total += ret.prob
            # print(total)
            state = out_state
        return total

    def __contains__(self, word):
        cdef bytes w = as_str(word)
        return (self.vocab.Index(w) != 0)

    def __repr__(self):
        return '<LanguageModel from {0}>'.format(os.path.basename(self.path))

    def __reduce__(self):
        return (LanguageModel, (self.path,))
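Once the extended module is rebuilt and installed, the two scoring modes can be compared side by side. The sketch below relies only on methods defined in kenlm.pyx above (score, score_n and full_scores_n). compare_scores is a hypothetical helper, and it accepts any object that provides those methods, so it does not depend on how the model was loaded.

```python
def compare_scores(model, sentence):
    """Collect both kinds of score for one sentence.

    `model` is expected to behave like the LanguageModel class above.
    """
    per_word = list(model.full_scores_n(sentence))
    return {
        "with_boundaries": model.score(sentence),
        "without_boundaries": model.score_n(sentence),
        # Each entry is (log10 probability, matched n-gram length).
        "per_word": per_word,
    }
```

With the real module this would be called as compare_scores(kenlm.LanguageModel('test.arpa'), 'some sentence'), where test.arpa is the model file from the query command at the top.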
