N-Gram Data Structures
The ARPA n-gram format looks like this:
\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445  
\1-grams:
-5.24036        'cause  -0.2084827
-4.675221       'em     -0.221857
-4.989297       'n      -0.05809768
-5.365303       'til    -0.1855581
-2.111539       </s>    0.0
-99     <s>     -0.7736475
-1.128404       <unk>   -0.8049794
-2.271447       a       -0.6163939
-5.174762       a's     -0.03869072
-3.384722       a.      -0.1877073
-5.789208       a.'s    0.0
-6.000091       aachen  0.0
-4.707208       aaron   -0.2046838
-5.580914       aaron's -0.06230035
-5.789208       aarons  -0.07077657
-5.881973       aaronson        -0.2173971
For a detailed description, see: ARPA n-gram language model format
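To make the column layout of the listing above concrete, here is a minimal sketch of reading one line of an `\N-grams:` section. The names `ParsedNGram` and `parse_ngram_line` are hypothetical and do not come from the original code. The sketch assumes whitespace-separated fields: the log10 probability first, then N words, then an optional log10 back-off weight.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one parsed line; not a type from the original code.
struct ParsedNGram {
    double log_prob = 0.0;           // first column: log10 probability
    std::vector<std::string> words;  // the N words of the entry
    double log_bo = 0.0;             // last column: log10 back-off weight (0 if absent)
};

// Parses one line of an "\N-grams:" section, where `order` is N.
// Returns false (and leaves `out` unchanged) if the line has too few fields.
bool parse_ngram_line(const std::string &line, int order, ParsedNGram &out)
{
    std::istringstream in(line);
    ParsedNGram tmp;

    if (!(in >> tmp.log_prob)) return false;

    for (int i = 0; i < order; ++i) {
        std::string w;
        if (!(in >> w)) return false;
        tmp.words.push_back(w);
    }

    // The back-off weight is optional; highest-order entries usually omit it.
    if (!(in >> tmp.log_bo)) tmp.log_bo = 0.0;

    out = tmp;
    return true;
}
```

For example, passing the first 1-gram line of the listing with `order` set to 1 yields the word `'cause` together with the two numeric columns of that line.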
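An entire ARPA LM is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.
1. The n-gram data structure
A single n-gram entry is defined as follows: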
typedef struct
{
    real        log_prob ;
    real        log_bo ;
    int         *words ;
} ARPALMEntry ;
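words: the words covered by this n-gram, stored as word indices. A 1-gram has a single index; a 2-gram has the indices of its two words.
log_bo: the back-off weight of the n-gram, stored as a log value (the last column of an ARPA line).
log_prob: the log probability of the n-gram, that is, of its last word given the preceding words (the first column of an ARPA line).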
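The way `log_prob` and `log_bo` work together can be sketched with the standard back-off rule for a bigram: if the bigram is listed, use its `log_prob`; otherwise add the back-off weight of the history word to the unigram `log_prob` of the predicted word. This is a sketch under assumptions, not the original implementation: `real` is assumed to be a floating-point typedef, `find_entry` and `bigram_log_prob` are hypothetical helpers, the linear search is only for brevity, and the `-99` floor for a word with no unigram entry is an arbitrary choice.

```cpp
#include <cstddef>
#include <vector>

typedef float real;  // assumption: the original `real` is a floating-point typedef

typedef struct
{
    real        log_prob ;
    real        log_bo ;
    int         *words ;
} ARPALMEntry ;

// Hypothetical helper: linear search for the entry whose first n word ids
// equal ids[0..n-1]. Returns nullptr if no such entry exists.
const ARPALMEntry *find_entry(const std::vector<ARPALMEntry> &table,
                              const int *ids, int n)
{
    for (std::size_t i = 0; i < table.size(); ++i) {
        bool same = true;
        for (int k = 0; k < n; ++k) {
            if (table[i].words[k] != ids[k]) { same = false; break; }
        }
        if (same) return &table[i];
    }
    return nullptr;
}

// Hypothetical helper: log10 P(w2 | w1) with back-off to the unigram.
real bigram_log_prob(const std::vector<ARPALMEntry> &unigrams,
                     const std::vector<ARPALMEntry> &bigrams,
                     int w1, int w2)
{
    const int pair_ids[2] = { w1, w2 };
    if (const ARPALMEntry *hit = find_entry(bigrams, pair_ids, 2))
        return hit->log_prob;                 // the bigram is listed explicitly

    const ARPALMEntry *hist = find_entry(unigrams, &w1, 1);
    const ARPALMEntry *word = find_entry(unigrams, &w2, 1);
    if (!word) return -99.0f;                 // arbitrary floor for this sketch

    // Back off: weight of the history (0 if unknown) plus the unigram log prob.
    const real bo = hist ? hist->log_bo : 0.0f;
    return bo + word->log_prob;
}
```

Because all values are logarithms, the back-off weight is added rather than multiplied.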
2. The ARPA-LM data structure
The whole n-gram language model, made up of many entries, is defined as follows:
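class ARPALM
{
    public:
        Vocabulary     *vocab ;
        int            order ;
        ARPALMEntry    **entries ;    // all entries of the language model; entries[n-1] is the array of n-grams
        int            *n_ngrams ;    // n_ngrams[n-1] is the number of n-gram entries (1-grams, 2-grams, 3-grams, ...)
        char           *unk_wrd ;     // the vocabulary word that is not in the language model
        int            unk_id ;       // ID of that word; set to the last index in the vocabulary
        int            n_unk_words ;
        int            *unk_words ;
    private:
        bool           *words_in_lm ; // boolean array marking whether each word is in the language model
} ;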
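vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: In-memory storage model of the vocabulary
entries: all n-gram entries of the language model, a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] stores the 2-grams, and so on.
n_ngrams: an integer array holding, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
unk_wrd: the vocabulary word that is allowed to be absent from the language model.
unk_id: the ID of that word; it is set to the last word index in the vocabulary.
n_unk_words: counted after the language model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, which means every word in the vocabulary must appear in the language model.
unk_words: stores the indices of the words counted by n_unk_words.
words_in_lm: marks, for each word in the vocabulary, whether it appears in the language model.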
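The relationship between `words_in_lm`, `unk_words` and `n_unk_words` described above can be sketched as follows. This is only an illustration under assumptions: vocabulary word ids are taken to run from 0 to `vocab_size - 1`, membership in the language model is decided from the ids of the 1-gram entries, and `UnkScan` and `scan_unknown_words` are hypothetical names that do not appear in the original class.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type; the field names mirror the ARPALM members above.
struct UnkScan {
    std::vector<bool> words_in_lm;  // one flag per vocabulary word
    std::vector<int>  unk_words;    // ids of vocabulary words missing from the LM
    int               n_unk_words;  // number of such words
};

// vocab_size:  number of words in the vocabulary (assumed ids 0 .. vocab_size-1).
// unigram_ids: the word id of every 1-gram entry read from the language model.
UnkScan scan_unknown_words(int vocab_size, const std::vector<int> &unigram_ids)
{
    UnkScan r;
    r.words_in_lm.assign(vocab_size, false);

    // Mark every vocabulary word that has a 1-gram entry.
    for (std::size_t i = 0; i < unigram_ids.size(); ++i) {
        const int id = unigram_ids[i];
        if (id >= 0 && id < vocab_size) r.words_in_lm[id] = true;
    }

    // Collect the vocabulary words that were never marked.
    for (int w = 0; w < vocab_size; ++w) {
        if (!r.words_in_lm[w]) r.unk_words.push_back(w);
    }
    r.n_unk_words = static_cast<int>(r.unk_words.size());
    return r;
}
```

A loader following the rule above would treat a non-zero `n_unk_words` as an error whenever `unk_wrd` has not been specified.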