A paper I was reading, http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf, uses bits per character as a test metric for estimating the quality of generative computer models of text but doesn't reference how it was calculated. Googling around, I can't really find anything about it.

Does anyone know how to calculate it? Python preferably, but pseudo-code or anything works. Thanks!

asked Jul 22 '13 at 21:40
Newmu
 
   
Are you talking about the stuff defined in C as CHAR_BIT (tigcc.ticalc.org/doc/limits.html#CHAR_BIT)? –  woozyking Jul 23 '13 at 0:31
   
Nope, this is related to information theory and entropy, not actual bit size. – Newmu Jul 23 '13 at 0:46


2 Answers

Bits per character is a measure of the performance of compression methods. It's applied by compressing a string and then measuring how many bits the compressed representation takes in total, divided by how many symbols (i.e. characters) there were in the original string. The fewer bits per character the compressed version takes, the more effective the compression method is.
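As a rough illustration of that definition, here is a minimal sketch that measures bits per character with an off-the-shelf compressor. It uses zlib from the Python standard library purely as a stand-in (the paper does not use zlib), and the function name compressed_bits_per_character is made up for this example.

```python
import zlib


def compressed_bits_per_character(text: str) -> float:
    """Bits in the zlib-compressed form divided by the number of characters.

    zlib is only a stand-in compressor here, not the method from the paper.
    """
    if not text:
        raise ValueError("text must be non-empty")
    compressed = zlib.compress(text.encode("utf-8"), 9)
    # 8 bits per compressed byte, divided by the original character count.
    return 8 * len(compressed) / len(text)


# Highly repetitive text compresses well, so the value is far below 8.
sample = "hello world " * 200
print(compressed_bits_per_character(sample))
```

Note that the result includes zlib's fixed header and checksum overhead, so very short strings can come out above 8 bits per character.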

In other words, the authors use their generative language model, among other things, for compression, and they assume that high effectiveness of the resulting compression method indicates high accuracy of the underlying generative model.

In section 1 they state:

The goal of the paper is to demonstrate the power of large RNNs trained with the new Hessian-Free optimizer by applying them to the task of predicting the next character in a stream of text. This is an important problem because a better character-level language model could improve compression of text files (Rissanen & Langdon, 1979) [...]

The Rissanen & Langdon (1979) article is the original description of arithmetic coding, a well-known method for text compression.

Arithmetic coding operates on the basis of a generative language model, such as the one the authors have built. Given a (possibly empty) sequence of characters, the model predicts what character may come next. Humans can do that, too. For example, given the input sequence "hello w", we can guess probabilities for the next character: "o" has high probability (because "hello world" is a plausible continuation), but characters like "h" as in "hello where can I find.." or "i" as in "hello winston" also have non-zero probability. So we can establish a probability distribution of characters for this particular input, and that's exactly what the authors' generative model does as well.

This fits naturally with arithmetic coding: Given an input sequence that has already been encoded, the bit sequence for the next character is determined by the probability distribution of possible characters: Characters with high probability get a short bit sequence, characters with low probability get a longer sequence. Then the next character is read from the input and encoded using the bit sequence that was determined from the probability distribution. If the language model is good, the character will have been predicted with high probability, so the bit sequence will be short. Then the compression continues with the next character, again using the input so far to establish a probability distribution of characters, determining bit sequences, and then reading the actual next character and encoding it accordingly.
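To make "high probability means short bit sequence" concrete: with an ideal arithmetic coder, a character predicted with probability p costs about -log2(p) bits. The sketch below just evaluates that formula. The probabilities for the "hello w" example are invented for illustration (they do not come from any real model), and code_length_bits is a hypothetical helper name.

```python
import math


def code_length_bits(probability: float) -> float:
    """Ideal code length, in bits, for a symbol with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Invented probabilities for the character following "hello w".
# The remaining probability mass is assumed to sit on other characters.
next_char_probs = {"o": 0.5, "h": 0.125, "i": 0.0625}

for char, p in next_char_probs.items():
    print(char, code_length_bits(p))
```

A real arithmetic coder does not emit a whole number of bits per character; the costs add up fractionally over the full input, which is why the per-character figure can drop below 1 bit.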

Note that the generative model is used in every step to establish a new probability distribution. So this is an instance of adaptive arithmetic coding.

After all input has been read and encoded, the total length (in bits) of the result is measured and divided by the number of characters in the original, uncompressed input. If the model is good, it will have predicted the characters with high accuracy, so the bit sequence used for each character will have been short on average, hence the total bits per character will be low.
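If you only need the number and not the actual compressed output, you can skip the coder and sum the ideal code lengths directly. The following is a sketch under that assumption: the total is the sum of -log2(p) over the probabilities the model assigned to the characters that actually occurred, which is what an ideal arithmetic coder would approach. Whether the paper computed its figure exactly this way is not stated there. The predict callback and the toy add-one-smoothed model are hypothetical stand-ins for your own generative model.

```python
import math
from collections import Counter
from typing import Callable, Dict


def bits_per_character(
    text: str, predict: Callable[[str], Dict[str, float]]
) -> float:
    """Average ideal code length of text under a next-character model.

    predict(context) is a hypothetical interface: it must return a
    distribution over the next character, with non-zero probability
    for every character that can occur.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = 0.0
    for i, char in enumerate(text):
        probs = predict(text[:i])  # distribution given everything seen so far
        total_bits += -math.log2(probs[char])
    return total_bits / len(text)


def make_adaptive_unigram(alphabet: str) -> Callable[[str], Dict[str, float]]:
    """Toy stand-in model: character counts so far, with add-one smoothing."""
    def predict(context: str) -> Dict[str, float]:
        counts = Counter(context)
        total = len(context) + len(alphabet)
        return {c: (counts[c] + 1) / total for c in alphabet}
    return predict


text = "hello world hello world hello world"
model = make_adaptive_unigram("".join(sorted(set(text))))
print(bits_per_character(text, model))
```

The toy model recounts the context at every step, which is quadratic in the input length; that is fine for a sketch but a real model would keep its state between calls.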


Regarding ready-to-use implementations

I am not aware of an implementation of arithmetic coding that allows for easy integration of your own generative language model. Most implementations build their own adaptive model on-the-fly, i.e. they adjust character frequency tables as they read input.

One option for you may be to start with arcode. I looked at the code, and it seems as though it may be possible to integrate your own model, although it's not very easy. The self._ranges member represents the language model, basically as an array of cumulative character frequencies, so self._ranges[ord('d')] is the total relative frequency of all characters that are less than d (i.e. abc if we assume lower-case alphabetic characters only). You would have to modify that array after every input character and map the character probabilities you get from the generative model to character frequency ranges.
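The mapping from model probabilities to cumulative frequency ranges could look roughly like the sketch below. This is not arcode's actual data layout or API; cumulative_ranges, its total parameter, and the (low, high) tuple format are all assumptions made for illustration, and you would need to adapt the result to whatever structure the coder you pick expects.

```python
from typing import Dict, Tuple


def cumulative_ranges(
    probs: Dict[str, float], total: int = 1 << 16
) -> Dict[str, Tuple[int, int]]:
    """Map character probabilities to integer ranges [low, high).

    Hypothetical helper: symbols are laid out in sorted order, and each
    gets a width of at least 1 so that no character becomes unencodable.
    """
    ranges: Dict[str, Tuple[int, int]] = {}
    low = 0
    for symbol in sorted(probs):
        # Scale the probability to an integer frequency, never below 1.
        width = max(1, int(probs[symbol] * total))
        ranges[symbol] = (low, low + width)
        low += width
    return ranges


# low for "c" is the cumulative frequency of all characters below "c".
print(cumulative_ranges({"a": 0.5, "b": 0.25, "c": 0.25}, total=16))
```

Because of the rounding and the minimum width of 1, the widths only approximate the model's probabilities and may not sum exactly to total; the upper bound of the last range gives the actual sum.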

answered Jul 23 '13 at 1:40
jogojapan
 
Excellent introduction and explanation, thank you! –  Newmu Jul 23 '13 at 4:00
   
The model I'm working on generates a probability distribution for the next character given a fixed length of previous characters, so it looks like it'll be some work but I should be able to figure it out. –  Newmu Jul 23 '13 at 4:14
   
@Newmu It should definitely be possible. But I agree it will be easy to make mistakes. You'll need to test this carefully after implementing it. (Unfortunately I have no insider knowledge about arcode. The code seemed to be relatively easy to understand, which is why I suggested this module. There are others, too, though. If you do a search for "python arithmetic coding" or similar, you may be able to find implementations more suitable.) –  jogojapan Jul 23 '13 at 4:31


The sys library has a getsizeof() function; this may be helpful? http://docs.python.org/dev/library/sys

answered Jul 23 '13 at 0:57
ChrisProsser
