How to calculate bits per character of a string? (bpc)


A paper I was reading, http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf, uses bits per character (bpc) as a test metric for estimating the quality of generative computer models of text, but it doesn't say how the metric was calculated. Googling around, I can't really find anything about it. Does anyone know how to calculate it? Python preferably, but pseudo-code or anything works. Thanks!

Bits per character is a measure of the performance of compression methods. It is computed by compressing a string, measuring how many bits the compressed representation takes in total, and dividing that by the number of symbols (i.e. characters) in the original string. The fewer bits per character the compressed version takes, the more effective the compression method is. In other words, the authors use their generative language model, among other things, for compression, and they assume that high effectiveness of the resulting compression method indicates high accuracy of the underlying generative model. In section 1 they state:
The Rissanen & Langdon (1979) article is the original description of arithmetic coding, a well-known method for text compression. Arithmetic coding operates on the basis of a generative language model, such as the one the authors have built. Given a (possibly empty) sequence of characters, the model predicts what character may come next. Humans can do that, too, for example given the input sequence [...]

This fits naturally with arithmetic coding: given an input sequence that has already been encoded, the bit sequence for the next character is determined by the probability distribution of possible characters. Characters with high probability get a short bit sequence; characters with low probability get a longer one. Then the next character is read from the input and encoded using the bit sequence that was determined from the probability distribution. If the language model is good, the character will have been predicted with high probability, so the bit sequence will be short.

The compression then continues with the next character, again using the input so far to establish a probability distribution of characters, determining bit sequences, and then reading the actual next character and encoding it accordingly. Note that the generative model is used in every step to establish a new probability distribution, so this is an instance of adaptive arithmetic coding.

After all input has been read and encoded, the total length (in bits) of the result is measured and divided by the number of characters in the original, uncompressed input. If the model is good, it will have predicted the characters with high accuracy, so the bit sequence used for each character will have been short on average, and hence the total bits per character will be low.

Regarding ready-to-use implementations: I am not aware of an implementation of arithmetic coding that allows for easy integration of your own generative language model. Most implementations build their own adaptive model on the fly, i.e. they adjust character frequency tables as they read input. One option for you may be to start with arcode. I looked at the code, and it seems as though it may be possible to integrate your own model, although it's not very easy.
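The procedure above can be approximated without running a real arithmetic coder. An ideal arithmetic coder spends roughly -log2(p) bits on a character that the model predicted with probability p, so bits per character is approximately the average of -log2(p) over the text. The following is only a minimal sketch under that assumption: the toy adaptive model (add-one smoothed character counts over a fixed alphabet) stands in for the paper's model, and the names `AdaptiveCharModel` and `bits_per_character` are made up for illustration.

```python
import math
from collections import Counter


class AdaptiveCharModel:
    """Toy stand-in for a generative model: add-one smoothed character counts.

    Hypothetical helper for illustration, not the model from the paper.
    """

    def __init__(self, alphabet):
        self.alphabet = set(alphabet)
        self.counts = Counter()
        self.total = 0

    def prob(self, char):
        # Probability of `char` given the characters seen so far (order-0 model).
        if char not in self.alphabet:
            raise KeyError(f"character {char!r} is not in the alphabet")
        return (self.counts[char] + 1) / (self.total + len(self.alphabet))

    def update(self, char):
        # Adapt the model after the character has been "encoded".
        self.counts[char] += 1
        self.total += 1


def bits_per_character(text, model):
    """Average ideal code length: -(1/N) * sum of log2 p(char | history)."""
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = 0.0
    for char in text:
        total_bits += -math.log2(model.prob(char))  # cost of this character
        model.update(char)                          # then let the model adapt
    return total_bits / len(text)


text = "hello world, hello world, hello world"
model = AdaptiveCharModel(alphabet=set(text))
print(f"{bits_per_character(text, model):.3f} bits per character")
```

To evaluate a different model, only `prob` and `update` would need to change; a model that conditions on the full history would return its predicted probability of the actual next character in `prob`.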


The sys library has a getsizeof() function; this may be helpful? http://docs.python.org/dev/library/sys
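One caveat: `sys.getsizeof()` reports the memory footprint of a Python object, including interpreter overhead, not the length of a compressed encoding. If the compression-based definition of bpc is what is wanted, a rough alternative is to compress the text with a general-purpose compressor and divide the output size in bits by the number of characters. The sketch below assumes `zlib` as the compressor and UTF-8 as the encoding; the helper name `compressed_bpc` is hypothetical, and the result is only a coarse estimate, since fixed header overhead dominates for short strings.

```python
import sys
import zlib


def compressed_bpc(text, level=9):
    """Bits of zlib-compressed UTF-8 output per character of `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    compressed = zlib.compress(text.encode("utf-8"), level)
    return 8 * len(compressed) / len(text)


sample = "the quick brown fox jumps over the lazy dog. " * 200
print(f"zlib: {compressed_bpc(sample):.3f} bits per character")
# For contrast: getsizeof measures the in-memory object, not a compressed form.
print(f"getsizeof: {8 * sys.getsizeof(sample) / len(sample):.3f} bits per character")
```

A general-purpose compressor like this says nothing about a particular generative model; it only gives a baseline figure to compare a model-based bpc against.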

