How to calculate bits per character of a string? (bpc)

A paper I was reading, http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf, uses bits per character as a test metric for estimating the quality of generative computer models of text but doesn't reference how it was calculated. Googling around, I can't really find anything about it.

Does anyone know how to calculate it? Python preferably, but pseudo-code or anything works. Thanks!

python algorithm machine-learning nlp entropy

asked Jul 22 '13 at 21:40 by Newmu
edited Jul 23 '13 at 1:41 by jogojapan

Are you talking about CHAR_BIT as defined in C (tigcc.ticalc.org/doc/limits.html#CHAR_BIT)? – woozyking Jul 23 '13 at 0:31

Nope, this is related to information theory and entropy, not actual bit size. – Newmu Jul 23 '13 at 0:46


2 Answers


Accepted answer (5 votes)

Bits per character is a measure of the performance of compression methods. It's applied by compressing a string and then measuring how many bits the compressed representation takes in total, divided by how many symbols (i.e. characters) there were in the original string. The fewer bits per character the compressed version takes, the more effective the compression method is.
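As a rough illustration of the definition above, here is a minimal sketch that compresses a string with a general-purpose compressor and divides the compressed size in bits by the number of characters. Using zlib is an assumption made for the example only; the paper's model is not zlib, and the function name is hypothetical.

```python
import zlib


def compressed_bits_per_character(text: str, level: int = 9) -> float:
    """Bits in the zlib-compressed form divided by the number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    compressed = zlib.compress(text.encode("utf-8"), level)
    return 8 * len(compressed) / len(text)


# Repetitive text compresses well, so its bits per character are low.
sample = "hello world, hello where can I find it, hello winston. " * 50
print(f"{compressed_bits_per_character(sample):.3f} bits per character")
```

For very short strings the fixed overhead of the zlib container dominates, so the figure is only meaningful on reasonably long inputs.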

In other words, the authors use their generative language model, among other things, for compression, and assume that high effectiveness of the resulting compression method indicates high accuracy of the underlying generative model.

In section 1 they state:

The goal of the paper is to demonstrate the power of large RNNs trained with the new Hessian-Free optimizer by applying them to the task of predicting the next character in a stream of text. This is an important problem because a better character-level language model could improve compression of text files (Rissanen & Langdon, 1979) [...]

The Rissanen & Langdon (1979) article is the original description of arithmetic coding, a well-known method for text compression.

Arithmetic coding operates on the basis of a generative language model, such as the one the authors have built. Given a (possibly empty) sequence of characters, the model predicts what character may come next. Humans can do that, too: for example, given the input sequence hello w, we can guess probabilities for the next character. o has high probability (because hello world is a plausible continuation), but characters like h, as in hello where can I find..., or i, as in hello winston, also have non-zero probability. So we can establish a probability distribution of characters for this particular input, and that's exactly what the authors' generative model does as well.
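To make the "hello w" example concrete, here is a small sketch of a count-based model that returns a distribution over the next character given the preceding characters. It is a toy stand-in, not the RNN from the paper; the function names, the tiny corpus, and the fixed context length of 7 are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_counts(corpus: str, order: int = 7) -> dict:
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(corpus)):
        counts[corpus[i - order:i]][corpus[i]] += 1
    return counts


def next_char_distribution(counts: dict, context: str, order: int = 7) -> dict:
    """Relative frequencies of the characters seen after the given context."""
    seen = counts.get(context[-order:], Counter())
    total = sum(seen.values())
    return {ch: n / total for ch, n in seen.items()} if total else {}


# Hypothetical training text built from the continuations mentioned above.
corpus = "hello world. hello world! hello where can I find it? hello winston. "
counts = train_counts(corpus)
dist = next_char_distribution(counts, "hello w")
print(dist)
```

A real model would also assign some probability to characters it has never seen after a context; this sketch simply returns nothing for them.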

This fits naturally with arithmetic coding: Given an input sequence that has already been encoded, the bit sequence for the next character is determined by the probability distribution of possible characters: Characters with high probability get a short bit sequence, characters with low probability get a longer sequence. Then the next character is read from the input and encoded using the bit sequence that was determined from the probability distribution. If the language model is good, the character will have been predicted with high probability, so the bit sequence will be short. Then the compression continues with the next character, again using the input so far to establish a probability distribution of characters, determining bit sequences, and then reading the actual next character and encoding it accordingly.
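The relation between probability and code length can be sketched numerically. An ideal arithmetic coder spends about -log2(p) bits on a character that the model predicted with probability p (it does not literally emit a separate whole-bit codeword per character, so this is the average cost). The distribution below is made up for illustration.

```python
import math


def ideal_code_length_bits(probability: float) -> float:
    """Approximate cost in bits of coding an event with this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical next-character distribution after some context.
distribution = {"o": 0.5, "h": 0.25, "i": 0.125, "a": 0.125}
for ch, p in distribution.items():
    print(f"{ch!r}: p={p} costs {ideal_code_length_bits(p)} bits")
```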

Note that the generative model is used in every step to establish a new probability distribution. So this is an instance of adaptive arithmetic coding.

After all input has been read and encoded, the total length (in bits) of the result is measured and divided by the number of characters in the original, uncompressed input. If the model is good, it will have predicted the characters with high accuracy, so the bit sequence used for each character will have been short on average, hence the total bits per character will be low.
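If you only need the number and not the compressed file, the total length can be estimated without running a coder at all: sum -log2 of the probability the model gave to each actual character, then divide by the number of characters. The sketch below assumes a hypothetical model interface (a callable that takes the text so far and returns a dict of next-character probabilities); it is not taken from the paper.

```python
import math
from typing import Callable, Dict

# Assumed interface: context string in, {character: probability} out.
Model = Callable[[str], Dict[str, float]]


def bits_per_character(text: str, model: Model) -> float:
    """Average of -log2 p(actual next character | preceding text)."""
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = 0.0
    for i, ch in enumerate(text):
        p = model(text[:i]).get(ch, 0.0)
        if p <= 0.0:
            raise ValueError(f"model gave zero probability to {ch!r} at {i}")
        total_bits += -math.log2(p)
    return total_bits / len(text)


def uniform_model(context: str) -> Dict[str, float]:
    """Toy model: ignores the context, four equally likely characters."""
    alphabet = "abcd"
    return {c: 1.0 / len(alphabet) for c in alphabet}


print(bits_per_character("abcabc", uniform_model))
```

A model that knows nothing beyond a four-letter alphabet needs 2 bits per character; a better model pushes the figure lower.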

Regarding ready-to-use implementations

I am not aware of an implementation of arithmetic coding that allows for easy integration of your own generative language model. Most implementations build their own adaptive model on-the-fly, i.e. they adjust character frequency tables as they read input.
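For comparison, the kind of on-the-fly model described above can be sketched as a frequency table that is updated after every character. This is a generic illustration with add-one smoothing, not the code of any particular library; it assumes every character belongs to an alphabet of alphabet_size symbols, and it reports the ideal code length rather than producing actual compressed output.

```python
import math
from collections import Counter


def adaptive_order0_bpc(text: str, alphabet_size: int = 256) -> float:
    """Ideal bits per character under an adaptive frequency-table model."""
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for ch in text:
        # Probability from the counts so far, with add-one smoothing.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)
        # Only now update the table, exactly as a decoder could.
        counts[ch] += 1
        seen += 1
    return total_bits / len(text)


print(adaptive_order0_bpc("hello world, hello winston"))
```

The table is updated only after a character has been costed, so a decoder that sees the same characters in the same order can rebuild the same table.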

One option for you may be to start with arcode. I looked at the code, and it seems as though it may be possible to integrate your own model, although it's not very easy. The self._ranges member represents the language model, basically as an array of cumulative character frequencies, so self._ranges[ord('d')] is the total relative frequency of all characters that are less than d (i.e. a, b, c if we assume lower-case alphabetic characters only). You would have to modify that array after every input character and map the character probabilities you get from the generative model to character frequency ranges.
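The mapping from model probabilities to cumulative frequency ranges might look roughly like the sketch below. It is a standalone illustration and not arcode's API: the function name, the half-open (low, high) representation, and the total of 65536 slots are assumptions. The low bound of each character plays the role of the cumulative frequency of all smaller characters described above.

```python
from typing import Dict, Tuple


def probabilities_to_ranges(
    probs: Dict[str, float], total: int = 1 << 16
) -> Dict[str, Tuple[int, int]]:
    """Map probabilities to half-open integer ranges [low, high) in [0, total)."""
    symbols = sorted(probs)
    # Every symbol gets at least one slot so it stays encodable.
    widths = {s: max(1, int(probs[s] * total)) for s in symbols}
    # Let the most likely symbol absorb the rounding error.
    largest = max(symbols, key=lambda s: widths[s])
    widths[largest] += total - sum(widths.values())
    if widths[largest] < 1:
        raise ValueError("total is too small for this many symbols")
    ranges = {}
    low = 0
    for s in symbols:
        ranges[s] = (low, low + widths[s])
        low += widths[s]
    return ranges


print(probabilities_to_ranges({"o": 0.5, "h": 0.25, "i": 0.25}))
```

You would rebuild these ranges from the model's output before coding each character, which is the per-step modification mentioned above.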

answered Jul 23 '13 at 1:40 by jogojapan
edited Jul 23 '13 at 11:01

Excellent introduction and explanation, thank you! – Newmu Jul 23 '13 at 4:00

The model I'm working on generates a probability distribution for the next character given a fixed length of previous characters, so it looks like it'll be some work but I should be able to figure it out. – Newmu Jul 23 '13 at 4:14

@Newmu It should definitely be possible. But I agree it will be easy to make mistakes. You'll need to test this carefully after implementing it. (Unfortunately I have no insider knowledge about arcode. The code seemed to be relatively easy to understand, which is why I suggested this module. There are others, too, though. If you do a search for "python arithmetic coding" or similar, you may be able to find implementations more suitable.) – jogojapan Jul 23 '13 at 4:31


Answer (0 votes)

The sys module has a getsizeof() function; this may be helpful: http://docs.python.org/dev/library/sys
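For reference, a quick sketch of what that function reports. sys.getsizeof() returns the size of the Python object in memory, in bytes, including interpreter overhead, so dividing by the length gives a memory footprint per character rather than the information-theoretic quantity discussed in the question's comments. The exact number depends on the Python implementation and version.

```python
import sys

text = "hello world"
# In-memory size of the str object in bits, divided by its length.
memory_bits_per_char = 8 * sys.getsizeof(text) / len(text)
print(memory_bits_per_char)
```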

answered Jul 23 '13 at 0:57 by ChrisProsser


http://stackoverflow.com/questions/17797922/how-to-calculate-bits-per-character-of-a-string-bpc
