From:  johndcook.com/blog

For a set of positive probabilities pi summing to 1, their entropy is defined as

(For this post, log will mean log base 2, not natural log.)

This post looks at a couple questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?

Second, imagine you don’t have the pi values directly but rather counts ni that sum to N. Then pi = ni/N. To apply the equation directly, you’d first need to compute N, then go back through the data again to compute entropy. If you have a large amount of data, could you compute the entropy in one pass?

To address the second question, note that

so you can sum ni and ni log ni in the same pass.

One of the things you learn in numerical analysis is to look carefully at subtractions. Subtracting two nearly equal numbers can result in a loss of precision. Could the numbers above be nearly equal? Maybe if the ni are ridiculously large. Not just astronomically large — astronomically large numbers like the number of particles in the universe are fine — but ridiculously large, numbers whose logarithms approach the limits of machine-representable numbers. (If we’re only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers).

Now to the problem of computing the sum of ni log ni. Could the order of the terms matter? This also applies to the first question of the post if we look at summing the pi log pi. In general, you’ll get better accuracy summing a lot positive numbers by sorting them and adding from smallest to largest and worse accuracy by summing largest to smallest. If adding a sorted list gives essentially the same result when summed in either direction, summing the list in any other order should too.

To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. I convert the counts to probabilities and use the counts directly, and I sum smallest to largest and largest to smallest.

For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.

[转载] Calculating Entropy的更多相关文章

  1. 【转载】7 Steps for Calculating the Largest Lyapunov Exponent of Continuous Systems

    原文地址:http://sprott.physics.wisc.edu/chaos/lyapexp.htm The usual test for chaos is calculation of the ...

  2. [转载] TLS协议分析 与 现代加密通信协议设计

    https://blog.helong.info/blog/2015/09/06/tls-protocol-analysis-and-crypto-protocol-design/?from=time ...

  3. 转载:scikit-learn学习之决策树算法

    版权声明:<—— 本文为作者呕心沥血打造,若要转载,请注明出处@http://blog.csdn.net/gamer_gyt <—— 目录(?)[+] ================== ...

  4. 【转载】计算机视觉(CV)前沿国际国内期刊与会议

    计算机视觉(CV)前沿国际国内期刊与会议这里的期刊大部分都可以通过上面的专家们的主页间接找到1.国际会议 2.国际期刊 3.国内期刊 4.神经网络 5.CV 6.数字图象 7.教育资源,大学 8.常见 ...

  5. 【转载】nginx 并发数问题思考:worker_connections,worker_processes与 max clients

    注:这个文章主要是作者一直在研究nginx作为http server和反向代理服务器时候所谓最大的max_clients和 worker_connections的计算公式, 其实最后的结论也没有卡上公 ...

  6. ZOJ3827 ACM-ICPC 2014 亚洲区域赛的比赛现场牡丹江I称号 Information Entropy 水的问题

    Information Entropy Time Limit: 2 Seconds      Memory Limit: 131072 KB      Special Judge Informatio ...

  7. 最大似然估计 (Maximum Likelihood Estimation), 交叉熵 (Cross Entropy) 与深度神经网络

    最近在看深度学习的"花书" (也就是Ian Goodfellow那本了),第五章机器学习基础部分的解释很精华,对比PRML少了很多复杂的推理,比较适合闲暇的时候翻开看看.今天准备写 ...

  8. 基于Spark自动扩展scikit-learn (spark-sklearn)(转载)

    转载自:https://blog.csdn.net/sunbow0/article/details/50848719 1.基于Spark自动扩展scikit-learn(spark-sklearn)1 ...

  9. softmax,softmax loss和cross entropy的区别

     版权声明:本文为博主原创文章,未经博主允许不得转载. https://blog.csdn.net/u014380165/article/details/77284921 我们知道卷积神经网络(CNN ...

随机推荐

  1. ElasticSearch介绍 【未完成】

    ElasticSearch应用于搜索是一个不错的选择,虽有Lucene,但ELK的搜索方便. http://joelabrahamsson.com/elasticsearch-101/ 一.下载 ht ...

  2. postgresql 配置文件优化

    postgresql 配置文件优化 配置文件 默认的配置配置文件是保存在/etc/postgresql/VERSION/main目录下的postgresql.conf文件 如果想查看参数修改是否生效, ...

  3. 技术文档--svn

    1.什么是版本控制,说出常见的版本控制系统及其区别版本控制它是一种软件工程籍以在开发的过程中,确保由不同人所编辑的同一档案都得到更新,它透过文档控制记录程序各个模块的改动,并为每次改动编上序号,并且编 ...

  4. C++类库介绍

    如果你有一定的C基础可能学起来比较容易些,但是学习C++的过程中又要尽量避免去使用一些C中的思想:平时还要多看一些高手写的代码,遇到问题多多思考,怎样才能把问题抽象化,以使自己头脑中有类的概念:最后别 ...

  5. SQLSERVER执行性能统计工具SQLQueryStress

    SQLSERVER执行时间统计工具SQLQueryStress 有时候需要检测一下SQL语句的执行时间,相信大家都会用SET STATISTICS TIME ON开关打开SQLSERVER内置的时间统 ...

  6. centos7 安装 notejs

    1.安装集成工具 yum -y install gcc make gcc-c++ 2.安装notejs 自行选择版本:https://nodejs.org/dist/ wget https://nod ...

  7. [Java Web] 2、Web开发中的一些架构

    1.企业开发架构: 企业平台开发大量采用B/S开发模式,不管采用何种动态Web实现手段,其操作形式都是一样的,其核心操作的大部分都是围绕着数据库进行的.但是如果使用编程语言进行数据库开发,要涉及很多诸 ...

  8. ASP.NET 5系列教程 (六): 在 MVC6 中创建 Web API

    ASP.NET 5.0 的主要目标之一是统一MVC 和 Web API 框架应用. 接下来几篇文章中您会了解以下内容: ASP.NET MVC 6 中创建简单的web API. 如何从空的项目模板中启 ...

  9. 《介绍一款开源的类Excel电子表格软件》续:七牛云存储实战(C#)

    两个月前的发布的博客<介绍一款开源的类Excel电子表格软件>引起了热议:在博客园有近2000个View.超过20个评论. 同时有热心读者电话咨询如何能够在SpreadDesing中实现存 ...

  10. vs emulator for android使用

    在windows上调试android程序,可以利用hyperv虚拟化功能,微软也提供了模拟工具和android studio.eclipse的配置说明,不再累述. 关于启动vs模拟器的cmd命令: e ...