Descriptive Stats + percentiles in numpy and scipy.stats

https://dev.to/sayemmh/descriptive-stats-percentiles-in-numpy-and-scipystats-59a7

Abbreviations of Statistics:

CDF vs. PDF: What’s the Difference?, by Zach Bobbitt, posted on June 13, 2019

PDF: P.D.F.(Probability Density Function)
CDF: C.D.F.(Cumulative Distribution Function)

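Quantiles are defined through the C.D.F., not the P.D.F.: the quantile function is the inverse of the C.D.F.

Definition: a quantile is the point of a continuous distribution function that corresponds to a given probability.

For a probability $0<\alpha<1$, the $\alpha$-quantile $P_\alpha$ of a random variable $X$ (or of its probability distribution) is the real number that satisfies $P(X \le P_\alpha) = \alpha$ [1].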
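As a small illustration of this definition, the sketch below treats the quantile as the inverse of the C.D.F. It assumes a standard normal distribution and uses scipy.stats.norm, whose ppf method is SciPy's quantile function; the distribution and the value alpha = 0.975 are chosen only for illustration.

```python
from scipy.stats import norm

alpha = 0.975
# ppf ("percent point function") is SciPy's name for the quantile function
q = norm.ppf(alpha)
# Applying the C.D.F. to the quantile recovers the probability
recovered = norm.cdf(q)
print(round(q, 4), round(recovered, 4))
```

Going from alpha to q and back to alpha is what "inverse of the C.D.F." means in practice.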

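Quantile (also called a quantile point)

Quantiles are the cut points that divide the range of a random variable's probability distribution into several equal parts. Commonly used ones are the median (the 2-quantile), quartiles, and percentiles.

Median

The median is the number at the middle position of a data sequence sorted in order. It is a technical term in statistics.

It is a single value representing a sample, a population, or a probability distribution that splits the set of values into an upper and a lower part of equal size.

For a finite set of numbers, sort all observations in ascending order and take the one in the exact middle as the median.
If the number of observations is even, the median is usually taken as the mean of the two middle values.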
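The odd and even cases can be checked with a short sketch. It uses numpy.median; the two small lists are made-up data for illustration.

```python
import numpy as np

odd = [7, 1, 5]        # sorted: 1, 5, 7 -> the middle value is 5
even = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7 -> the mean of 3 and 5 is 4

median_odd = float(np.median(odd))
median_even = float(np.median(even))
print(median_odd, median_even)
```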

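Quartile (also called a quartile point)

In statistics, the quartiles are the values at the three cut points obtained when all values are sorted in ascending order and divided into four equal parts.

They are used mostly for drawing box plots. They are the values at the 25%, 50%, and 75% positions of the sorted data.

The quartiles split the data into 4 equal parts through 3 points, each part containing 25% of the data.
The middle quartile is the median, so "the quartiles" usually refers to:
- the lower quartile: the value at the 25% position,
- the upper quartile: the value at the 75% position.
To compute quartiles from ungrouped data:
- first sort the data,
- then determine the positions of the quartiles,
- the values at those positions are the quartiles.

This is broadly the same as computing the median. Unlike the median, however, there are several ways to determine the quartile positions, and each gives slightly different results, though the differences are usually small. [1]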
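To make the point about differing methods concrete, here is a minimal sketch comparing two conventions from Python's standard library (statistics.quantiles with method="exclusive" and method="inclusive") with NumPy's default. The data set is made up for illustration.

```python
import statistics
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Two conventions offered by the standard library
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
# NumPy's default (linear interpolation)
q_numpy = [float(v) for v in np.percentile(data, [25, 50, 75])]

print(q_exclusive)
print(q_inclusive)
print(q_numpy)
```

On this input all three agree on the median (4.5), while the lower and upper quartiles differ slightly between the exclusive convention and the other two.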

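The function $\large percentile(P)\text{ where } P \in [0, 1]$

It finds a target value $\large V = percentile(P)$ in the data set

such that, over the whole data set:

at least (P)*100% of the data is less than or equal to $\large V$;

at least (1 - P)*100% of the data is greater than or equal to $\large V$.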
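The two "at least" conditions can be tested directly. The helper below, satisfies_percentile_property, is a hypothetical name introduced here for illustration; it counts the values on each side of a candidate V and uses a small tolerance to absorb floating-point noise.

```python
def satisfies_percentile_property(data, p, v, tol=1e-9):
    """Check the two "at least" conditions for a candidate value v."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= v)
    at_or_above = sum(1 for x in data if x >= v)
    return at_or_below >= p * n - tol and at_or_above >= (1 - p) * n - tol

data = [3, 2, 2, 1, 1]
ok_for_2 = satisfies_percentile_property(data, 0.5, 2)  # 4 values <= 2, 3 values >= 2
ok_for_1 = satisfies_percentile_property(data, 0.5, 1)  # only 2 values <= 1
print(ok_for_2, ok_for_1)
```

With P = 0.5, the value 2 satisfies both conditions and the value 1 does not.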

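percentile() is computed from the sorted data (the empirical C.D.F.), not from the P.D.F.

Sort the data set into a sequence;
use the percentage to determine the position (rank) of the target value;
finally, use that position to look up values in the sequence and compute the target percentile.

Suppose the data set $\large D$ (the sample) has $\large N$ values in total, and we want the percentile for $\large P$:

Sort the data set: arrange it in ascending order as a sequence (for example a pandas.Series) and call it $\large S$, indexed from 1;
Determine the target position: $ Index = N * P$;
Use $\large Index$ as the subscript into $\large S$ and compute the percentile:
- If $\large Index$ is fractional, round it up to the next integer and take that single element of $\large S$:
  
  $\large percentile(P) = S[ \lceil Index \rceil ]$.
- If $\large Index$ is an integer, take the element at that position and the one after it, and average them:
  
  $\large percentile(P) = \frac{S[Index] + S[Index + 1]}{2}$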
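The rule above can be sketched in a few lines of Python. The function name percentile_by_rank is hypothetical. Positions in S are treated as 1-based, as in the text. The handling of the extremes (Index = 0 or Index = N, where one neighbor does not exist) is an assumption added here, since the text does not cover those cases.

```python
import math

def percentile_by_rank(data, p):
    """Percentile by the rule above; positions in S are 1-based."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    s = sorted(data)
    n = len(s)
    # Round away tiny floating-point noise, e.g. 0.1 * 3 = 0.30000000000000004
    index = round(n * p, 9)
    if index != int(index):
        # Fractional index: round up, then convert to a 0-based subscript
        return s[math.ceil(index) - 1]
    k = int(index)
    # Assumption: at the extremes one neighbor is missing, so fall back
    # to the first or last element
    if k == 0:
        return s[0]
    if k == n:
        return s[-1]
    # Integer index: average S[k] and S[k + 1] (1-based)
    return (s[k - 1] + s[k]) / 2

print(percentile_by_rank([3, 2, 2, 1, 1], 0.5))
print(percentile_by_rank([3, 2, 2, 1, 1], 0.8))
```

This is only one of several percentile conventions; library functions such as numpy.percentile interpolate differently by default and can return different values on the same data.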

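quartile() is a special case of percentile()

quartile() is the quartile function: it computes the 25%, 50%, and 75% percentiles, which split the data set into four parts:
\[\large ( percentile(0.25), percentile(0.50), percentile(0.75) )
\]
quartile() is very useful, for example for box plots and the interquartile range.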
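In pandas, the same triple can be requested in one call with Series.quantile. A minimal sketch on made-up data follows; note that pandas interpolates linearly by default, so the values need not match a rank-based rule.

```python
import pandas as pd

s = pd.Series([8, 3, 1, 6, 2, 7, 5, 4])
# Ask for the 25%, 50%, and 75% quantiles together
quartiles = s.quantile([0.25, 0.50, 0.75])
print(quartiles.tolist())
```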

Worked examples: finding the P percentile

Q.: "Find the 50th percentile of the data set 3, 2, 2, 1, 1."

Answer:

We have N = 5 and P = 0.5 (because 0.5 = 50/100).
1. Build the sorted sequence S: "1, 1, 2, 2, 3"
2. Compute the index: $\large Index = N * P = 5 * 0.5 = 2.5 $
3. The index is fractional, so round up: $\large percentile(0.5) = S[ \lceil 2.5 \rceil ] = S[3] = 2 $

If P = 0.8, then:

$\large Index = N * P = 5 * 0.8 = 4 $

The index is an integer, so average two neighbors: $\large percentile(0.8) = \frac{S[4] +S[4+1]}{2}= \frac{2 + 3}{2} = 2.5 $

Q.: "Find percentile(43%) and percentile(80%) of the data set 1, 2, 3, 4, 5, 6, 7, 8, 9, 10."

Answer:

For "percentile(43%)":

We have N = 10 and P = 0.43.
1. Build the sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
2. Compute the index: $\large Index = N * P = 10 * 0.43 = 4.3 $
3. The index is fractional, so round up: $\large percentile(43\%) = S[ \lceil 4.3 \rceil ] = S[5] = 5 $

For "percentile(80%)":

We have N = 10 and P = 0.80.
1. Build the sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
2. Compute the index: $\large Index = N * P = 10 * 0.80 = 8 $
3. The index is an integer, so average two neighbors: $\large percentile(80\%) = \frac{S[8] +S[8+1]}{2} =\frac{8 + 9}{2} = 8.5 $
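For comparison, the sketch below puts the hand-computed answers from these examples next to what numpy.percentile returns with its default linear interpolation on the same inputs. The dictionary keys are labels invented here for illustration.

```python
import numpy as np

small = [3, 2, 2, 1, 1]
ten = list(range(1, 11))

# Hand-computed answers from the worked examples
by_hand = {"small_50": 2, "small_80": 2.5, "ten_43": 5, "ten_80": 8.5}

# NumPy's default (linear interpolation) on the same inputs
by_numpy = {
    "small_50": float(np.percentile(small, 50)),
    "small_80": float(np.percentile(small, 80)),
    "ten_43": float(np.percentile(ten, 43)),
    "ten_80": float(np.percentile(ten, 80)),
}
for key in by_hand:
    print(key, by_hand[key], by_numpy[key])
```

Only the first case agrees exactly. The others differ because NumPy interpolates at position (N - 1) * P instead of rounding N * P, which is a reminder that "the percentile" depends on the convention in use.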


Sayem Hoque, Posted on Oct 13, 2022 • Updated on Nov 16, 2022

Descriptive stats + percentiles in numpy and scipy.stats

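To get the measures of central tendency in a pandas DataFrame, we can use the built-in methods to calculate the mean, median, and mode:

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv("data.csv")

# Central tendency of each numeric column
df.mean(numeric_only=True)
df.median(numeric_only=True)
# mode() can return several rows when values tie
df.mode()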

To measure dispersion and shape, we can use built-in functions to calculate the standard deviation, variance, interquartile range, and skewness.

A low standard deviation means the data tends to be bunched closely around the mean; a high one means it is spread out. The IQR is the difference between the 75th and 25th percentiles; scipy.stats is used to calculate it. Skew refers to how asymmetric a distribution is about its mean. A perfectly symmetric, unimodal distribution has equal mean, median, and mode.

from scipy.stats import iqr

# Spread of each numeric column
df.std(numeric_only=True)
df.var(numeric_only=True)
# Interquartile range (75th minus 25th percentile) of one column
iqr(df['column1'])
# Skewness of each numeric column
df.skew(numeric_only=True)
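Since data.csv is not available here, the following self-contained sketch runs the same calls on a small, hypothetical column in which one large value pulls the mean away from the median.

```python
import pandas as pd
from scipy.stats import iqr

# Hypothetical data: a single outlier (100) skews the column to the right
df = pd.DataFrame({"column1": [1, 2, 3, 4, 100]})

mean = df["column1"].mean()
median = df["column1"].median()
spread = iqr(df["column1"])       # 75th percentile minus 25th percentile
skewness = df["column1"].skew()   # positive for a right-skewed column
print(mean, median, spread, skewness)
```

The mean (22) lands far from the median (3), while the IQR is barely affected by the outlier; that robustness is why the IQR is often reported alongside the standard deviation.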

from scipy import stats

stats.percentileofscore([1, 2, 3, 4], 3)
>> 75.0

The result of percentileofscore is the percentage of values in a distribution that are less than or equal to the target (with the default kind='rank', ties with the target are averaged). In this case, 1, 2, and 3 are <= 3, so 3 of the 4 values, or 75%, are at or below it.
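The kind parameter matters once the target is tied with values in the data. A short sketch, using a made-up list in which 3 appears twice, compares the four kinds that scipy.stats.percentileofscore accepts.

```python
from scipy import stats

scores = [1, 2, 3, 3, 4]
by_kind = {
    kind: float(stats.percentileofscore(scores, 3, kind=kind))
    for kind in ("rank", "weak", "strict", "mean")
}
print(by_kind)
```

Here 'weak' counts values <= 3 (80%), 'strict' counts values < 3 (40%), 'mean' averages those two (60%), and 'rank' averages the percentage ranks of the tied entries (70%).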

numpy.percentile goes the other way from stats.percentileofscore: it takes a percentile and returns a value, although the two are not exact inverses because they use different conventions. numpy.percentile takes a parameter q and returns the q-th percentile of an array of elements. The function sorts the array and maps q onto a fractional position between the first and last element: with the default method, that position is (n - 1) * q / 100, counting from 0, for n elements. The percentile is then computed from the two nearest neighbors of that position. The method parameter (named interpolation before NumPy 1.22) controls how the two neighbors are combined. The default is linear interpolation, which weights the two neighbors by the fractional part of the position (a plain average only when the position falls exactly halfway).

Example:

arr = [0,1,2,3,4,5,6,7,8,9,10]

print("50th percentile of arr : ", np.percentile(arr, 50))
print("25th percentile of arr : ", np.percentile(arr, 25))
print("75th percentile of arr : ", np.percentile(arr, 75))

>>> 50th percentile of arr :  5.0
>>> 25th percentile of arr :  2.5
>>> 75th percentile of arr :  7.5
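To show where these numbers come from, the sketch below re-derives the default result by hand for a one-dimensional, non-empty input. The function name linear_percentile is hypothetical, and the code is a simplified reading of the linear method, not NumPy's actual implementation.

```python
import math
import numpy as np

def linear_percentile(values, q):
    """Re-derive the default np.percentile result for a 1-D, non-empty input."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100    # fractional 0-based position
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)    # stay in range when q = 100
    frac = pos - lo
    # Weight the two neighbors by the fractional part of the position
    return s[lo] + frac * (s[hi] - s[lo])

arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manual = [linear_percentile(arr, q) for q in (25, 50, 75)]
library = [float(v) for v in np.percentile(arr, [25, 50, 75])]
print(manual, library)
```

For q = 25 the position is 2.5, halfway between the elements 2 and 3, which gives 2.5.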

Now, using scipy.stats, we can compute the percentile at which a particular value sits within a distribution of values. In this example, we want the percentile score of cur among the non-null values in the column ep_30.

# Keep only the non-null values of the column
non_nan = features['ep_30'].dropna()
# Last value of the column, selected by position
cur = features['ep_30'].iloc[-1]

print(f'''Cur is at the {round(stats.percentileofscore(non_nan, cur, kind='mean'), 2)}th percentile of the distribution.''')

Cur is at the 7.27th percentile of the distribution.
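The features frame is not defined in the article, so here is a self-contained sketch of the same steps on a small, hypothetical stand-in. The column name ep_30 is kept, and the values (including the NaN entries) are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the `features` frame used above
features = pd.DataFrame({"ep_30": [np.nan, 5.0, 1.0, 3.0, 2.0, 4.0, np.nan, 2.0]})

non_nan = features["ep_30"].dropna()
cur = features["ep_30"].iloc[-1]   # last value, selected by position
pct = float(stats.percentileofscore(non_nan, cur, kind="mean"))
print(f"Cur is at the {round(pct, 2)}th percentile of the distribution.")
```

With kind='mean', the score averages the 'strict' and 'weak' percentages: here one of the six non-null values is below 2 and three are at or below it, giving (16.67 + 50) / 2, about 33.33.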
