Pandas: Descriptive Statistics Examples

Overview

Notebook: https://nbviewer.jupyter.org/github/chenjieyouge/jupyter_share/blob/master/share/pandas- 描述性统计.ipynb
import numpy as np
import pandas as pd
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:
df = pd.DataFrame([
[1.4, np.nan],
[7.6, -4.5],
[np.nan, np.nan],
[3, -1.5]
],
index=list('abcd'), columns=['one', 'two'])
df
| | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 7.6 | -4.5 |
| c | NaN | NaN |
| d | 3.0 | -1.5 |
Calling DataFrame's sum method returns a Series containing column sums:
# Default axis=0 reduces down the rows, yielding one value per column
df.sum()
# NaN values are not counted as part of the sample when computing the mean
df.mean()

one    12.0
two    -6.0
dtype: float64

one    4.0
two    -3.0
dtype: float64
Passing axis='columns' or axis=1 sums across the columns instead:

# axis=1 reduces across the columns, yielding one value per row
df.sum(axis=1)

a    1.4
b    3.1
c    0.0
d    1.5
dtype: float64
NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option:

# Missing values are skipped by default; pass skipna=False to keep them
df.mean(skipna=False, axis='columns')

a     NaN
b    1.55
c     NaN
d    0.75
dtype: float64
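As a quick sanity check on skipna, the same behavior holds for sum on a single column; a minimal sketch (the small df from above is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.6, -4.5], [np.nan, np.nan], [3, -1.5]],
                  index=list('abcd'), columns=['one', 'two'])

# Default: NaN values are dropped before summing
total = df['one'].sum()                 # 1.4 + 7.6 + 3.0 = 12.0
# skipna=False: any NaN in the column propagates into the result
total_na = df['one'].sum(skipna=False)
print(total, total_na)
```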
See Table 5-7 for a list of common options for each reduction method.

| Method | Description |
|---|---|
| axis | Axis to reduce over; 0 for the DataFrame's rows and 1 for its columns |
| skipna | Exclude missing values; True by default |
| level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
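The level option dates from older pandas; in recent releases the same level-wise reduction is spelled with groupby. A minimal sketch on a hypothetical two-level index:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1), ('y', 2)],
                                names=['outer', 'inner'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Reduce within each value of the outer level; equivalent to the
# older s.sum(level='outer') spelling
by_outer = s.groupby(level='outer').sum()
print(by_outer)
```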
Some methods, like idxmax and idxmin, return indirect statistics, such as the index label where the minimum or maximum value is attained.
# idxmax() returns the index label of the first maximum
df.idxmax()

one    b
two    d
dtype: object
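idxmin is the mirror image, returning the label of the first minimum; a quick sketch (rebuilding the same df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.6, -4.5], [np.nan, np.nan], [3, -1.5]],
                  index=list('abcd'), columns=['one', 'two'])

# Index label of the first minimum in each column; NaN is skipped
lows = df.idxmin()
print(lows)
```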
Other methods are accumulations:

# Cumulative sum; default axis=0 accumulates down the rows, skipping NA
df.cumsum()
# axis=1 accumulates across the columns instead
df.cumsum(axis=1)
| | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 9.0 | -4.5 |
| c | NaN | NaN |
| d | 12.0 | -6.0 |
With axis=1, accumulating across the columns:
| | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 7.6 | 3.1 |
| c | NaN | NaN |
| d | 3.0 | 1.5 |
Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot:

# describe() reports count, mean, std, quantiles, and other common statistics per column
# round(2) keeps two decimal places
df.describe().round(2)
| | one | two |
|---|---|---|
| count | 3.00 | 2.00 |
| mean | 4.00 | -3.00 |
| std | 3.22 | 2.12 |
| min | 1.40 | -4.50 |
| 25% | 2.20 | -3.75 |
| 50% | 3.00 | -3.00 |
| 75% | 5.30 | -2.25 |
| max | 7.60 | -1.50 |
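describe also accepts a percentiles argument to control which quantiles are reported (the median is always included); a small sketch on the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.6, -4.5], [np.nan, np.nan], [3, -1.5]],
                  index=list('abcd'), columns=['one', 'two'])

# Request the 10% and 90% quantiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.9]).round(2)
print(summary.index.tolist())
```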
On non-numeric data, describe produces alternative summary statistics:

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

# For non-numeric data, describe() reports count, unique, top, and freq
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
See Table 5-8 for a full list of summary statistics and related methods.

| Method | Description |
|---|---|
| count | Number of non-NA values |
| describe | Summary statistics for a Series or each DataFrame column |
| min, max | Minimum and maximum values |
| argmin, argmax | Integer positions at which the minimum or maximum is attained |
| idxmin, idxmax | Index labels at which the minimum or maximum is attained |
| quantile | Sample quantile |
| sum | Sum of values |
| mean | Mean of values |
| median | Median (50% quantile) of values |
| var | Sample variance |
| std | Sample standard deviation |
| skew | Sample skewness |
| kurt | Sample kurtosis |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
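Two entries from the table, diff and pct_change, deserve a quick illustration, since they return an object of the same size rather than a single value; a sketch on a toy price series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 121.0])

# diff: arithmetic difference between consecutive elements (first is NaN)
deltas = prices.diff()
# pct_change: relative change between consecutive elements (first is NaN)
changes = prices.pct_change()
print(deltas.tolist(), changes.tolist())
```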
df.idxmax()
one b
two d
dtype: object
df['one'].argmax()
c:\python\python36\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
"""Entry point for launching an IPython kernel.
'b'
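The warning above is about labels versus positions: idxmax returns a label, while the integer position of the maximum now comes from the underlying NumPy array, as the FutureWarning suggests. A sketch on a small series without missing values:

```python
import pandas as pd

s = pd.Series([1.4, 7.6, 3.0], index=list('abd'))

# Label of the maximum
label = s.idxmax()
# Integer position of the maximum, via the underlying array
pos = s.to_numpy().argmax()
print(label, pos)
```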
Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don't have it installed already, it can be obtained via conda or pip:

(conda) pip install pandas-datareader

I use the pandas_datareader module to download some data for a few stock tickers:
import pandas_datareader.data as web

# Dict comprehension (requires network access; we load saved data below instead)
# all_data = {ticker: web.get_data_yahoo(ticker)
#             for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

# Binary data: read with read_pickle(), save with to_pickle()
returns = pd.read_pickle("../examples/yahoo_volume.pkl")
returns.tail()
| Date | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| 2016-10-17 | 23624900 | 1089500 | 5890400 | 23830000 |
| 2016-10-18 | 24553500 | 1995600 | 12770600 | 19149500 |
| 2016-10-19 | 20034600 | 116600 | 4632900 | 22878400 |
| 2016-10-20 | 24125800 | 1734200 | 4023100 | 49455600 |
| 2016-10-21 | 22384800 | 1260500 | 4401900 | 79974200 |
The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:
returns.describe()
| | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| count | 1.714000e+03 | 1.714000e+03 | 1.714000e+03 | 1.714000e+03 |
| mean | 9.595085e+07 | 4.111642e+06 | 4.815604e+06 | 4.630359e+07 |
| std | 6.010914e+07 | 2.948526e+06 | 2.345484e+06 | 2.437393e+07 |
| min | 1.304640e+07 | 7.900000e+03 | 1.415800e+06 | 9.009100e+06 |
| 25% | 5.088832e+07 | 1.950025e+06 | 3.337950e+06 | 3.008798e+07 |
| 50% | 8.270255e+07 | 3.710000e+06 | 4.216750e+06 | 4.146035e+07 |
| 75% | 1.235752e+08 | 5.243550e+06 | 5.520500e+06 | 5.558810e+07 |
| max | 4.702495e+08 | 2.976060e+07 | 2.341650e+07 | 3.193179e+08 |
"Correlation between MSFT and IBM: {}".format(returns['MSFT'].corr(returns['IBM']))
"Covariance between MSFT and IBM: {}".format(returns['MSFT'].cov(returns['IBM']))

'Correlation between MSFT and IBM: 0.42589249800808743'
'Covariance between MSFT and IBM: 24347708920434.156'
Since MSFT is a valid Python attribute, we can also select these columns using more concise syntax:

# Columns can be selected as attributes: df.col_name
returns.MSFT.corr(returns.IBM)

0.42589249800808743
DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

# DF.corr() returns the full correlation matrix
returns.corr()
# DF.cov() returns the full covariance matrix
returns.cov()
| | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 1.000000 | 0.576030 | 0.383942 | 0.490353 |
| GOOG | 0.576030 | 1.000000 | 0.438424 | 0.490446 |
| IBM | 0.383942 | 0.438424 | 1.000000 | 0.425892 |
| MSFT | 0.490353 | 0.490446 | 0.425892 | 1.000000 |
The covariance matrix:
| | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 3.613108e+15 | 1.020917e+14 | 5.413005e+13 | 7.184135e+14 |
| GOOG | 1.020917e+14 | 8.693806e+12 | 3.032022e+12 | 3.524694e+13 |
| IBM | 5.413005e+13 | 3.032022e+12 | 5.501297e+12 | 2.434771e+13 |
| MSFT | 7.184135e+14 | 3.524694e+13 | 2.434771e+13 | 5.940884e+14 |
Using the DataFrame's corrwith method, you can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

# corrwith() computes pairwise correlations, here of IBM with every column
returns.corrwith(returns.IBM)
AAPL 0.383942
GOOG 0.438424
IBM 1.000000
MSFT 0.425892
dtype: float64
returns.corrwith(returns)
AAPL 1.0
GOOG 1.0
IBM 1.0
MSFT 1.0
dtype: float64
Passing axis='columns' does things row-by-row instead. In all cases, the data points are aligned by label before the correlation is computed.
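A quick sketch of the row-by-row variant, using a synthetic frame where every row of other is an exact linear function of the corresponding row of frame, so each row correlation should be 1.0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((6, 3)), columns=['a', 'b', 'c'])
other = frame * 2 + 1  # exact linear relation, row by row

# axis='columns' computes one correlation per row
row_corr = frame.corrwith(other, axis='columns')
print(row_corr)
```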
Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique() returns the distinct values
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)
The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies:
# value_counts() computes value frequencies
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64
The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with any array or sequence:
# Sorted in descending order of frequency by default
pd.value_counts(obj.values)
# Pass sort=False to skip the sorting
pd.value_counts(obj.values, sort=False)

a    3
c    3
b    2
d    1
dtype: int64

c    3
b    2
d    1
a    3
dtype: int64
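value_counts can also report relative frequencies instead of raw counts via normalize=True; a sketch on the same obj:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# normalize=True divides each count by the total number of values
freq = obj.value_counts(normalize=True)
print(freq)
```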
isin performs a vectorized set-membership check and can be useful in filtering a dataset down to a subset of values in a Series or a column in a DataFrame:
obj
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
mask = obj.isin(['b', 'c'])
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
# The boolean mask selects the rows where it is True
obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
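isin also works element-wise on an entire DataFrame, which is handy for filtering several columns at once; a minimal sketch on a hypothetical frame:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3]})

# Element-wise membership test over the whole frame
mask = data.isin([1, 3])
print(mask)
```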
Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly non-distinct values into another array of distinct values:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# get_indexer maps each value in to_match to its position in unique_vals
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)
See Table 5-9 for a reference on these methods.

| Method | Description |
|---|---|
| isin | Compute a boolean array indicating whether each value is contained in the passed sequence of values |
| match | Compute integer indices of each value into an array of distinct values; useful for data alignment |
| unique | Compute the array of distinct values, in the order observed |
| value_counts | Compute value frequencies, in descending order by default |
In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here's an example:
data = pd.DataFrame({
'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})
data
| | Qu1 | Qu2 | Qu3 |
|---|---|---|---|
| 0 | 1 | 2 | 1 |
| 1 | 3 | 3 | 5 |
| 2 | 4 | 1 | 2 |
| 3 | 3 | 2 | 4 |
| 4 | 4 | 3 | 4 |
Passing pandas.value_counts to this DataFrame's apply function computes the per-column value counts, with absent values filled with 0:
result = data.apply(pd.value_counts).fillna(0)
result
| | Qu1 | Qu2 | Qu3 |
|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 2.0 | 1.0 |
| 3 | 2.0 | 2.0 | 0.0 |
| 4 | 2.0 | 0.0 | 2.0 |
| 5 | 0.0 | 0.0 | 1.0 |
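Note that newer pandas releases deprecate the top-level pd.value_counts function; the same per-column histogram can be written with the Series method instead, a sketch:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Same result via the Series method; row labels are the union of observed values
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```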
Here, the row labels in the result are the distinct values occurring across all of the columns, and the values are the respective counts of each value within each column.
Conclusion

In the next chapter, we will discuss tools for reading (or loading) and writing datasets with pandas. After that, we will dig deeper into data cleaning, wrangling, analysis, and visualization tools using pandas.