Python Data Analysis and Presentation: Week 3 Study Notes (Song Tian et al., Beijing Institute of Technology)

The introductory part of the course is almost over.

1. The Pandas library

import pandas as pd

Two data types: Series and DataFrame

The Series type: data + index

Custom index

b = pd.Series([9,8,7,6],index=['a','b','c','d'])

b

Out[3]:

a    9

b    8

c    7

d    6

dtype: int64

Creating from a scalar value

s = pd.Series(25,index=['a','b','c'])  # an index must be supplied: it determines how many times the scalar is repeated

s

Out[7]:

a    25

b    25

c    25

dtype: int64

Creating from a dict

d = pd.Series({'a':9,'b':8,'c':7})

d

Out[9]:

a    9

b    8

c    7

dtype: int64

Creating from an ndarray

import numpy as np

n = pd.Series(np.arange(5))

n

Out[12]:

0    0

1    1

2    2

3    3

4    4

dtype: int32

Basic operations

b = pd.Series([9,8,7,6],['a','b','c','d'])

b

Out[14]:

a    9

b    8

c    7

d    6

dtype: int64

b.index

Out[15]: Index(['a', 'b', 'c', 'd'], dtype='object')

b.values

Out[17]: array([9, 8, 7, 6], dtype=int64)

b.get('d',100)  # returns the default 100 only if label 'd' is missing

Out[18]: 6

Both a Series object and its index can have a name, stored in the .name attribute.
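As a small illustration of the .name attribute, the sketch below names a Series and its index. The names 'scores' and 'letter' are made up for this example and do not come from the course.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Both the Series and its Index carry an optional name.
b.name = 'scores'        # hypothetical name, chosen for illustration
b.index.name = 'letter'  # hypothetical name, chosen for illustration

print(b.name, b.index.name)
print(b)
```

The names show up when the Series is printed and become the column and index headers if the Series is later placed in a DataFrame.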

The DataFrame type: multiple columns of data sharing the same index

Creating from a two-dimensional ndarray

import pandas as pd

import numpy as np

d = pd.DataFrame(np.arange(10),reshape(2,5))

Traceback (most recent call last):

  File "<ipython-input-3-8f29c41caece>", line 1, in <module>

    d = pd.DataFrame(np.arange(10),reshape(2,5))

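NameError: name 'reshape' is not defined

The comma is a typo for a dot: reshape is a method of the array, so the call should read np.arange(10).reshape(2,5):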

d = pd.DataFrame(np.arange(10).reshape(2,5))

d

Out[5]:

   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

Creating from a dict of one-dimensional objects (here, Series)

dt = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}

d = pd.DataFrame(dt)

d

Out[11]:

   one  two

a  1.0    9

b  2.0    8

c  3.0    7

d  NaN    6

pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])

Out[13]:

   two three

b    8   NaN

c    7   NaN

d    6   NaN

Creating from a dict of lists

d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}

d = pd.DataFrame(d1,index=['a','b','c','d'])

d

Out[16]:

   one  two

a    1    9

b    2    8

c    3    7

d    4    6

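Operations on the data types

How can Series and DataFrame objects be changed?

Adding or reordering: reindexing

.reindex()

In the example below the column names are 城市 (city), 环比 (month-on-month), 同比 (year-on-year) and 定基 (fixed base).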

import pandas as pd

d1 = {'城市':['北京','上海','广州','深圳','沈阳'],

'环比':[101.5,101.2,101.3,102.0,100.1],

'同比':[101.5,101.2,101.3,102.0,100.1],

'定基':[101.5,101.2,101.3,102.0,100.1]}

d = pd.DataFrame(d1,index=[1,2,3,4,5])

d

Out[4]:

      同比  城市     定基     环比

1  101.5  北京  101.5  101.5

2  101.2  上海  101.2  101.2

3  101.3  广州  101.3  101.3

4  102.0  深圳  102.0  102.0

5  100.1  沈阳  100.1  100.1

d = d.reindex(index=[5,4,3,2,1])

d

Out[6]:

      同比  城市     定基     环比

5  100.1  沈阳  100.1  100.1

4  102.0  深圳  102.0  102.0

3  101.3  广州  101.3  101.3

2  101.2  上海  101.2  101.2

1  101.5  北京  101.5  101.5

d = d.reindex(columns=['城市','同比','环比','定基'])

d

Out[8]:

   城市     同比     环比     定基

5  沈阳  100.1  100.1  100.1

4  深圳  102.0  102.0  102.0

3  广州  101.3  101.3  101.3

2  上海  101.2  101.2  101.2

1  北京  101.5  101.5  101.5

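Other parameters:

fill_value: the value used to fill missing positions created by reindexing

method: the fill method; ffill fills forward from the previous value, bfill fills backward from the next value

limit: the maximum number of consecutive positions to fill

copy: defaults to True, which produces a new object; with False, no copy is made when the new and old indexes are equal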
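The parameters above can be tried on a small Series. This is a minimal sketch with made-up data, not an example from the course; it shows fill_value, method='ffill' and limit side by side.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[1, 3, 5])

# fill_value: labels missing from the original get a constant.
filled = s.reindex([1, 2, 3, 4, 5], fill_value=0)

# method='ffill': each missing label takes the value of the nearest
# earlier label (this needs a monotonic index).
forward = s.reindex([1, 2, 3, 4, 5], method='ffill')

# limit caps how many consecutive labels one value may fill;
# label 7 is the second gap after 5, so it stays NaN.
limited = s.reindex([1, 2, 3, 4, 5, 6, 7], method='ffill', limit=1)

print(filled.tolist())   # [10, 0, 20, 0, 30]
print(forward.tolist())  # [10, 10, 20, 20, 30]
print(limited)
```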

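Common methods of the Index type:

.append(idx): concatenates another Index object, producing a new Index object

.difference(idx): computes the set difference, producing a new Index object (named .diff in early pandas versions)

.intersection(idx): computes the intersection

.union(idx): computes the union

.delete(loc): removes the element at position loc

.insert(loc,e): inserts element e at position loc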
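A quick way to get a feel for these methods is to run them on two small Index objects. The sketch below uses made-up labels; every call returns a new Index and leaves the original untouched.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Index objects are immutable, so each call builds a new Index.
print(left.append(right).tolist())        # ['a', 'b', 'c', 'b', 'c', 'd']
print(left.difference(right).tolist())    # ['a']
print(left.intersection(right).tolist())  # ['b', 'c']
print(left.union(right).tolist())         # ['a', 'b', 'c', 'd']
print(left.delete(0).tolist())            # ['b', 'c']
print(left.insert(1, 'x').tolist())       # ['a', 'x', 'b', 'c']
```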

nc = d.columns.delete(2)

ni = d.index.insert(5,6)

nd = d.reindex(index=ni,columns=nc,method='ffill')

Traceback (most recent call last):

  File "<ipython-input-11-ba08f80a2d41>", line 1, in <module>

    nd = d.reindex(index=ni,columns=nc,method='ffill')

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2831, in reindex

    **kwargs)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2404, in reindex

    fill_value, copy).__finalize__(self)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2772, in _reindex_axes

    fill_value, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2794, in _reindex_columns

    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2833, in reindex

    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2538, in get_indexer

    indexer = self._get_fill_indexer(target, method, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2564, in _get_fill_indexer

    limit)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2585, in _get_fill_indexer_searchsorted

    side)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3394, in _searchsorted_monotonic

    raise ValueError('index must be monotonic increasing or decreasing')

ValueError: index must be monotonic increasing or decreasing

ni = d.index.insert(5,0)

nd = d.reindex(index=ni,columns=nc,method='ffill')

ValueError: index must be monotonic increasing or decreasing

The traceback is identical to the previous one. It passes through _reindex_columns, so the complaint is about the column labels: method='ffill' is applied on both axes, and the columns are not in sorted order.

nd = d.reindex(index=ni,columns=nc).ffill()

nd

Out[15]:

   城市     同比     定基

5  沈阳  100.1  100.1

4  深圳  102.0  102.0

3  广州  101.3  101.3

2  上海  101.2  101.2

1  北京  101.5  101.5

0  北京  101.5  101.5

ValueError: index must be monotonic increasing or decreasing

Workaround: as in the code above, call reindex without method and then apply .ffill() to the result.
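The failure and the workaround can be reproduced on a smaller frame. This is a sketch under assumptions: the frame and its English column names are hypothetical, and the ValueError is what I would expect from the pandas versions I know, so the code catches it rather than relying on it.

```python
import pandas as pd

# Hypothetical frame: index in descending order, columns deliberately
# not in sorted order ('city', 'yoy', 'base').
d = pd.DataFrame({'city': ['A', 'B'], 'yoy': [1.5, 1.2], 'base': [2.0, 2.1]},
                 index=[2, 1])
ni = d.index.insert(2, 0)   # labels 2, 1, 0
nc = d.columns.delete(1)    # labels 'city', 'base'

try:
    # method='ffill' is applied to both axes, so the unsorted columns fail.
    d.reindex(index=ni, columns=nc, method='ffill')
    failed = False
except ValueError as exc:
    failed = True
    print('reindex failed:', exc)

# Workaround from the notes: reindex without method, then fill forward.
nd = d.reindex(index=ni, columns=nc).ffill()
print(nd)
```

Row 0 does not exist in the original frame, so after the plain reindex it is all NaN, and .ffill() then copies the values from row 1 directly above it.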

Deleting: drop

a = pd.Series([9,8,7,6],index=['a','b','c','d'])

a

Out[17]:

a    9

b    8

c    7

d    6

dtype: int64

a.drop(['b','c'])

Out[18]:

a    9

d    6

dtype: int64
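drop works on a DataFrame as well. The sketch below, with made-up labels, drops a row by its index label and then two columns by passing axis=1.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['a', 'b', 'c'], columns=['w', 'x', 'y', 'z'])

rows_dropped = df.drop('b')                 # axis=0 is the default: row labels
cols_dropped = df.drop(['x', 'z'], axis=1)  # axis=1: column labels

# drop returns a new object; df itself is unchanged.
print(rows_dropped.index.tolist())    # ['a', 'c']
print(cols_dropped.columns.tolist())  # ['w', 'y']
print(df.shape)                       # (3, 4)
```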

Arithmetic on pandas data types:

import pandas as pd

import numpy as np

a = pd.DataFrame(np.arange(12),reshape(3,4))

Traceback (most recent call last):

  File "<ipython-input-21-a8c747b1897a>", line 1, in <module>

    a = pd.DataFrame(np.arange(12),reshape(3,4))

NameError: name 'reshape' is not defined

Again a comma typed in place of a dot before reshape; the corrected call:

a = pd.DataFrame(np.arange(12).reshape(3,4))

a

Out[23]:

   0  1   2   3

0  0  1   2   3

1  4  5   6   7

2  8  9  10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

b

Out[25]:

    0   1   2   3   4

0   0   1   2   3   4

1   5   6   7   8   9

2  10  11  12  13  14

3  15  16  17  18  19

a+b

Out[26]:

      0     1     2     3   4

0   0.0   2.0   4.0   6.0 NaN

1   9.0  11.0  13.0  15.0 NaN

2  18.0  20.0  22.0  24.0 NaN

3   NaN   NaN   NaN   NaN NaN

b.add(a,fill_value = 0)

Out[27]:

      0     1     2     3     4

0   0.0   2.0   4.0   6.0   4.0

1   9.0  11.0  13.0  15.0   9.0

2  18.0  20.0  22.0  24.0  14.0

3  15.0  16.0  17.0  18.0  19.0

a.mul(b,fill_value = 0)

Out[28]:

      0     1      2      3    4

0   0.0   1.0    4.0    9.0  0.0

1  20.0  30.0   42.0   56.0  0.0

2  80.0  99.0  120.0  143.0  0.0

3   0.0   0.0    0.0    0.0  0.0
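The effect of fill_value is easier to see on two short Series whose labels only partly overlap. This is a minimal sketch with made-up values; .sub and .div are the subtraction and division counterparts of .add and .mul.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
y = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

plain = x + y                    # a label present on one side only gives NaN
filled = x.add(y, fill_value=0)  # the missing side is treated as 0 first
diff = x.sub(y, fill_value=0)
ratio = x.div(y, fill_value=1)

print(plain.tolist())   # [nan, 12.0, 23.0, nan]
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]
print(diff.tolist())    # [1.0, -8.0, -17.0, -30.0]
print(ratio)
```

The fill value is substituted before the operation, which is why multiplying with fill_value=0 in the session above gives 0.0 rather than the original value wherever only one frame has data.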

Operations between objects of different dimensions are broadcast:

b = pd.DataFrame(np.arange(20).reshape(4,5))

b

Out[31]:

    0   1   2   3   4

0   0   1   2   3   4

1   5   6   7   8   9

2  10  11  12  13  14

3  15  16  17  18  19

c =pd.Series(np.arange(4))

c

Out[33]:

0    0

1    1

2    2

3    3

dtype: int32

c-10

Out[34]:

0   -10

1    -9

2    -8

3    -7

dtype: int32

b-c

Out[35]:

      0     1     2     3   4

0   0.0   0.0   0.0   0.0 NaN

1   5.0   5.0   5.0   5.0 NaN

2  10.0  10.0  10.0  10.0 NaN

3  15.0  15.0  15.0  15.0 NaN

b.sub(c,axis=0)

Out[36]:

    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16
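The difference between the two results comes down to which labels the Series is aligned with. The sketch below repeats the same b and c and annotates both cases.

```python
import pandas as pd
import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# Default (axis=1, i.e. 'columns'): c's labels 0..3 are matched against
# b's column labels, and the same Series is subtracted from every row.
by_columns = b - c

# axis=0: c's labels are matched against b's row labels instead, so
# c[i] is subtracted from every element of row i.
by_rows = b.sub(c, axis=0)

print(by_columns[4].isna().all())  # True: column 4 has no partner in c
print(by_rows.loc[3].tolist())     # [12, 13, 14, 15, 16]
```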

Sorting:

.sort_index() sorts by index labels along the given axis, ascending by default.

.sort_index(axis=0,ascending=True)

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])

b

Out[4]:

    0   1   2   3   4

c   0   1   2   3   4

a   5   6   7   8   9

d  10  11  12  13  14

b  15  16  17  18  19

b.sort_index()

Out[5]:

    0   1   2   3   4

a   5   6   7   8   9

b  15  16  17  18  19

c   0   1   2   3   4

d  10  11  12  13  14

b.sort_index(ascending=False)

Out[6]:

    0   1   2   3   4

d  10  11  12  13  14

c   0   1   2   3   4

b  15  16  17  18  19

a   5   6   7   8   9

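.sort_values() sorts by value along the given axis, ascending by default.

Series.sort_values(axis=0,ascending=True)

DataFrame.sort_values(by,axis=0,ascending=True)

by: a label or list of labels on the given axis to sort by

NaN values are placed at the end of the result regardless of sort direction (the default, na_position='last').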
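The notes give no session for sort_values, so here is a minimal sketch with made-up data. It shows the NaN placement for a Series and a two-column sort for a DataFrame.

```python
import pandas as pd
import numpy as np

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=['a', 'b', 'c', 'd'])
# NaN (label 'b') ends up last in both directions.
print(s.sort_values().index.tolist())                 # ['c', 'd', 'a', 'b']
print(s.sort_values(ascending=False).index.tolist())  # ['a', 'd', 'c', 'b']

df = pd.DataFrame({'x': [2, 1, 2], 'y': [5, 6, 4]}, index=['r0', 'r1', 'r2'])
# Sort rows by column 'x', breaking ties with column 'y'.
by_xy = df.sort_values(by=['x', 'y'])
print(by_xy.index.tolist())                           # ['r1', 'r2', 'r0']
```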

Basic statistical analysis:

.describe()

a = pd.Series([9,8,7,6])

a

Out[8]:

0    9

1    8

2    7

3    6

dtype: int64

a.describe()

Out[9]:

count    4.000000

mean     7.500000

std      1.290994

min      6.000000

25%      6.750000

50%      7.500000

75%      8.250000

max      9.000000

dtype: float64

a.describe()['count']

Out[10]: 4.0

b.describe()

Out[11]:

               0          1          2          3          4

count   4.000000   4.000000   4.000000   4.000000   4.000000

mean    7.500000   8.500000   9.500000  10.500000  11.500000

std     6.454972   6.454972   6.454972   6.454972   6.454972

min     0.000000   1.000000   2.000000   3.000000   4.000000

25%     3.750000   4.750000   5.750000   6.750000   7.750000

50%     7.500000   8.500000   9.500000  10.500000  11.500000

75%    11.250000  12.250000  13.250000  14.250000  15.250000

max    15.000000  16.000000  17.000000  18.000000  19.000000

b.describe()[2]

Out[12]:

count     4.000000

mean      9.500000

std       6.454972

min       2.000000

25%       5.750000

50%       9.500000

75%      13.250000

max      17.000000

Name: 2, dtype: float64

Cumulative statistics:

.cumsum(): running sum of the first 1, 2, ..., n values

.cumprod(): running product

.cummax(): running maximum

.cummin(): running minimum

b.cumsum()

Out[13]:

    0   1   2   3   4

c   0   1   2   3   4

a   5   7   9  11  13

d  15  18  21  24  27

b  30  34  38  42  46
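Only cumsum is demonstrated above. The sketch below runs all four cumulative methods on one short, made-up Series so the results can be compared line by line.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

print(s.cumsum().tolist())   # [3, 4, 8, 9, 14]
print(s.cumprod().tolist())  # [3, 3, 12, 12, 60]
print(s.cummax().tolist())   # [3, 3, 4, 4, 5]
print(s.cummin().tolist())   # [3, 1, 1, 1, 1]
```

On a DataFrame the same methods run down each column by default, as in the b.cumsum() output above.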

Rolling computations

.rolling(w).sum(): sum of each window of w adjacent elements

.rolling(w).mean(): arithmetic mean

.rolling(w).var(): variance

.rolling(w).std(): standard deviation

.rolling(w).min() .max(): minimum and maximum

b.rolling(2).sum()

Out[14]:

      0     1     2     3     4

c   NaN   NaN   NaN   NaN   NaN

a   5.0   7.0   9.0  11.0  13.0

d  15.0  17.0  19.0  21.0  23.0

b  25.0  27.0  29.0  31.0  33.0
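The first row is NaN because a window of 2 is not yet full there. The sketch below shows this on a made-up Series, and also uses the min_periods parameter, which is not covered in these notes, to allow partial windows.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])
w = s.rolling(2)  # window of 2 adjacent elements

sums = w.sum()    # NaN, 3, 6, 12
means = w.mean()  # NaN, 1.5, 3, 6
highs = w.max()   # NaN, 2, 4, 8

# min_periods=1 lets an incomplete leading window produce a value.
partial = s.rolling(2, min_periods=1).sum()
print(sums.tolist())
print(partial.tolist())  # [1.0, 3.0, 6.0, 12.0]
```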
