Pandas 时间序列处理

Pandas 时间序列处理
- 1 Python 的日期和时间处理
  - 1.1 常用模块
  - 1.2 字符串和 datetime 转换
    - datetime -> str
    - str -> datetime
- 2 Pandas 的时间处理及操作

1 Python 的日期和时间处理

1.1 常用模块

datetime time calendar

datetime，以毫秒形式存储日期和时间
datime.timedelta，表示两个 datetime 对象的时间差
datetime 模块中包含的数据类型

类型	说明
date	以公历形式存储日历日期（年、月、日）
time	将时间存储为时、分、秒、毫秒
datetime	存储日期和时间
timedelta	表示两个 datetime 值之间的差（日、秒、毫秒）

1.2 字符串和 datetime 转换

datetime -> str

str(datetime_obj)

dt_obj = datetime(2019, 8, 8)

str_obj = str(dt_obj)

print(type(str_obj))

print(str_obj)

<class 'str'>

2019-08-08 00:00:00

datetime.strftime()

str_obj2 = dt_obj.strftime('%d/%m/%Y')

print(str_obj2)

08/08/2019

str -> datetime

datetime.strptime()

需要指定时间表示的形式

dt_str = '2019-08-8'

dt_obj2 = datetime.strptime(dt_str, '%Y-%m-%d')

print(type(dt_obj2))

print(dt_obj2)

<class 'datetime.datetime'>

2019-08-08 00:00:00

dateutil.parser.parse()

可以解析大部分时间表示形式

from dateutil.parser import parse

dt_str2 = '8-08-2019'

dt_obj3 = parse(dt_str2)

print(type(dt_obj3))

print(dt_obj3)

<class 'datetime.datetime'>

2019-08-08 00:00:00

pd.to_datetime()

可以处理缺失值和空字符串

具体看这

2 Pandas 的时间处理及操作

2.1 创建与基础操作

基本类型，以时间戳为索引的 Series->Datetimelndex

指定 index 为 datetime 的 list

from datetime import datetime

import pandas as pd

import numpy as np

# 指定index为datetime的list

date_list = [datetime(2017, 2, 18), datetime(2017, 2, 19),

             datetime(2017, 2, 25), datetime(2017, 2, 26),

             datetime(2017, 3, 4), datetime(2017, 3, 5)]

time_s = pd.Series(np.random.randn(6), index=date_list)

print(time_s)

print(type(time_s.index))

2017-02-18 -0.230989

2017-02-19 -0.398082

2017-02-25 -0.309926

2017-02-26 -0.179672

2017-03-04 0.942698

2017-03-05 1.053092

dtype: float64

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

索引

索引位置

print(time_s[0])

-0.230988627576

索引值

print(time_s[datetime(2017, 2, 18)])

-0.230988627576

可以被解析的日期字符串

print(time_s['20170218'])

-0.230988627576

按“年份”、“月份”索引

print(time_s['2017-2'])

2017-02-18   -0.230989

2017-02-19   -0.398082

2017-02-25   -0.309926

2017-02-26   -0.179672

dtype: float64

切片操作

print(time_s['2017-2-26':])

2017-02-26   -0.179672

2017-03-04    0.942698

2017-03-05    1.053092

dtype: float64

过滤

过滤掉日期之前的

time_s.truncate(before='2017-2-25')

2017-02-25   -0.309926

2017-02-26   -0.179672

2017-03-04    0.942698

2017-03-05    1.053092

dtype: float64

过滤掉日期之后的

time_s.truncate(after='2017-2-25')

2017-02-18   -0.230989

2017-02-19   -0.398082

2017-02-25   -0.309926

dtype: float64

pd.date_range()

功能：生成日期范围

dates = pd.date_range('2017-02-18', # 起始日期

                      periods=5,    # 周期

                      freq='W-SAT') # 频率

print(dates)

print(pd.Series(np.random.randn(5), index=dates))

DatetimeIndex(['2017-02-18', '2017-02-25', '2017-03-04', '2017-03-11',

'2017-03-18'],

dtype='datetime64[ns]', freq='W-SAT')

2017-02-18 -1.680280

2017-02-25 0.908664

2017-03-04 0.145318

2017-03-11 -2.940363

2017-03-18 0.152681

Freq: W-SAT, dtype: float64

传入开始、结束日期，默认生成的该时间段的时间点是按天计算的

date_index = pd.date_range('2017/02/18', '2017/03/18')
只传入开始或结束日期，还需要传入时间段

print(pd.date_range(start='2017/02/18', periods=10, freq='4D'))

print(pd.date_range(end='2017/03/18', periods=10))
规范化时间戳

print(pd.date_range(start='2017/02/18 12:13:14', periods=10))

print(pd.date_range(start='2017/02/18 12:13:14', periods=10, normalize=True))



DatetimeIndex(['2017-02-18 12:13:14', '2017-02-19 12:13:14',

               '2017-02-20 12:13:14', '2017-02-21 12:13:14',

               '2017-02-22 12:13:14', '2017-02-23 12:13:14',

               '2017-02-24 12:13:14', '2017-02-25 12:13:14',

               '2017-02-26 12:13:14', '2017-02-27 12:13:14'],

              dtype='datetime64[ns]', freq='D')

DatetimeIndex(['2017-02-18', '2017-02-19', '2017-02-20', '2017-02-21',

               '2017-02-22', '2017-02-23', '2017-02-24', '2017-02-25',

               '2017-02-26', '2017-02-27'],

              dtype='datetime64[ns]', freq='D')

频率与偏移量

频率 Freq，由基础频率的倍数组成，基础频率包括：

1.BM:business end of month，每个月最后一个工作日

2.D：天，M：月等

偏移量，每个基础频率对应一个偏移量

1.偏移量通过加法连接

sum_offset = pd.tseries.offsets.Week(2) + pd.tseries.offsets.Hour(12)

print(sum_offset)

print(pd.date_range('2017/02/18', '2017/03/18', freq=sum_offset))

14 days 12:00:00

DatetimeIndex(['2017-02-18 00:00:00', '2017-03-04 12:00:00'], dtype='datetime64[ns]', freq='348H')

移动数据

沿时间轴将数据前移或后移，保持索引不变

ts = pd.Series(np.random.randn(5), index=pd.date_range('20170218', periods=5, freq='W-SAT'))

print(ts)

2017-02-18   -0.208622

2017-02-25    0.616093

2017-03-04   -0.424725

2017-03-11   -0.361475

2017-03-18    0.761274

Freq: W-SAT, dtype: float64

向后移动一位：print(ts.shift(1))

2017-02-18         NaN

2017-02-25   -0.208622

2017-03-04    0.616093

2017-03-11   -0.424725

2017-03-18   -0.361475

Freq: W-SAT, dtype: float64

pd.to_datetime()

功能：字符串转成时间格式

import pandas as pd

s_obj = pd.Series(['2017/02/18', '2017/02/19', '2017-02-25', '2017-02-26'], name='course_time')

s_obj2 = pd.to_datetime(s_obj)

print(s_obj2)

0 2017-02-18

1 2017-02-19

2 2017-02-25

3 2017-02-26

Name: course_time, dtype: datetime64[ns]

# 处理缺失值

s_obj3 = pd.Series(['2017/02/18', '2017/02/19', '2017-02-25', '2017-02-26'] + [None],

                   name='course_time')

print(s_obj3)

0    2017/02/18

1    2017/02/19

2    2017-02-25

3    2017-02-26

4          None

Name: course_time, dtype: object

时间周期计算

Period 类，通过字符串或整数及基础频率构造
Period 对象可进行数学运算，但要保证具有相同的基础频率
period_range，创建指定规则的时间周期范围，生成 Periodlndex 索引，可用于创建 Series 或 DataFrame
时间周期的频率转换，asfreq
- 如：年度周期->月度周期
按季度计算时间周期频率

2.2 时间数据重采样

重采样（resampling）

将时间序列从一个频率转换到另一个频率的过程，需要聚合
高频率->低频率，downsampling，相反为 upsampling
pandas 中的 resample 方法实现重采样
- 产生 Resampler 对象
- reample（freq）.sum0，resampe（freq）.mean）.…

import pandas as pd

import numpy as np

date_rng = pd.date_range('20170101', periods=100, freq='D')

ser_obj = pd.Series(range(len(date_rng)), index=date_rng)

# 统计每个月的数据总和

resample_month_sum = ser_obj.resample('M').sum()

# 统计每个月的数据平均

resample_month_mean = ser_obj.resample('M').mean()

print('按月求和：', resample_month_sum)

print('按月求均值：', resample_month_mean)

按月求和： 2017-01-31     465

2017-02-28    1246

2017-03-31    2294

2017-04-30     945

Freq: M, dtype: int32

按月求均值： 2017-01-31    15.0

2017-02-28    44.5

2017-03-31    74.0

2017-04-30    94.5

Freq: M, dtype: float64

降采样（downsampling）

将数据聚合到规整的低频率
OHLC重采样，open，high，low，close

# 将数据聚合到5天的频率

five_day_sum_sample = ser_obj.resample('5D').sum()

five_day_mean_sample = ser_obj.resample('5D').mean()

five_day_ohlc_sample = ser_obj.resample('5D').ohlc()

使用 groupby 降采样

使用函数对其进行分组操作

ser_obj.groupby(lambda x: x.month).sum()

ser_obj.groupby(lambda x: x.weekday).sum()

升采样（upsampling）

将数据从低频转到高频，需要插值，否则为 NaN (直接重采样会产生空值)
常用的插值方法

ffill(limit），空值取前面的值填充，limit 为填充个数

df.resample('D').ffill(2)
bfill（limit），空值取后面的值填充

df.resample('D').bfill()
fillna（fill'）或 fllna（‘bfill）

df.resample('D').fillna('ffill')
interpolate，根据插值算法补全数据

线性算法：df.resample('D').interpolate('linear')

具体可以参考：pandas.core.resample.Resampler.interpolate

2.3 滑动窗口

在时间窗口上计算各种统计函数

窗口函数（window functions）

滚动统计（rolling）

obj.rolling().func

import pandas as pd

import numpy as np

ser_obj = pd.Series(np.random.randn(1000),

                    index=pd.date_range('20170101', periods=1000))

ser_obj = ser_obj.cumsum()

r_obj = ser_obj.rolling(window=5)

print(r_obj)

Rolling [window=5,center=False,axis=0]

window

窗口大小
center

窗口是否居中统计

设置居中：

# 画图查看

import matplotlib.pyplot as plt

%matplotlib inline

plt.figure(figsize=(15, 5))

ser_obj.plot(style='r--')

ser_obj.rolling(window=10, center=True).mean().plot(style='b')

不设置居中：

ser_obj.rolling(window=10, center=False).mean().plot(style='b')

Pandas 时间序列处理的更多相关文章

Pandas时间序列
Pandas时间序列 pandas 提供了一组标准的时间序列处理工具和数据算法数据类型及操作 Python 标准库的 datetime datetime 模块中的 datetime. time. c ...
Python Pandas 时间序列双轴折线图
时间序列pv-gmv双轴折线图 import numpy as np import pandas as pd import matplotlib.pyplot as plt n = 12 date_s ...
pandas时间序列滑窗
时间序列数据统计-滑动窗口窗口函数 import pandas as pd import numpy as np ser_obj = pd.Series(np.random.randn(1000), ...
Pandas 时间序列
# 导入相关库 import numpy as np import pandas as pd 在做金融领域方面的分析时,经常会对时间进行一系列的处理.Pandas 内部自带了很多关于时间序列相关的工具 ...
Pandas时间序列和分组聚合
#时间序列import pandas as pd import numpy as np # 生成一段时间范围 ''' 该函数主要用于生成一个固定频率的时间索引,在调用构造方法时,必须指定start.e ...
pandas时间序列常用操作
目录一.时间序列是什么二.时间序列的选取三.时间序列的生成四.时间序列的偏移量五.时间前移或后移五.时区处理六.时期及算术运算七.频率转换一.时间序列是什么时间序列在多个时间点观察 ...
pandas时间序列学习笔记
目录创建一个时间序列 pd.date_range() info() asfred() shifted(),滞后函数 diff()求差分加减乘除 DataFrame.reindex() 通过data ...
Python——Pandas 时间序列数据处理
介绍 Pandas 是非常著名的开源数据处理库,我们可以通过它完成对数据集进行快速读取.转换.过滤.分析等一系列操作.同样,Pandas 已经被证明为是非常强大的用于处理时间序列数据的工具.本节将介绍 ...
pandas 时间序列resample
resample与groupby的区别:resample:在给定的时间单位内重取样groupby:对给定的数据条目进行统计函数原型:DataFrame.resample(rule, how=None ...

随机推荐

对webpack从零配置
一.新建配置文件,文件名一般为webpack.config.js: 二.配置文件目录,一般为根目录,一般会放在./build文件夹下三.配置文件格式一般为module.exports={}, 四.结 ...
POJ 2104 K-th Number ( 求取区间 K 大值 || 主席树 || 离线线段树)
题意 : 给出一个含有 N 个数的序列,然后有 M 次问询,每次问询包含 ( L, R, K ) 要求你给出 L 到 R 这个区间的第 K 大是几分析 : 求取区间 K 大值是个经典的问题,可以使用 ...
POJ 3280 Cheapest Palindrome ( 区间DP && 经典模型 )
题意 : 给出一个由 n 中字母组成的长度为 m 的串,给出 n 种字母添加和删除花费的代价,求让给出的串变成回文串的代价. 分析 : 原始模型 ==> 题意和本题差不多,有添和删但是并无代价 ...
常用的JAVA第三方工具类
转自:https://www.jianshu.com/u/9c5cb1ee4c46 一. org.apache.commons.io.IOUtils 注解说明 closeQuietly 关闭一个IO ...
D1. Kirk and a Binary String (easy version)
D1. Kirk and a Binary String (easy version) 01串找最长不降子序列给定字符串s,要求生成一个等长字符串t,使得任意l到r位置的最长不降子序列长度一致从后 ...
IDEA将maven项目配置到本地tomcat中运行
1. Run->Edit Configurations 2. 点开default,选择tomcatServer选择local 3. 点解server选项卡下的configure,配置本地tomc ...
Java数据结构与算法(2):栈
栈是一种线性表,特点在于它只能在一个位置上进行插入和删除,该位置是表的末端,叫做栈的顶(top).因此栈是后进先出的(FIFO).栈的基本操作有push.peek.pop. 栈的示意图进栈和出栈都只 ...
Hibernate入门简介
什么是Hibernate框架? Hibernate是一种ORM框架,全称为 Object_Relative DateBase-Mapping,在Java对象与关系数据库之间建立某种映射,以实现直接存取 ...
DAY 4 下午
一些图论的知识(主要补充一下之前不了解的和比较重要) 竞赛图:完全图上的边加方向仙人掌:每一条边至多属于一个环前序:中左右中序:左中右后序:左右中先加进去无向边把每一个联通块看成一个大点 ...
职位-CHO：CHO
ylbtech-职位-CHO:CHO 首席人力资源官(Chief Human Resource Officer),简称CHO.是具有制定集团化经营的公司人力资源的战略规划,并监督执行,负责建立畅通的沟 ...

Pandas 时间序列处理

Pandas 时间序列处理

1 Python 的日期和时间处理

1.1 常用模块

1.2 字符串和 datetime 转换

datetime -> str

str -> datetime

2 Pandas 的时间处理及操作

2.1 创建与基础操作

指定 index 为 datetime 的 list

索引

过滤

pd.date_range()

频率与偏移量

移动数据

pd.to_datetime()

时间周期计算

2.2 时间数据重采样

重采样（resampling）

降采样（downsampling）

升采样（upsampling）

2.3 滑动窗口

Pandas 时间序列处理的更多相关文章

随机推荐

热门专题