[Pandas] 04 - Efficient I/O

SQLITE3接口 to Arrary

——从数据库加载数据到dataframe/numpy中。

调动 SQLITE3数据库

import sqlite3 as sq3

query = 'CREATE TABLE numbs (Date date, No1 real, No2 real)'

con = sq3.connect(path + 'numbs.db')

con.execute(query)

con.commit()

commit 命令

COMMIT 命令是用于把事务调用的更改保存到数据库中的事务命令。

COMMIT 命令把自上次 COMMIT 或 ROLLBACK 命令以来的所有事务保存到数据库

返回值处理

返回所有值，就用 fetchall()。

con.execute('SELECT * FROM numbs').fetchmany(10)

pointer = con.execute('SELECT * FROM numbs')

for i in range(3):

    print(pointer.fetchone())



Output:
-------------------------------------------------

('2017-11-18 11:18:51.443295', 0.12, 7.3)

('2017-11-18 11:18:51.466328', 0.9791, -0.01914)

('2017-11-18 11:18:51.466580', -0.88736, 0.19104)

保存到NumPy

第一步、通过初始化直接格式变换即可。

query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'

res = np.array( con.execute(query).fetchall() ).round(3)

第二步、可视化数据 by resampling，也就是少取一些点。

res = res[::100]  # every 100th result

import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(res[:, 0], res[:, 1], 'ro')

plt.grid(True); 
plt.xlim(-0.5, 4.5); 
plt.ylim(-4.5, 0.5)

# tag: scatter_query

# title: Plot of the query result

# size: 60

SQLITE3接口 to DataFrame

读取整个表

一张表通常内存可以搞定，全部读取也不是避讳的事情。

import sqlite3 as sq3

filename = path + 'numbs'

con = sq3.Connection(filename + '.db')    

%time data = pd.read_sql('SELECT * FROM numbers', con)

data.head()

表操作

其实已经演变为 ndarray操作。

“与” 条件

%time data[(data['No1'] > 0) & (data['No2'] < 0)].head()

“或” 条件

%%time

res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))

                     & ((data['No2'] < -1) | (data['No2'] > 1))]

PyTable的快速I/O

HDF5数据库/文件标准。

"无压缩" 创建一个大表

表定义

import numpy as np

import tables as tb

import datetime as dt

import matplotlib.pyplot as plt

%matplotlib inline

filename = './data/tab.h5'

h5 = tb.open_file(filename, 'w') 

# 有几行：多搞几行，弄一个大表

rows = 2000000

# 有几列

row_des = {

    'Date': tb.StringCol(26, pos=1),

    'No1': tb.IntCol(pos=2),

    'No2': tb.IntCol(pos=3),

    'No3': tb.Float64Col(pos=4),

    'No4': tb.Float64Col(pos=5)

}

创建表

filters = tb.Filters(complevel=0)  # no compression

tab = h5.create_table('/', 'ints_floats', row_des,

                      title='Integers and Floats',

                      expectedrows=rows, filters=filters)

新增数据

此时，表还在内存中，向这个表内添加数据。

(1) 一个关键的列表形式。

pointer = tab.row

(2) 生成随机数填充。

ran_int = np.random.randint(0, 10000, size=(rows, 2))

ran_flo = np.random.standard_normal((rows, 2)).round(5)

(3) 赋值给内存中的表。

传统策略，使用了繁琐的循环。

%%time

for i in range(rows):

    pointer['Date'] = dt.datetime.now()

    pointer['No1']  = ran_int[i, 0]

    pointer['No2']  = ran_int[i, 1]

    pointer['No3']  = ran_flo[i, 0]

    pointer['No4']  = ran_flo[i, 1]

    pointer.append()

      # this appends the data and

      # moves the pointer one row forward


tab.flush() 　　# 相当于SQLITE3中的commit命令

矩阵策略，省掉了循环。

%%time

sarray['Date'] = dt.datetime.now()

sarray['No1'] = ran_int[:, 0]

sarray['No2'] = ran_int[:, 1]

sarray['No3'] = ran_flo[:, 0]

sarray['No4'] = ran_flo[:, 1]

“压缩” 创建一个大表

创建压缩表

因rows中其实已经有了数据，所以创建的同时就同步写入文件。

filename = './data/tab.h5c'

h5c = tb.open_file(filename, 'w')

filters = tb.Filters(complevel=4, complib='blosc')

tabc = h5c.create_table('/', 'ints_floats', sarray,

                        title='Integers and Floats',

                      expectedrows=rows, filters=filters)

dnarray读取

读取内存数据，返回 numpy.ndarray。

%time arr_com = tabc.read()

h5c.close()

内存外计算

比如，处理一个若干GB的数组。

创建一个外存数组 EArray

filename = './data/array.h5'

h5 = tb.open_file(filename, 'w') 

n = 100

ear = h5.create_earray(h5.root, 'ear',

                      atom=tb.Float64Atom(),

                      shape=(0, n))

%%time

rand = np.random.standard_normal((n, n))

for i in range(750):

    ear.append(rand)

ear.flush()

ear.size_on_disk　　# 查看一下，这个E Array是个大数组

创建一个对应的 EArray

第一步、设置外存 workspace。

out = h5.create_earray(h5.root, 'out', atom=tb.Float64Atom(), shape=(0, n))

第二步、通过外存来计算ear大数组。

expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))')　　　　# 这里是 import tables as tb 中的 Expr，而不是import numexpr as ne

  # the numerical expression as a string object

expr.set_output(out, append_mode=True)

  # target to store results is disk-based array

%time expr.eval()

# evaluation of the numerical expression

# and storage of results in disk-based array

第三步、从外存读入内存，传的自然是“变量“，而非”workspace"。

%time imarray = ear.read()

  # read whole array into memory

End.

[Pandas] 04 - Efficient I/O的更多相关文章

Pandas | 04 Panel 面板
面板(Panel)是3D容器的数据.面板数据一词来源于计量经济学,部分源于名称:Pandas - pan(el)-da(ta)-s. 3轴(axis)这个名称旨在给出描述涉及面板数据的操作的一些语义. ...
[AI] 深度数据 - Data
Data Engineering Data Pipeline Outline [DE] How to learn Big Data[了解大数据] [DE] Pipeline for Data Eng ...
Ubuntu下安装python相关数据处理
01. Ubuntu下安装ipython sudo apt-get install ipython 02. Ubuntu下安装pip $ sudo apt-get install python-pip ...
数据分析04 /基于pandas的DateFrame进行股票分析、双均线策略制定
数据分析04 /基于pandas的DateFrame进行股票分析.双均线策略制定目录数据分析04 /基于pandas的DateFrame进行股票分析.双均线策略制定需求1:对茅台股票分析需求2 ...
04. Pandas 3| 数值计算与统计、合并连接去重分组透视表文件读取
1.数值计算和统计基础常用数学.统计方法数值计算和统计基础基本参数:axis.skipna df.mean(axis=1,skipna=False) -->> axis=1是按行来 ...
Ubuntu16.04下安装配置numpy,scipy,matplotlibm,pandas 以及sklearn+深度学习tensorflow配置+Keras2.0.6（非Anaconda环境）
1.ubuntu镜像源准备(防止下载过慢): 参考博文:http://www.cnblogs.com/top5/archive/2009/10/07/1578815.html 步骤如下: 首先,备份一 ...
ubuntu16.04安装python3,numpy,pandas等量化计算库
ubunt安装python3 sudo add-apt-repository ppa:fkrull/deadsnakessudo apt-get updatesudo apt-get install ...
Desktop Ubuntu 14.04LTS/16.04科学计算环境配置
Desktop Ubuntu 14.04LTS/16.04科学计算环境配置计算机硬件配置 cpu i5 6代内存容量 8G gpu GTX960 显存容量 2G(建议显存在4G以上,否则一些稍具规 ...
pandas基础-Python3
未完 for examples: example 1: # Code based on Python 3.x # _*_ coding: utf-8 _*_ # __Author: "LEM ...

随机推荐

.netcore consul实现服务注册与发现-集群部署
一.Consul的集群介绍 Consul Agent有两种运行模式:Server和Client.这里的Server和Client只是Consul集群层面的区分,与搭建在Cluster之上的应用服务无关 ...
FILEBEAT+ELK日志收集平台搭建流程
filebeat+elk日志收集平台搭建流程 1. 整体简介: 模式:单机平台:Linux - centos - 7 ELK:elasticsearch.logstash.kiban ...
PyQt编写Python GUI程序，简易示例
https://blog.csdn.net/qq_41841569/article/details/81014207
Linux use apktool problem包体变大GLIBC2.14等问题
Linux服务器在线打包遇到的问题转载请标明出处: https://dujinyang.blog.csdn.net/article/details/80110942 本文出自:[奥特曼超人的博客] ...
POJ 1661 暴力dp
题意略. 思路: 很有意思的一个题,我采用的是主动更新未知点的方式,也即刷表法来dp. 我们可以把整个路径划分成横向移动和纵向移动,题目一开始就给出了Jimmy的高度,这就是纵向移动的距离. 我们dp ...
Vmware启动ubuntu 出现错误。
Vmware启动ubuntu 出现错误“以独占方式锁定此配置文件失败. 可能其它正在运行VMware进程在使用此配置文件”. 在网上查找了很多方法,法(1)试过在启动任务管理器中“结束与VMware有 ...
Angular Material 的设计之美
前言 Angular Material 作为 Angular 的官方组件库,无论是设计交互还是易用性都有着极高的质量.正如官方所说其目的就是构建基于 Angular 和 Typescript 的高质量 ...
SPOJ - Find The Determinant III 计算矩阵的行列式答案 + 辗转相除法思想
SPOJ -Find The Determinant III 参考:https://blog.csdn.net/zhoufenqin/article/details/7779707 参考中还有几个关于 ...
POJ3321 - Apple Tree DFS序 + 线段树或树状数组
Apple Tree:http://poj.org/problem?id=3321 题意: 告诉你一棵树,每棵树开始每个点上都有一个苹果,有两种操作,一种是计算以x为根的树上有几个苹果,一种是转换x这 ...
HDU 4479 Shortest path 带限制最短路
题意:给定一个图,求从1到N的递增边权的最短路. 解法:类似于bellman-ford思想,将所有的边先按照权值排一个序,然后依次将边加入进去更新,每条边只更新一次,为了保证得到的路径是边权递增的,每 ...