[Repost] Loading Data into HAWQ
Loading data into the database is required before you can start using it, but how? There are several approaches that satisfy this basic requirement, each tackling the problem in a different way, so you can pick the loading method that best matches your use case.
Table Setup
This table will be used for the testing in HAWQ. I have it created in a single-node VM running Hortonworks HDP with HAWQ 2.0 installed. I'm using the default Resource Manager too.
CREATE TABLE test_data
(id int,
fname text,
lname text)
DISTRIBUTED RANDOMLY;
Singleton
Let's start with probably the worst way. Sometimes it is the right choice because you have very little data to load, but in most cases, avoid singleton inserts. This approach inserts just a single tuple per transaction.
head si_test_data.sql
insert into test_data (id, fname, lname) values (1, 'jon_00001', 'roberts_00001');
insert into test_data (id, fname, lname) values (2, 'jon_00002', 'roberts_00002');
insert into test_data (id, fname, lname) values (3, 'jon_00003', 'roberts_00003');
insert into test_data (id, fname, lname) values (4, 'jon_00004', 'roberts_00004');
insert into test_data (id, fname, lname) values (5, 'jon_00005', 'roberts_00005');
insert into test_data (id, fname, lname) values (6, 'jon_00006', 'roberts_00006');
insert into test_data (id, fname, lname) values (7, 'jon_00007', 'roberts_00007');
insert into test_data (id, fname, lname) values (8, 'jon_00008', 'roberts_00008');
insert into test_data (id, fname, lname) values (9, 'jon_00009', 'roberts_00009');
insert into test_data (id, fname, lname) values (10, 'jon_00010', 'roberts_00010');
This repeats for 10,000 tuples.
time psql -f si_test_data.sql > /dev/null
real 5m49.527s
As you can see, this is pretty slow and not recommended for inserting large amounts of data. Nearly 6 minutes to load 10,000 tuples is crawling.
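If you are stuck with INSERT statements, one common mitigation (not covered in the original timing test) is to batch many rows into a single statement so the database commits once per batch instead of once per tuple. A minimal sketch, assuming HAWQ's PostgreSQL lineage accepts multi-row VALUES lists:
-- one statement, one transaction, many tuples
INSERT INTO test_data (id, fname, lname) VALUES
  (1, 'jon_00001', 'roberts_00001'),
  (2, 'jon_00002', 'roberts_00002'),
  (3, 'jon_00003', 'roberts_00003');
This reduces per-transaction overhead, but everything still funnels through the master, so the approaches below remain the better choice for real volumes.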
COPY
If you are familiar with PostgreSQL then you will feel right at home with this technique. This time, the data is in a file named test_data.txt and the rows are not wrapped in INSERT statements.
head test_data.txt
1|jon_00001|roberts_00001
2|jon_00002|roberts_00002
3|jon_00003|roberts_00003
4|jon_00004|roberts_00004
5|jon_00005|roberts_00005
6|jon_00006|roberts_00006
7|jon_00007|roberts_00007
8|jon_00008|roberts_00008
9|jon_00009|roberts_00009
10|jon_00010|roberts_00010
COPY test_data FROM '/home/gpadmin/test_data.txt' WITH DELIMITER '|';
COPY 10000
Time: 128.580 ms
This method is significantly faster, but it loads the data through the master. That means it doesn't scale well, since the master becomes the bottleneck, but it does allow you to load data from any host on your network so long as it can reach the master.
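A related option worth knowing is psql's client-side \copy meta-command, which behaves like COPY but reads the file on the client machine and streams it through the master. A quick sketch, assuming the file sits on a remote client that can reach the master host hdb:
# run from any client machine; the file lives on the client, not the master
psql -h hdb -c "\copy test_data FROM 'test_data.txt' WITH DELIMITER '|'"
Either way, the data still flows through the master, so this is a convenience rather than a scalability fix.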
gpfdist
gpfdist is a web server that serves POSIX files for the segments to fetch. Segment processes get the data directly from gpfdist, bypassing the master entirely. This enables you to scale by adding more gpfdist processes and/or more segments.
gpfdist -p 8888 &
[1] 128836
[gpadmin@hdb ~]$ Serving HTTP on port 8888, directory /home/gpadmin
Now you’ll need to create a new external table to read the data from gpfdist.
CREATE EXTERNAL TABLE gpfdist_test_data
(id int,
fname text,
lname text)
LOCATION ('gpfdist://hdb:8888/test_data.txt')
FORMAT 'TEXT' (DELIMITER '|');
And to load the data.
INSERT INTO test_data SELECT * FROM gpfdist_test_data;
INSERT 0 10000
Time: 98.362 ms
gpfdist is blazing fast and scales easily. You can add more than one gpfdist location to the external table, use wildcards, use different formats, and much more. The downside is that the file must be on a host that all segments can reach, and you have to start a separate gpfdist process on that host.
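As a hypothetical example (the etl1 and etl2 hostnames and the wildcard pattern are assumptions, not part of the original setup), a definition like this spreads the reads across two gpfdist processes and picks up every matching file:
-- two gpfdist locations plus a wildcard; segments pull from both in parallel
CREATE EXTERNAL TABLE gpfdist_test_data_multi
(id int,
fname text,
lname text)
LOCATION ('gpfdist://etl1:8888/test_data*.txt',
          'gpfdist://etl2:8888/test_data*.txt')
FORMAT 'TEXT' (DELIMITER '|');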
gpload
gpload is a utility that automates the loading process by using gpfdist. Review the documentation for more on this utility. Technically, it is the same as using gpfdist and external tables; it just automates the commands for you.
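As a rough sketch of what gpload automates (the YAML below is an assumption built from the earlier gpfdist example; the database name, host, port, and paths are not from the original post, and the exact control-file syntax should be checked against the gpload documentation), a control file for test_data might look like this:
# test_data.yml -- hypothetical gpload control file
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: gpadmin
HOST: hdb
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hdb
         PORT: 8888
         FILE:
           - /home/gpadmin/test_data.txt
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: test_data
    - MODE: insert
You would then run it with gpload -f test_data.yml, and gpload starts the gpfdist process, creates the external table, and performs the INSERT for you.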
PXF (HAWQ Extension Framework)
PXF allows you to read and write data in HDFS using external tables. Like gpfdist, the work is done by each segment, so it scales and executes in parallel.
For this example, I’ve loaded the test data into HDFS.
hdfs dfs -cat /test_data/* | head
1|jon_00001|roberts_00001
2|jon_00002|roberts_00002
3|jon_00003|roberts_00003
4|jon_00004|roberts_00004
5|jon_00005|roberts_00005
6|jon_00006|roberts_00006
7|jon_00007|roberts_00007
8|jon_00008|roberts_00008
9|jon_00009|roberts_00009
10|jon_00010|roberts_00010
The external table definition.
CREATE EXTERNAL TABLE et_test_data
(id int,
fname text,
lname text)
LOCATION ('pxf://hdb:51200/test_data?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER '|');
And now to load it.
INSERT INTO test_data SELECT * FROM et_test_data;
INSERT 0 10000
Time: 227.599 ms
PXF is probably the best way to load data when using the “Data Lake” design. You load your raw data into HDFS and then consume it with a variety of tools in the Hadoop ecosystem. PXF can also read and write other formats.
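For the write direction (not timed in the original post), the same profile can back a writable external table. This is a sketch that assumes the HdfsTextSimple profile accepts writes and that /test_data_out is an acceptable HDFS target path:
-- hypothetical writable table; rows inserted here land in HDFS as delimited text
CREATE WRITABLE EXTERNAL TABLE wr_test_data
(id int,
fname text,
lname text)
LOCATION ('pxf://hdb:51200/test_data_out?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO wr_test_data SELECT * FROM test_data;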
Outsourcer and gplink
Last but not least are software programs I created. Outsourcer automates table creation and data loading directly into Greenplum or HAWQ using gpfdist. It sources data from SQL Server and Oracle, as these are the two most common OLTP databases.
gplink is another tool that reads external data, but this technique can connect to any valid JDBC source. It doesn't automate as many of the steps that Outsourcer does, but it is a convenient tool for getting data from a JDBC source.
You might be thinking that Sqoop does this, but not exactly. gplink and Outsourcer load data into HAWQ and Greenplum tables. They are optimized for these databases and fix the data for you automatically: both remove null and newline characters and escape the escape and delimiter characters. With Sqoop, you would have to read the data from HDFS using PXF and then fix whatever errors are in the files.
Both tools are linked above.
Summary
This post gives a brief description of the various ways to load data into HAWQ. Pick the right technique for your use case. As you can see, HAWQ is very flexible and can handle a variety of loading methods.
This entry was posted in Hadoop on July 14, 2016.