Using Sqoop on the Azure Cloud Platform to Import SQL Server 2012 Tables into Hive / HBase
My name is Farooq and I am with the HDInsight support team here at Microsoft. In this blog I will give a brief overview of Sqoop in HDInsight and then use an example of importing data from a Windows Azure SQL Database table to an HDInsight cluster to demonstrate how you can get started with Sqoop in HDInsight.
What is Sqoop?
Sqoop is an Apache project and part of the Hadoop ecosystem. It allows data transfer between a Hadoop/HDInsight cluster and relational databases such as SQL Server, Oracle, MySQL etc. Sqoop is a collection of related tools, for example import, export, list-tables, list-databases etc. To use Sqoop, you specify the tool you want to use and the arguments that control the tool. For more information on Sqoop please check the Sqoop User Guide.
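The "tool plus arguments" shape of a Sqoop invocation can be sketched in a few lines of Python. This is only an illustration: the helper name `build_sqoop_args` is hypothetical, and it merely assembles an argument list without running Sqoop. The tool (`import`) and the arguments (`--connect`, `--table`, `--target-dir`) are the ones used later in this post.

```python
def build_sqoop_args(tool, **options):
    """Return a Sqoop argument list: the tool name first, then its arguments.

    Hypothetical helper for illustration only. Keyword names are mapped to
    Sqoop arguments by replacing underscores with hyphens, so target_dir
    becomes --target-dir.
    """
    args = [tool]
    for name, value in options.items():
        args.append("--" + name.replace("_", "-"))
        # Flags such as --hive-import take no value; pass True for those.
        if value is not True:
            args.append(str(value))
    return args


if __name__ == "__main__":
    # Placeholders in angle brackets must be replaced with real values.
    print(build_sqoop_args(
        "import",
        connect="jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433",
        table="Table1",
        target_dir="/user/hdp/SqoopImportTable1",
    ))
```

The first element is always the tool, and everything after it controls that tool, which mirrors how the `sqoop.cmd` commands in this post are laid out.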
When do you need to use Sqoop?
You need Sqoop only when you are importing or exporting data between Hadoop and a relational database. HDInsight provides a full-featured Hadoop Distributed File System (HDFS) over Windows Azure Blob storage (WASB); if you want to upload data to HDInsight or WASB from any other source, for example from your local computer's file system, you should use one of the tools discussed in this article. The same article also discusses how to import data to HDFS from SQL Database/SQL Server using Sqoop. In this blog I will elaborate on the same with an example and try to provide more detailed information along the way.

What do I need to do for Sqoop to work in my HDInsight cluster?
HDInsight 2.1 includes Sqoop 1.4.3. The Microsoft SQL Server SQOOP Connector for Hadoop is now part of Apache Sqoop 1.4, so you do not need to install the connector separately. All HDInsight clusters also have the Microsoft SQL Server JDBC driver installed, so every component needed to transfer data between an HDInsight cluster and SQL Server is already present in the cluster and you do not have to install anything.
How can I run a Sqoop job?
With the HDInsight preview version we could only run Sqoop commands from the Hadoop command line after opening a remote desktop (RDP) session on the HDInsight cluster head node. However, the release version of the HDInsight SDK includes PowerShell cmdlets to run Sqoop jobs remotely. So we can:
- Run Sqoop jobs locally from the HDInsight head node using Hadoop Command Line
- Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets

We recommend that you run your Sqoop commands remotely using the HDInsight SDK cmdlets. We will discuss both options in detail. First let's see how we can run Sqoop jobs locally from the HDInsight head node using Hadoop Command Line.
Run Sqoop jobs locally from HDInsight head node using Hadoop Command Line
I am assuming you already have a Windows Azure SQL Database. If you don't and you want to get one, please follow the steps in this article. Let's follow the steps below to create a test table in your Windows Azure SQL Database and populate it with some sample data, which we will import into our HDInsight cluster shortly. I will show how to do this from the Windows Azure portal, but you can also connect to the Windows Azure SQL Database from SSMS and do the same.
Note: if you want to transfer data from a SQL Server in your own environment instead, you need to change the Sqoop command to use the appropriate connection information. It should be very similar to the connection string I have provided later in this blog, under the 'More sample Sqoop commands' section, for SQL Server on a Windows Azure VM.
- Log in to your Windows Azure Portal, select 'SQL Databases' from the left, and click 'Manage' at the bottom.

- Provide your Windows Azure SQL Database user ID and password to log in, and then click 'New Query' to open a new query window to run T-SQL queries.

- Copy and paste the following T-SQL query and execute it to create a test table, Table1.
CREATE TABLE [dbo].[Table1](
[ID] [int] NOT NULL,
[FName] [nvarchar](50) NOT NULL,
[LName] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Table_4] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
) ON [PRIMARY]
GO
- Run the following to populate Table1 with 4 rows.
INSERT INTO [dbo].[Table1] VALUES (1,'Jhon','Doe'), (2,'Harry','Hoe'), (3, 'Carla','Coe'), (4,'Jackie','Joe');
GO
- Finally, run the following T-SQL to make sure that the table is populated with the sample data. It should return the four rows inserted above.
SELECT * from [dbo].[Table1]

Now let's follow the steps below to import the rows in Table1 to the HDInsight cluster.
- Log in to your HDInsight cluster head node via Remote Desktop (RDP) and double-click the 'Hadoop Command Line' icon on the desktop to open Hadoop Command Line. RDP access is turned off by default, but you can follow the steps in this blog to enable RDP and then RDP to the head node of your HDInsight cluster.
- In Hadoop Command Line please navigate to the "C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\bin" folder.
Note: Please verify the path of the Sqoop bin folder in your environment. It may vary slightly from version to version.
- Run the following Sqoop command to import all the rows of table "Table1" from Windows Azure SQL Database "mfarooqSQLDB" to HDInsight Cluster.
sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1
Once the command completes successfully, the Hadoop Command Line window shows the progress of the MapReduce job that Sqoop launched for the import.

- There are quite a number of tools available to upload, download and view data in WASB. Let's use the Azure Storage Explorer tool. You need to install the tool on your workstation and configure it for your cluster. Once that is done, open the tool and locate the /user/hdp/SqoopImportTable1 folder. It should contain 4 files, indicating that 4 map tasks were used. You can select a file and click the 'View' button to see the actual text data.
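If you download one of the imported files, its rows can be read back with a few lines of Python. The sketch below assumes Sqoop's default text format (comma-separated fields, one record per line) and that no field contains a comma; the function name `parse_sqoop_text` and the sample string are hypothetical and only mirror the three columns of Table1.

```python
def parse_sqoop_text(text, delimiter=","):
    """Split Sqoop text output into (ID, FName, LName) tuples.

    Assumes the default text format and that no field contains the delimiter.
    """
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        row_id, fname, lname = line.split(delimiter)
        rows.append((int(row_id), fname, lname))
    return rows


# Hypothetical contents of one downloaded file, based on the sample rows in Table1.
sample = "1,Jhon,Doe\n2,Harry,Hoe\n"
for row in parse_sqoop_text(sample):
    print(row)
```

The same comma delimiter is what the export command later in this post passes through `--input-fields-terminated-by`.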

Now let's export the same rows back to SQL Server from the HDInsight cluster. Please use a different table with the same schema as 'Table1'. Otherwise you would get a primary key violation error, since the rows already exist in 'Table1'.
- Create an empty table 'Table2' with the same schema as 'Table1'.
CREATE TABLE [dbo].[Table2](
[ID] [int] NOT NULL,
[FName] [nvarchar](50) NOT NULL,
[LName] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Table_2] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
) ON [PRIMARY]
GO
- Run the following Sqoop command from Hadoop Command Line.
sqoop.cmd export --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table2 --export-dir /user/hdp/SqoopImportTable1 --input-fields-terminated-by ","
More sample Sqoop commands:
Import from a SQL Server on a Windows Azure VM:
sqoop.cmd import --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_1 --target-dir /user/hdp/SqoopImportTable
Export to a SQL Server on a Windows Azure VM:
sqoop.cmd export --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_2 --export-dir /user/hdp/SqoopImportTable2 --input-fields-terminated-by ","
Importing to Hive from Windows Azure SQL Database:
sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --hive-import
Note: This will store the files under the hive/warehouse/TableName folder in HDFS (for example, hive/warehouse/table1/part-m-00000).
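The sample commands above use two forms of connection string: for Windows Azure SQL Database the user name is written as user@server, while for SQL Server on a VM it is the plain login. The difference can be sketched as a small Python helper. The function name `sqlserver_jdbc_url` and its parameters are hypothetical, and the sketch does not escape special characters (such as a semicolon) in the password.

```python
def sqlserver_jdbc_url(host, database, user, password,
                       port=1433, azure_sql_database=False):
    """Build a SQL Server JDBC URL in the form used by the Sqoop commands above.

    Hypothetical helper. For Windows Azure SQL Database the login is written
    as user@server, where server is the first label of the host name.
    Special characters in the values are not escaped.
    """
    if azure_sql_database:
        server = host.split(".", 1)[0]
        user = f"{user}@{server}"
    return (f"jdbc:sqlserver://{host}:{port};"
            f"username={user};password={password};database={database}")


if __name__ == "__main__":
    # Made-up host names and credentials, for illustration only.
    print(sqlserver_jdbc_url("myserver.database.windows.net", "mydb",
                             "alice", "secret", azure_sql_database=True))
    print(sqlserver_jdbc_url("myvm.cloudapp.net", "mydb", "alice", "secret"))
```

Building the string in one place also avoids stray spaces after the semicolons, which are easy to introduce when editing the command by hand.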
Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets
To use the HDInsight PowerShell tools you need to install the Windows Azure PowerShell tools first and then install the HDInsight PowerShell tools. Then you need to prepare your workstation to use the HDInsight SDK. Please follow the detailed steps in this earlier blog post to install the tools and prepare your workstation to use the HDInsight SDK.
Once you have installed and configured the Windows Azure PowerShell tools and the HDInsight SDK, running a Sqoop job is very easy. Please follow the steps below to import all the rows of table "Table2" from Windows Azure SQL Database "mfarooqSQLDB" to the HDInsight cluster.
- Open the Windows Azure PowerShell console on the workstation and run the following cmdlets one at a time.
Note: You can also use Windows PowerShell ISE to type the code and run it all at once. PowerShell ISE makes edits easy, and you can open the tool from "C:\Windows\System32\WindowsPowerShell\v1.0\powershell_ise.exe".
- Set the variables for your Windows Azure Subscription name and the HDInsight cluster name.
$subscriptionName = "<WindowsAzureSubscriptionName>"
$clusterName = "<HDInsightClusterName>"
Select-AzureSubscription $subscriptionName
Use-AzureHDInsightCluster $clusterName -Subscription $subscriptionName
- Define the Sqoop job that we want to run. In this exercise we will import all the rows of table "Table2" that we created earlier in Windows Azure SQL Database.
$sqoop = New-AzureHDInsightSqoopJobDefinition -Command "import --connect jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName> --table Table2 --target-dir /user/hdp/SqoopImportTable8"
- Run the Sqoop job that we just defined.
$sqoopJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterName -JobDefinition $sqoop
- Run the following to wait for the completion or failure of the HDInsight job and show its progress.
Wait-AzureHDInsightJob -Subscription $subscriptionName -WaitTimeoutInSeconds 3600 -Job $sqoopJob
- Run the following to retrieve the log output for a job from the storage account associated with a specified cluster.
Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subscriptionName -StandardError -JobId $sqoopJob.JobId
If the Sqoop job completes successfully, the job's log output is displayed in your Windows Azure PowerShell console window.
Troubleshooting tips
When you run a Sqoop job, it runs a MapReduce job in the Hadoop cluster (map only, with no reduce task). You can specify the number of map tasks, but by default four tasks are used. There is no separate log file specific to Sqoop, so we need to troubleshoot a Sqoop job failure or performance issue like any other MapReduce job failure or performance issue and start by checking the task logs. I plan to write more on how to troubleshoot Sqoop issues by focusing on some specific scenarios in the near future.
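One quick check before digging into the task logs is to compare the files in the target folder with what a map-only job would normally write. The Python sketch below assumes the part-m-NNNNN naming seen earlier in this post (for example part-m-00000) and the default of four map tasks; the function names are hypothetical. The number of map tasks can be changed with Sqoop's `--num-mappers` argument (short form `-m`).

```python
def expected_part_files(num_mappers=4):
    """Names of the output files a map-only job with num_mappers tasks would normally write."""
    return [f"part-m-{i:05d}" for i in range(num_mappers)]


def missing_part_files(found, num_mappers=4):
    """Return the expected part files that are absent from the listing `found`."""
    found = set(found)
    return [name for name in expected_part_files(num_mappers) if name not in found]


if __name__ == "__main__":
    # Hypothetical folder listing with one file missing.
    listing = ["part-m-00000", "part-m-00002", "part-m-00003"]
    print(missing_part_files(listing))
```

A missing file is only a hint about which map task to look at first, since the actual number of tasks can differ from the requested one.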
That's all for today and I hope you found this blog useful. I look forward to your comments and suggestions :)