Accessing data in Hadoop using dplyr and SQL

payton数据之旅 2024-10-29 05:44:52 原文

If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyrpackage has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.

There are two methods for accessing data in Hadoop using dplyr and SQL.

ODBC

You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:

library(odbc)

con <- dbConnect(odbc::odbc(),

                 driver = <driver>,

                 host = <host>,

                 dbname = <dbname>,

                 user = <user>,

                 password = <password>,

                 port = 10000)

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

dbDisconnect(con)

Spark

If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connect with Spark run queries:

library(sparklyr)


con <- spark_connect(master = "yarn-client")

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

spark_disconnect(con)

转自：https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL

Accessing data in Hadoop using dplyr and SQL的更多相关文章

【Big Data】HADOOP集群的配置（二）
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
[Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
【Big Data】HADOOP集群的配置（一）
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
举例说明：Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
数据库原理及应用-SQL数据操纵语言（Data Manipulation Language）和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...

随机推荐

git中Untracked files如何清除
$ git status # On branch test # Untracked files: # (use "git add <file>..." to inclu ...
JMeter的下载以及安装使用
下载 https://jmeter.apache.org/download_jmeter.cgi 安装无须安装,解压之后即可使用. 解压到C:\Program Files\apache-jmeter ...
使用由 Intel MKL 支持的 R
我们通常使用的 R 版本是单线程的,即只使用一个 CPU 线程运行所有 R 代码.这样的好处是运行模型比较简单且安全,但是它并没有利用多核计算.Microsoft R Open(MRO,https:/ ...
bzoj 1318 [SPOJ744] Longest Permutation (排列)
大意: 给定序列, 求选出一个长度为k的区间, 使得区间内的数为[1,k]的排列, 且要求k最大这题好神啊. 每个排列有且仅有一个1, 我们按1将序列分成若干子问题来处理, 而每个位置最多属于两个子 ...
EBS 定义并发参数常用值集
1.ORG_ID 2.DATE 3.YES_NO
115. Distinct Subsequences *HARD* -- 字符串不连续匹配
Given a string S and a string T, count the number of distinct subsequences of T in S. A subsequence ...
046——VUE中组件之使用动态组件灵活设置页面布局
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8&quo ...
PHP:第六章——正则表达式的基本概念
<?php header("Content-Type:text/html;charset=utf-8"); //正则表达式的基本概念: //宽松匹配和严格匹配: //常见的匹 ...
DOM之概述
body, table{font-family: 微软雅黑; font-size: 10pt} table{border-collapse: collapse; border: solid gray; ...
VS2013命令行界面查看虚函数的内存布局
内存布局可能使用vs的界面调试看到的旺旺是一串数字,很不方便,但是vs的命令行界面可以很直观的显示出一个类中具体的内存布局. 打开命令行.界面如下所示: 测试代码如下所示: class Base1 { ...