Accessing data in Hadoop using dplyr and SQL
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyrpackage has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(odbc)
con <- dbConnect(odbc::odbc(),
driver = <driver>,
host = <host>,
dbname = <dbname>,
user = <user>,
password = <password>,
port = 10000)
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connect with Spark run queries:
library(sparklyr)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
转自:https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL的更多相关文章
- 【Big Data】HADOOP集群的配置(二)
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
- java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
- android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
- [Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
- 【Big Data】HADOOP集群的配置(一)
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- 使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
- 举例说明:Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
- 数据库原理及应用-SQL数据操纵语言(Data Manipulation Language)和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...
随机推荐
- Linux内核分析
通过分析汇编代码理解计算机是如何工作的 网易云课堂<Linux内核分析>作业 实验目的: 通过反汇编一个简单的C程序,分析汇编代码理解计算机是如何工作的 实验过程: 登陆实验楼虚拟机h ...
- 再也不学AJAX了!(三)跨域获取资源 ③ - WebSocket & postMessage
让我们先简单回顾一下之前谈到的内容,AJAX是一种无页面刷新的获取服务器资源的混合技术.而基于浏览器的"同源策略",不同"域"之间不可以发送AJAX请求.但是在 ...
- awk根据指定的字符串分割字符串
以从字符串"hello-kitty-red-for-you"中获取-for前面的内容为例: echo "hello-kitty-red-for-you" |aw ...
- 11_MySQL_分页查询
# 分页查询/* 应用场景:要显示的数据,一页显示不全,需要分页提交sql请求 语法: select 查询列表 from 表 [join type]表2 on 连接条件 where 筛选条件 grou ...
- POJ 2771 Guardian of Decency
http://poj.org/problem?id=2771 题意: 一个老师想带几个同学出去,但是他怕他们会谈恋爱,所以带出去的同学两两之间必须满足如下条件之一: ①身高差大于40 ②同性 ③喜欢 ...
- pysam - 多种格式基因组数据(sam/bam/vcf/bcf/cram/…)读写与处理模块(python)--转载
pysam 模块介绍!!!! http://pysam.readthedocs.io/en/latest/index.html 在开发基因组相关流程或工具时,经常需要读取.处理和创建bam.vcf.b ...
- python 输出环境变量
import os # Access all environment variables print('*---------------ENVIRON-------------------*') pr ...
- Java的封装性、继承性和多态性
封装 封装隐藏了类的内部实现机制,可以在不影响使用的情况下改变类的内部结构,同时也保护了数据.对外界而已它的内部细节是隐藏的,暴露给外界的只是它的访问方法. 封装的优点: 便于使用者正确.方便的使用系 ...
- callee 与 caller
arguments.callee 在函数内部指向函数本身 1.函数调用 function sum (num){ if(num <= 1){ return 1; }else{ return num ...
- ✅问题:Rails.ajax的data不支持{}hash格式。必须使用string。 dataType的格式。
Rails.ajax({ url: url, type: "PATCH", data: {"post":{"category_id":thi ...