Accessing data in Hadoop using dplyr and SQL
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyrpackage has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(odbc)
con <- dbConnect(odbc::odbc(),
driver = <driver>,
host = <host>,
dbname = <dbname>,
user = <user>,
password = <password>,
port = 10000)
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connect with Spark run queries:
library(sparklyr)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
转自:https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL的更多相关文章
- 【Big Data】HADOOP集群的配置(二)
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
- java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
- android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
- [Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
- 【Big Data】HADOOP集群的配置(一)
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- 使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
- 举例说明:Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
- 数据库原理及应用-SQL数据操纵语言(Data Manipulation Language)和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...
随机推荐
- MVC 返回对象换成json
错误界面: 这个就是返回对象没有转换成json 就是要再返回的头部添加application/json 代码: using System; using System.Collections.Gener ...
- ISSCC 2017论文导读 Session 14: A 28nm SoC with a 1.2GHz Prediction Sparse Deep-Neural-Network Engine
A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Erro ...
- JQuery Ajax jsonp
JQuery ajax jsonp $.ajax({ method:"POST", url:"http://localhost:8081/ChenLei/PeopleSe ...
- G_M_网络流A_网络吞吐量
调了两天的代码,到最后绝望地把I64d改成lld就过了,我真的是醉了. 网络吞吐量 题面:给出一张(n个点,m条边)带权(点权边权均有)无向图,点权为每个点每秒可以接受发送的最大值,边权为花费,保证数 ...
- LA 3942 背单词
https://vjudge.net/problem/UVALive-3942 题意: 给出一个由S个不同单词组成的字典和一个长字符串.把这个字符串分解成若干个单词的连接,有多少种方法?比如,有4个单 ...
- http cookie的domain使用
问题描述 最近遇到了一个因cookie domain设置不正确导致公司自研的分布式session组件无法生效的问题. 公司自研的这套分布式session组件依赖于设置在cookie中的sessionI ...
- Java回顾之JDBC
这篇文章里,我们来讨论一些和JDBC相关的话题. 概述 尽管在实际开发过程中,我们一般使用ORM框架来代替传统的JDBC,例如Hibernate或者iBatis,但JDBC是Java用来实现数据访问的 ...
- Java网络编程和NIO详解6:Linux epoll实现原理详解
Java网络编程和NIO详解6:Linux epoll实现原理详解 本系列文章首发于我的个人博客:https://h2pl.github.io/ 欢迎阅览我的CSDN专栏:Java网络编程和NIO h ...
- python爬虫脚本下载YouTube视频
python爬虫脚本下载YouTube视频 爬虫 python YouTube视频 工作环境: python 2.7.13 pip lxml, 安装 pip install lxml,主要用xpath ...
- 为了更好更方便地活着——爱上private
刚开始接触OOP的时候,打心底里我不喜欢private与protected.我声明一个public然后不直接用它,不就跟private一样吗?在某些场合下,我还能偷偷地用一下public变量,这不是更 ...