Accessing data in Hadoop using dplyr and SQL
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyrpackage has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(odbc)
con <- dbConnect(odbc::odbc(),
driver = <driver>,
host = <host>,
dbname = <dbname>,
user = <user>,
password = <password>,
port = 10000)
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connect with Spark run queries:
library(sparklyr)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
转自:https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL的更多相关文章
- 【Big Data】HADOOP集群的配置(二)
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
- java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
- android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
- [Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
- 【Big Data】HADOOP集群的配置(一)
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- 使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
- 举例说明:Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
- 数据库原理及应用-SQL数据操纵语言(Data Manipulation Language)和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...
随机推荐
- 20145311 《Java程序设计》第七周学习总结
20145311 <Java程序设计>第七周学习总结 教材学习内容总结 第十二章 Lambda Lambda表达式会使程序更加地简洁,在平行设计的时候,能够进行并行处理. 第十三章 时间与 ...
- 初识PHP(一)基础语法
一直准备学习PHP,结果前一段时间总是有事情,耽误了一阵子.现在赶快迎头赶上! 这个系列只是谈谈我对于PHP的一些看法,不是教程性质的.另外我是小白,只是写写随笔,大神求轻拍.本人学习过c .java ...
- win10下安装lxml
最近在windows平台下开发,用的python3.6,安装lxml遇到点问题,现已解决.特意记下,以供以后再遇到. 解决方法: 1.打开cmd终端,查看pip版本,pip --version,如不是 ...
- Codeforces Round #390 (Div. 2) C. Vladik and chat(dp)
http://codeforces.com/contest/754/problem/C C. Vladik and chat time limit per test 2 seconds memory ...
- mis权限系统
在mis中开发,主要目的是有一个统一的权限管理(即r360.right表),以及一个统一的系统和界面供后台配置管理 1.数据库准备工作: mis后台涉及表: right表是权限操作表,role_rig ...
- PHP如何安装扩展
PHP如何安装扩展 一.总结 一句话总结:兩步: dll php.ini a.下载好扩展的dll,放入指定文件夹下 b.在php.ini配置文件中声明插件 1.什么是php扩展? php核心 不支持 ...
- 雷林鹏分享:Ruby 多线程
Ruby 多线程 每个正在系统上运行的程序都是一个进程.每个进程包含一到多个线程. 线程是程序中一个单一的顺序控制流程,在单个程序中同时运行多个线程完成不同的工作,称为多线程. Ruby 中我们可以通 ...
- WPF——RenderTransform特效
WPF: RenderTransform特效 WPF中的变形(RenderTransform)类是为了达到直接去改变某个Silverlight对象的形状(比如缩放.旋转一个元素)的目的而设计的,Ren ...
- UVA-1152 4 Values whose Sum is 0 (二分)
题目大意:在4个都有n个元素的集合中,每个集合选出一个元素,使得4个数和为0.问有几种方案. 题目分析:二分.任选两组求和,剩下两组求和,枚举第一组中每一个和sum,在第二组和中查找-sum的个数,累 ...
- IIS服务器禁用缓存的方法
IIS服务器禁用缓存的方法: 工作中经常有实施的同事问我为什么界面皮肤图片替换后网站上没反应,要等很久才会出现结果.这个其实是服务器缓存的设置引起的问题,以前不知道怎么解决,临时的清缓存文件夹,有时候 ...