Will Spark SQL Replace Hive?
sparksql hive
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
https://cwiki.apache.org/confluence/display/Hive/Home
[Serves data warehousing; strong support for standard SQL]
Apache Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
[Spark is one of the available execution engines]
Built on top of Apache Hadoop™, Hive provides the following features:
- Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
- Query execution via Apache Tez™, Apache Spark™, or MapReduce
- Procedural language with HPL-SQL
- Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
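The engine choice in the list above is controlled by the Hive property `hive.execution.engine`, which accepts `mr`, `tez`, or `spark` on Hive versions that ship the corresponding backend. Below is a minimal sketch, assuming the `hive` command-line client is installed and on the PATH; the helper name `hive_command` and the table `page_views` are hypothetical and exist only for illustration.

```python
from typing import List

# Engine names accepted by the hive.execution.engine property.
VALID_ENGINES = ("mr", "tez", "spark")


def hive_command(query: str, engine: str = "spark") -> List[str]:
    """Build a `hive` CLI invocation that runs `query` on the chosen engine.

    The returned list is an argument vector; nothing is executed here.
    """
    if engine not in VALID_ENGINES:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {VALID_ENGINES}"
        )
    # --hiveconf overrides a Hive property for this invocation only;
    # -e runs the given query string and exits.
    return ["hive", "--hiveconf", f"hive.execution.engine={engine}", "-e", query]


if __name__ == "__main__":
    # `page_views` is a hypothetical table name.
    cmd = hive_command("SELECT COUNT(*) FROM page_views", engine="spark")
    print(" ".join(cmd))
```

On a machine with Hive installed, the list can be passed to `subprocess.run(cmd, check=True)`. The same property can also be set inside a session with `SET hive.execution.engine=spark;`.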
Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.
Hive's SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).
There is not a single "Hive format" in which data must be stored. Hive comes with built in connectors for comma and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.
Users can extend Hive with connectors for other formats. Please see File Formats and Hive SerDe in the Developer Guide for details.
Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.
Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose-coupling with its input formats.
Components of Hive include HCatalog and WebHCat.
- HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
- WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.
https://issues.apache.org/jira/browse/HIVE-7292
Spark as an open-source data analytics cluster computing framework has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantages of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide user a new alternative so that those user can consolidate their backend.
Secondly, providing such an alternative further increases Hive's adoption as it exposes Spark users to a viable, feature-rich de facto standard SQL tools on Hadoop.
[Performs well on queries with multiple reducer stages]
Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, thus improving user experience as Tez does.
This is an umbrella JIRA which will cover many coming subtask. Design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated!
[Shares Hive metadata]
Does Spark SQL have no metadata of its own? Does it create metadata temporarily, or does it use Hive's metadata directly?
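Both options in the question above exist in practice. Without Hive support, Spark SQL keeps its catalog in memory for the lifetime of the session, so temporary views disappear when the session ends. With Hive support enabled, it reads and writes table definitions through the Hive metastore. The following is a minimal PySpark sketch, assuming `pyspark` is installed with Hive support and a metastore service is reachable. The helper names, the application name, the host `metastore-host`, and the view `tmp_ids` are hypothetical; 9083 is the usual default metastore port.

```python
from typing import Dict, Optional


def metastore_conf(metastore_uri: Optional[str] = None) -> Dict[str, str]:
    """Return the setting that points Spark at a remote Hive metastore.

    An empty dict means "use whatever hive-site.xml on the classpath says".
    """
    if metastore_uri:
        return {"hive.metastore.uris": metastore_uri}
    return {}


def build_session(app_name: str, metastore_uri: Optional[str] = None):
    """Create a SparkSession whose catalog is backed by the Hive metastore."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in metastore_conf(metastore_uri).items():
        builder = builder.config(key, value)
    # enableHiveSupport() switches the catalog from in-memory to Hive.
    return builder.enableHiveSupport().getOrCreate()


def demo() -> None:
    # Hypothetical host name; replace with a real metastore address.
    spark = build_session("metastore-demo", "thrift://metastore-host:9083")

    # Tables listed here come from Hive's metadata (shared with Hive itself).
    spark.sql("SHOW TABLES").show()

    # A temporary view lives only in this session and is never written
    # to the metastore.
    spark.range(3).createOrReplaceTempView("tmp_ids")
    spark.sql("SELECT id FROM tmp_ids").show()
```

Calling `demo()` on a configured cluster would list the Hive tables and then query the session-scoped view. Dropping the `enableHiveSupport()` call leaves only the in-memory catalog.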