SparkSQL and Hive

https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

https://cwiki.apache.org/confluence/display/Hive/Home

【Serves data warehousing; closely tracks the SQL standard】

Apache Hive

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

【Spark is one of the available execution engines】

Built on top of Apache Hadoop™, Hive provides the following features:

  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase

  • Query execution via Apache Tez, Apache Spark, or MapReduce (engine selection is sketched after this list)
  • Procedural language with HPL-SQL
  • Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.
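
The execution engine is chosen per session through the hive.execution.engine property (mr, tez, or spark). Below is a minimal Scala sketch of setting it over HiveServer2's JDBC interface; the host, port, user, and table name are hypothetical, and the hive-jdbc driver must be on the classpath.

```scala
// Minimal sketch, not a definitive setup: switching Hive's execution
// engine for one session via HiveServer2 JDBC. Host, port, user, and
// table name are hypothetical placeholders.
import java.sql.DriverManager

object HiveEngineDemo {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver2-host:10000/default", "hive", "")
    val stmt = conn.createStatement()

    // hive.execution.engine accepts mr, tez, or spark
    stmt.execute("set hive.execution.engine=spark")

    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) println(rs.getLong(1))

    rs.close(); stmt.close(); conn.close()
  }
}
```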

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. 
Hive's SQL can also be extended with user code via user defined functions (UDFs), user defined aggregates (UDAFs), and user defined table functions (UDTFs).
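As a concrete illustration of the UDF extension point, here is a minimal Scala sketch of a classic (old-style) Hive UDF. The package, class, and function names are hypothetical, and it assumes the hive-exec jar is on the compile classpath.

```scala
package com.example

// Minimal sketch of an old-style Hive UDF; class/function names are
// hypothetical. Register it in Hive with, e.g.:
//   ADD JAR my-udfs.jar;
//   CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpper';
//   SELECT my_upper(name) FROM users;
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class MyUpper extends UDF {
  // Hive calls evaluate() once per row; null input yields null output.
  def evaluate(input: Text): Text =
    if (input == null) null
    else new Text(input.toString.toUpperCase)
}
```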

There is not a single "Hive format" in which data must be stored. Hive comes with built in connectors for comma and tab-separated values (CSV/TSV) text files, Apache Parquet, Apache ORC, and other formats.
Users can extend Hive with connectors for other formats. Please see File Formats and Hive SerDe in the Developer Guide for details.

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks. 
Hive is designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose-coupling with its input formats.

Components of Hive include HCatalog and WebHCat.

  • HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
  • WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface.
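
To make the REST interface concrete, here is a hedged Scala sketch that pings WebHCat's status endpoint. The host name and user are assumptions, and 50111 is only WebHCat's default port.

```scala
// Hedged sketch: checking that WebHCat is up via its REST status endpoint.
// "webhcat-host" and the user name are placeholders; 50111 is the default port.
import scala.io.Source

object WebHCatStatus {
  def main(args: Array[String]): Unit = {
    val url = "http://webhcat-host:50111/templeton/v1/status?user.name=hive"
    // Expected response is a small JSON body, e.g. {"status":"ok","version":"v1"}
    val body = Source.fromURL(url, "UTF-8").mkString
    println(body)
  }
}
```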

https://issues.apache.org/jira/browse/HIVE-7292

Spark, as an open-source data analytics cluster computing framework, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide users a new alternative so that they can consolidate their backend.

Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

【Better performance for queries with multiple reducer stages】

Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience just as Tez does.

This is an umbrella JIRA which will cover many coming subtasks. Design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated!

【SparkSQL can share Hive's metadata】

Does SparkSQL have no metadata store of its own? Does it create temporary metadata on the fly, or use Hive's metadata directly?
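
One way to see the answer: with nothing configured, SparkSQL falls back to a session-local catalog (backed by a local Derby metastore_db), but it can instead attach to an existing Hive metastore. A minimal Scala sketch, assuming Spark 2.x with hive-site.xml on the classpath and a hypothetical table name:

```scala
// Minimal sketch: pointing SparkSQL at Hive's metastore instead of its
// default session-local catalog. Assumes Spark 2.x and that hive-site.xml
// is on the classpath; the table name is hypothetical.
import org.apache.spark.sql.SparkSession

object SparkOnHiveMetastore {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-on-hive-metastore")
      .enableHiveSupport() // reuse table definitions from Hive's metastore
      .getOrCreate()

    // Tables created through Hive are now visible to SparkSQL directly.
    spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()
    spark.stop()
  }
}
```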

