What Is Hive

Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

What Hive Is NOT

Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result - latency for Hive queries is generally very high (minutes) even when data sets involved are very small (say a few hundred megabytes). As a result it cannot be compared with systems such as Oracle where analyses are conducted on a significantly smaller amount of data but the analyses proceed much more iteratively with the response times between iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs). 不支持在线事务处理,不支持实时查询,不支持行数据更新。最好的用途是:批量处理大数据集的不可改变的数据,比如网络日志。

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

[Hive-Tutorial] What Is Hive的更多相关文章

  1. Hive Tutorial(上)(Hive 入门指导)

    用户指导 Hive 指导 Hive指导 概念 Hive是什么 Hive不是什么 获得和开始 数据单元 类型系统 内置操作符和方法 语言性能 用法和例子(在<下>里面) 概念 Hive是什么 ...

  2. [Hive - Tutorial] Type System 数据类型

    数据类型Type System Hive supports primitive and complex data types, as described below. See Hive Data Ty ...

  3. Hive Tutorial 阅读记录

    Hive Tutorial 目录 Hive Tutorial 1.Concepts 1.1.What Is Hive 1.2.What Hive Is NOT 1.3.Getting Started ...

  4. Spark入门实战系列--5.Hive(上)--Hive介绍及部署

    [注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .Hive介绍 1.1 Hive介绍 月开源的一个数据仓库框架,提供了类似于SQL语法的HQ ...

  5. hive学习3(hive基本操作)

    hive基本操作 hive的数据类型 1)基本数据类型 TINYINT,SMALLINT,INT,BIGINT FLOAT/DOUBLE BOOLEAN STRING 2)复合类型 ARRAY:一组有 ...

  6. [Hive - LanguageManual] Statistics in Hive

    Statistics in Hive Statistics in Hive Motivation Scope Table and Partition Statistics Column Statist ...

  7. Hive over HBase和Hive over HDFS性能比较分析

    http://superlxw1234.iteye.com/blog/2008274 环境配置: hadoop-2.0.0-cdh4.3.0 (4 nodes, 24G mem/node) hbase ...

  8. Hive学习之一 《Hive的介绍和安装》

    一.什么是Hive Hive是建立在 Hadoop 上的数据仓库基础构架.它提供了一系列的工具,可以用来进行数据提取转化加载(ETL),这是一种可以存储.查询和分析存储在 Hadoop 中的大规模数据 ...

  9. Hive综合HBase——经Hive阅读/书写 HBase桌子

    社论: 本文将Hive与HBase整合在一起,使Hive能够读取HBase中的数据,让Hadoop生态系统中最为经常使用的两大框架互相结合.相得益彰. watermark/2/text/aHR0cDo ...

  10. hive优化之——控制hive任务中的map数和reduce数

    一.    控制hive任务中的map数: 1.    通常情况下,作业会通过input的目录产生一个或者多个map任务.主要的决定因素有: input的文件总个数,input的文件大小,集群设置的文 ...

随机推荐

  1. go的优缺点

    1.1 不允许左花括号另起一行1.2 编译器莫名其妙地给行尾加上分号1.3 极度强调编译速度,不惜放弃本应提供的功能1.4 错误处理机制太原始1.5 垃圾回收器(GC)不完善.有重大缺陷1.6 禁止未 ...

  2. SVN与Eclipse整合

    SVN与Eclipse整合 下载SVN插件(http://subclipse.tigris.org) 我们使用版本eclipse_svn_site-1.6.5.zip 解压到一个文件夹中 进入ecli ...

  3. 遍历并remove HashMap中的元素时,遇到ConcurrentModificationException

    遍历并remove HashMap中的元素时,遇到ConcurrentModificationException for (Map.Entry<ImageView, UserConcise> ...

  4. 什么是I帧,P帧,B帧

    视频压缩中,每帧代表一幅静止的图像.而在实际压缩时,会采取各种算法减少数据的容量,其中IPB就是最常见的.  简单地说,I帧是关键帧,属于帧内压缩.就是和AVI的压缩是一样的. P是向前搜索的意思.B ...

  5. !!流行的php面试题及答案

    分类: 1.在PHP中,当前脚本的名称(不包括路径和查询字符串)记录在预定义变量(1)中:而链接到当前页面的URL记录在预定义变量(2)中. 答:echo $_SERVER['PHP_SELF']; ...

  6. HDU 2836 Traversal 简单DP + 树状数组

    题意:给你一个序列,问相邻两数高度差绝对值小于等于H的子序列有多少个. dp[i]表示以i为结尾的子序列有多少,易知状态转移方程为:dp[i] = sum( dp[j] ) + 1;( abs( he ...

  7. 分布式ActiveMQ集群

    分布式ActiveMQ集群的部署配置细节: 官方资料:http://activemq.apache.org/clustering.html 基本上看这个就足够了,本文就不具体分析配置文件了. 1.Qu ...

  8. LinuxShell算术运算

    Bash shell 的算术运算有四种方式:1:使用 expr 外部程式 加法 r=`expr 4 + 5`echo $r注意! '4' '+' '5' 这三者之间要有空白r=`expr 4 * 5` ...

  9. Python模块整理(三):子进程模块subprocess

    文章 原始出处 http://ipseek.blog.51cto.com/1041109/807513. 本来收集整理网络上相关资料后整理: 从python2.4版本开始,可以用subprocess这 ...

  10. h.264码流解析_一个SPS的nalu及获取视频的分辨率

    00 00 00 01 67 42 00 28 E9 00  A0 0B 77 FE 00 02 00 03 C4 80  00 00 03 00 80 00 00 1A 4D 88  10 94 0 ...