An Introduction to the Hortonworks Hadoop Ecosystem

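As a pioneer of the Apache Hadoop 2.0 community, Hortonworks has assembled its own Hadoop ecosystem: HDFS for data storage; YARN for resource management; compute engines such as MapReduce and Tez; Pig, Hive with HCatalog, and HBase serving the data platform; Flume and Sqoop for moving data into and out of HDFS; Ambari for cluster monitoring; Falcon for data lifecycle management; and Oozie for job scheduling. This article briefly introduces the concept behind each system. Most of these systems are open source under Apache, so readers can download and try them on their own.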

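The architecture of the Hortonworks Hadoop ecosystem is shown in Figure 1.

Figure 1. Architecture of the Hortonworks Hadoop ecosystem

HDFS, YARN, MapReduce, and Tez are not covered further here.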

1. Hortonworks Data Platform (HDP)

The Hortonworks Data Platform is abbreviated as HDP.

2. Apache™ Accumulo

Apache™ Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Bigtable design that works on top of Apache Hadoop® and Apache ZooKeeper.
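To make "cell-level access control" concrete, here is a minimal pure-Python sketch. It is not Accumulo's API. The `Cell` class, the `scan` function, and the rule that a reader must hold every label on a cell are simplifying assumptions made for illustration only; real visibility rules can be more expressive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    # Labels a reader must ALL hold to see this cell (toy rule, an assumption).
    visibility: frozenset = frozenset()


def scan(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.visibility <= auths]


# Hypothetical data: one row whose cells carry different labels.
table = [
    Cell("patient1", "name", "Alice", frozenset()),
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "$120", frozenset({"finance"})),
]

nurse_view = scan(table, {"medical"})
print([c.column for c in nurse_view])  # name and diagnosis, but not billing
```

The point of the sketch is that filtering happens per cell rather than per table or per row, so two readers scanning the same row can see different columns.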

3. Apache™ Flume

Apache™ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery.

4. Apache™ HBase

Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS). It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to perform updates, inserts, and deletes.
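The idea of sparse, column-oriented rows with updates, inserts, and deletes can be sketched in a few lines of plain Python. This is a toy in-memory model, not the HBase client API. The class `SparseTable` and the `"family:qualifier"` column strings are assumptions chosen for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for a wide-column table: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        # Insert and update are the same operation: the newest value wins.
        self._rows[row][column] = value

    def get(self, row):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self._rows.get(row, {}))

    def delete(self, row, column):
        cells = self._rows.get(row)
        if cells is not None:
            cells.pop(column, None)
            if not cells:
                del self._rows[row]  # a row with no cells simply does not exist


t = SparseTable()
t.put("user1", "info:name", "Alice")
t.put("user1", "info:email", "alice@example.com")
t.put("user2", "info:name", "Bob")          # user2 stores no email cell at all
t.put("user1", "info:name", "Alice Smith")  # update in place
t.delete("user1", "info:email")
print(t.get("user1"))
```

Because missing columns take no space, rows in the same table can carry very different sets of columns, which is what "sparse data" refers to.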

5. Apache™ HCatalog

Apache™ HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.

6. Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Apache™ Hadoop® for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization.

7. Apache™ Mahout

Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Once big data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in those big data sets. The Apache Mahout project aims to make it faster and easier to turn big data into big information.

8. Apache™ Pig

Apache™ Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set, such as aggregate, join, and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop®. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java or a scripting language and then call directly from Pig Latin.
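To give a feel for what "translating a script into MapReduce" means, the sketch below hand-writes the three stages that a grouping followed by a sum and a sort might become. It is plain Python, not Pig and not Hadoop. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample `orders` data are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(records):
    # Emit (key, value) pairs; here the key is the user and the value the amount.
    for user, amount in records:
        yield user, amount


def shuffle(pairs):
    # Group all values that share a key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Aggregate each group; sorting the keys mimics ordering the final result.
    return [(key, sum(values)) for key, values in sorted(groups.items())]


orders = [("bob", 5), ("alice", 10), ("bob", 7)]  # hypothetical input records
totals = reduce_phase(shuffle(map_phase(orders)))
print(totals)
```

A script author only states the grouping and the aggregate; writing and wiring these stages by hand is the work a higher-level language is meant to remove.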

9. Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.
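As a rough illustration of an import, the snippet below only assembles a `sqoop import` command line and prints it; nothing is executed. The JDBC URL, user name, table, and target directory are hypothetical, and the helper `sqoop_import_command` is an assumption introduced for this sketch. Credentials are deliberately left out.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for a `sqoop import` run; nothing is executed."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales",  # hypothetical database
    "etl_user",                           # hypothetical account
    "orders",
    "/data/raw/orders",
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting mistakes if it is later handed to a process launcher.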

10. Apache Ambari

Apache Ambari is a 100-percent open source operational framework for provisioning, managing, and monitoring Apache Hadoop clusters, and it is the project chosen as the management component of the Hortonworks Data Platform. Ambari supports Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog, and ZooKeeper. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters.

10.1 Manage a Hadoop cluster

Ambari provides tools to simplify cluster management. The Web interface allows you to start, stop, and test Hadoop services, change configurations, and manage the ongoing growth of your cluster.

10.2 Monitor a Hadoop cluster

Gain instant insight into the health of your cluster. Ambari pre-configures alerts for watching Hadoop services and visualizes cluster operational data in a simple Web interface.

Ambari also includes job diagnostic tools that visualize job interdependencies and show task timelines, which helps when troubleshooting the performance of past job runs.

10.3 Integrate Hadoop with other applications

Ambari provides a RESTful API that enables integration with existing tools, such as Microsoft System Center and Teradata Viewpoint. Ambari also leverages standard technologies and protocols with Nagios and Ganglia for deeper customization.
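As a hedged sketch of what calling Ambari's RESTful API could look like, the snippet below builds, but does not send, a request that asks Ambari to stop a service. It assumes the `/api/v1/clusters/<cluster>/services/<service>` resource layout and the convention that a stopped service is in the `INSTALLED` state; check both against the documentation for your Ambari version. The host, cluster name, and helper function are hypothetical, and authentication is omitted.

```python
import json
import urllib.request


def stop_service_request(base_url, cluster, service):
    """Build (but do not send) a request asking Ambari to stop a service."""
    url = f"{base_url}/api/v1/clusters/{cluster}/services/{service}"
    body = {
        "RequestInfo": {"context": f"Stop {service} via REST"},
        # Assumption: a stopped service is represented by the INSTALLED state.
        "Body": {"ServiceInfo": {"state": "INSTALLED"}},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        # Assumption: modifying calls are expected to carry this header.
        headers={"X-Requested-By": "ambari"},
    )


# Hypothetical host and cluster name.
req = stop_service_request("http://ambari.example.com:8080", "demo", "HDFS")
print(req.get_method(), req.full_url)
```

An external monitoring or orchestration tool would send such a request with credentials attached and then poll the returned operation until it completes.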

11. Apache™ Falcon

Apache™ Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage, and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can rely on the well-tested Apache Falcon framework for these functions. Falcon’s simplification of data management is useful to anyone building applications on Hadoop.

12. Apache™ Oozie

Apache™ Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, such as Java programs or shell scripts.
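The idea of chaining several jobs into one unit of work can be illustrated with a small dependency graph. This is a conceptual sketch in plain Python, not an Oozie workflow definition (real workflows are described to Oozie in its own definition files). The action names and the `workflow` mapping are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "notify": {"hive_report"},
}

# Resolve an execution order in which every action runs after its prerequisites.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
```

A scheduler works from the same kind of dependency information: it starts an action only once everything it depends on has finished, and it treats the whole chain as a single job from the operator's point of view.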

13. Apache ZooKeeper

Apache ZooKeeper provides operational services for a Hadoop cluster. ZooKeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed systems. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
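One way to picture "mediating updates" is a versioned write: a client may change a value only if it names the version it last read. The sketch below is a toy, single-process model of that idea, not the ZooKeeper client API. The class `ConfigStore`, its methods, and the sample paths are assumptions for illustration.

```python
class ConfigStore:
    """Toy in-memory stand-in for versioned configuration nodes."""

    def __init__(self):
        self._nodes = {}  # path -> (value, version)

    def create(self, path, value):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (value, 0)

    def get(self, path):
        return self._nodes[path]

    def set(self, path, value, expected_version):
        # Reject the write if someone else updated the node in the meantime.
        _, current = self._nodes[path]
        if expected_version != current:
            return False
        self._nodes[path] = (value, current + 1)
        return True


store = ConfigStore()
store.create("/app/db_url", "jdbc:mysql://db1/sales")
_, version = store.get("/app/db_url")
print(store.set("/app/db_url", "jdbc:mysql://db2/sales", version))  # first writer succeeds
print(store.set("/app/db_url", "jdbc:mysql://db3/sales", version))  # stale version is rejected
```

The second writer has to re-read the node and decide again, which prevents two clients from silently overwriting each other's configuration changes.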

14. Knox

The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache™ Hadoop® services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster. Knox runs as a server (or cluster of servers) that serves one or more Hadoop clusters.

Reference: http://hortonworks.com/hadoop/
