Awesome Hadoop
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.
- Awesome Hadoop
- Hadoop
- YARN
- NoSQL
- SQL on Hadoop
- Data Management
- Workflow, Lifecycle and Governance
- Data Ingestion and Integration
- DSL
- Libraries and Tools
- Realtime Data Processing
- Distributed Computing and Programming
- Packaging, Provisioning and Monitoring
- Monitoring
- Search
- Security
- Benchmark
- Machine learning and Big Data analytics
- Misc.
- Resources
- Other Awesome Lists
Hadoop
- Apache Hadoop - Apache Hadoop
- Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- dumbo - Python module that allows you to easily write and run Hadoop programs.
- hadoopy - Python MapReduce library written in Cython.
- mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
- White Elephant - Hadoop log aggregator and dashboard
- Kiji Project
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
- Apache Ignite - Distributed in-memory platform
YARN
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
- mpich2-yarn - Running MPICH2 on Yarn
NoSQL
Next-generation databases, mostly addressing some of these points: being non-relational, distributed, open source and horizontally scalable.
- Apache HBase - Apache HBase
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- happybase - A developer-friendly Python library to interact with Apache HBase.
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
- hindex - Secondary Index for HBase
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
SQL on Hadoop
- Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
- Lingual - SQL interface for Cascading (MR/Tez job generator)
- Cloudera Impala
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
- Apache Tajo - Data warehouse system for Apache Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Apache Trafodion
Data Management
- Apache Calcite - A Dynamic Data Management Framework
- Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
Workflow, Lifecycle and Governance
- Apache Oozie - Apache Oozie
- Azkaban
- Apache Falcon - Data management and processing platform
- Apache NiFi - A dataflow system
- Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
- Luigi - Python package that helps you build complex pipelines of batch jobs
Data Ingestion and Integration
- Apache Flume - Apache Flume
- Suro - Netflix's distributed Data Pipeline
- Apache Sqoop - Apache Sqoop
- Apache Kafka - Apache Kafka
- Gobblin from LinkedIn - Universal data ingestion framework for Hadoop
DSL
- Apache Pig - Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- vahara - Machine learning and natural language processing with Apache Pig
- packetpig - Open Source Big Data Security Analytics
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
- Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Libraries and Tools
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
- Spring for Apache Hadoop
- hdfs - A native Go client for HDFS
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- snakebite
Realtime Data Processing
- Apache Storm
- Apache Samza
- Apache Spark
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly once stream processing.
Distributed Computing and Programming
- Apache Spark
- Spark Packages - A community index of packages for Apache Spark
- SparkHub - A community site for Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
- Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
Packaging, Provisioning and Monitoring
- Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari - Apache Ambari
- Ganglia Monitoring System
- ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache Zookeeper - Apache Zookeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Buildoop - Hadoop Ecosystem Builder
- Deploop - The Hadoop Deploy System
- Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
- inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
Search
- Elasticsearch
- Apache Solr
- SenseiDB - Open-source, distributed, realtime, semi-structured database
- Banana - Kibana port for Apache Solr
Search Engine Framework
- Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.
Security
- Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Sentry - An authorization module for Hadoop
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
Benchmark
- Big Data Benchmark
- HiBench
- Big-Bench
- hive-benchmarks
- hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
- YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
Machine learning and Big Data analytics
- Apache Mahout
- Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHadoop - Includes RHDFS, RHBase, RMR2 and plyrmr
- RHive - RHive, for launching Hive queries from R
- Apache Lens
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
Misc.
- Hive Plugins
- UDF
- http://nexr.github.io/hive-udf/
- https://github.com/edwardcapriolo/hive_cassandra_udfs
- https://github.com/livingsocial/HiveSwarm
- https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
- https://github.com/karthkk/udfs
- https://github.com/twitter/elephant-bird - Twitter
- https://github.com/lovelysystems/ls-hive
- https://github.com/stewi2/hive-udfs
- https://github.com/klout/brickhouse
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/deanwampler/HiveUDFs
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
- https://github.com/Netflix/Surus
- Storage Handler
- https://github.com/dvasilen/Hive-Cassandra
- https://github.com/yc-huang/Hive-mongo
- https://github.com/balshor/gdata-storagehandler
- https://github.com/karthkk/hive-hbase-json
- https://github.com/sunsuk7tp/hive-hbase-integration
- https://bitbucket.org/rodrigopr/redisstoragehandler
- https://github.com/zhuguangbin/HiveJDBCStorageHanlder
- https://github.com/chimpler/hive-solr
- https://github.com/bfemiano/accumulo-hive-storage-manager
- SerDe
- Libraries and tools
- https://github.com/forward3d/rbhive
- https://github.com/synctree/activerecord-hive-adapter
- https://github.com/hrp/sequel-hive-adapter
- https://github.com/forward/node-hive
- https://github.com/recruitcojp/WebHive
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift
- https://github.com/anjuke/hwi
- https://code.google.com/a/apache-extras.org/p/hipy/
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto
- https://github.com/recruitcojp/OdbcHive
- Hive-Sharp
- HiveRunner - An open source unit test framework for Hadoop Hive queries based on JUnit4
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for Hive and hive-service
- UDF
- Flume Plugins
- Flume MongoDB Sink
- Flume HornetQ Channel
- Flume MessagePack Source
- Flume RabbitMQ source and sink
- Flume UDP Source
- Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC
- Flume Custom Serializers
- Real-time analytics in Apache Flume
- .Net FlumeNG Clients
Resources
Various resources, such as books, websites and articles.
Websites
Useful websites and articles
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop 1.x vs 2
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN - Background and an Overview
- Apache Hadoop YARN - Concepts and Applications
- Apache Hadoop YARN - ResourceManager
- Apache Hadoop YARN - NodeManager
- Migrating to MapReduce 2 on YARN (For Users)
- Migrating to MapReduce 2 on YARN (For Operators)
- Hadoop and Big Data: Use Cases at Salesforce.com
- All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
- What is Bigtop, and Why Should You Care?
- Hadoop - Distributions and Commercial Support
- Ganglia configuration for a small Hadoop cluster and some troubleshooting
- Hadoop illuminated - Open Source Hadoop Book
- NoSQL Database
- 10 Best Practices for Apache Hive
- Hadoop Operations at Scale
- AWS BigData Blog
- Hadoop360
- How to monitor Hadoop metrics
Presentations
- Hadoop Summit Presentations - Slide decks from Hadoop Summit
- Hadoop 24/7
- An example Apache Hadoop Yarn upgrade
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Hadoop Performance at LinkedIn
- Docker based Hadoop provisioning
Books
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop Yarn
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Hadoop and Big Data Events