After reading a bit of 《数据算法:Hadoop/Spark大数据处理技巧》 (Data Algorithms: Recipes for Scaling Up with Hadoop and Spark), I felt it was worth getting a basic understanding of Spark; hence this post.

Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.

Contrary to a common belief, Spark is not a modified version of Hadoop, nor does it really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.

Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.

Hadoop is the standard tool for distributed computing over large datasets, allowing supercomputer-class computation on clusters of relatively cheap commodity hardware. Still, Hadoop has well-known limitations. The I/O cost of MapReduce jobs is high, which makes interactive analysis and iterative algorithms expensive; yet in practice, almost all optimization and machine learning is iterative. To address these problems, Hadoop has been moving toward a more general resource management framework, YARN (Yet Another Resource Negotiator). Spark is the first fast, general-purpose distributed computing paradigm to emerge from that shift: it has its own cluster management and is not a new version of Hadoop. Hadoop is only one of the ways to deploy Spark, and then only for storage.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases an application's processing speed.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Besides supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Spark is a lightweight, fast cluster computing technology designed for speed. It extends the MapReduce model with a functional programming paradigm to support more types of computation, covering a wide range of workflows that previously had to be implemented as special-purpose systems on top of Hadoop. Spark uses in-memory caching to improve performance, which makes it fast enough for interactive analysis; caching also speeds up iterative algorithms, making Spark a good fit for data science tasks, machine learning in particular.
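
To make the caching point concrete, here is a minimal Scala sketch of an iterative job over a cached RDD; the input file numbers.txt, the ten passes, and the thresholds are all illustrative assumptions, not anything from a real benchmark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    // numbers.txt (one number per line) is a placeholder input.
    // cache() keeps the parsed RDD in cluster memory after the first pass.
    val data = sc.textFile("numbers.txt").map(_.toDouble).cache()

    // Ten passes over the same dataset: only the first pass reads from
    // disk; the other nine are served from memory. With plain MapReduce,
    // every pass would re-read the input from HDFS.
    for (i <- 1 to 10) {
      println(s"values greater than $i: ${data.filter(_ > i).count()}")
    }
    sc.stop()
  }
}
```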

Evolution of Apache Spark

Spark began in 2009 as one of Hadoop's sub-projects, developed in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

Spark was developed at the AMPLab of UC (University of California) Berkeley.

Features of Apache Spark

Apache Spark has the following features.

  • Speed − Spark can run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster on disk. It achieves this by reducing the number of read/write operations to disk and keeping intermediate processing data in memory.
  • Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with about 80 high-level operators for interactive querying; a few of them appear in the sketch after this list.
  • Advanced analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
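
As a taste of those operators, here is a minimal word-count sketch in Scala; the input path input.txt is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count").setMaster("local[*]"))

    // flatMap, map and reduceByKey are three of Spark's high-level operators.
    val counts = sc.textFile("input.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))             // split each line into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```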

Spark Built on Hadoop

There are three ways Spark can be deployed with Hadoop components, as explained below (a code-level sketch of the corresponding master URLs follows the list).

  • Standalone − A standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), with space explicitly allocated for HDFS. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
  • Hadoop YARN − A YARN deployment simply means Spark runs on YARN, with no pre-installation or root access required. This makes it easy to integrate Spark into the Hadoop ecosystem or stack, and lets other components run on top of the same stack.
  • Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to a standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
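
In code, the choice of deployment mode mostly comes down to the master URL handed to Spark. A sketch, with placeholder host names (runnable in the spark-shell or any project with Spark on the classpath):

```scala
import org.apache.spark.SparkConf

// "spark://..." points at a standalone cluster's master,
// "yarn" (Spark 2.x syntax) submits to a YARN resource manager, and
// "local[*]" runs everything in one JVM, which is handy for testing.
val standalone = new SparkConf().setMaster("spark://master-host:7077")
val onYarn     = new SparkConf().setMaster("yarn")
val local      = new SparkConf().setMaster("local[*]")
```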

Components of Spark

Spark consists of the following components.

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

It contains Spark's basic functionality, most notably the API that defines RDDs, together with the operations and actions on them.
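
A minimal illustration of that API, runnable in the spark-shell (where sc, a SparkContext, is predefined): transformations only record lineage, and actions trigger the actual computation.

```scala
// Transformations (filter, map) are lazy: nothing runs yet.
val nums    = sc.parallelize(1 to 100)    // create an RDD from a local range
val evens   = nums.filter(_ % 2 == 0)     // transformation
val squares = evens.map(n => n * n)       // transformation

// Actions launch the job and return results to the driver.
println(squares.count())                  // 50
println(squares.take(3).mkString(", "))   // 4, 16, 36
```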

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.
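
A small sketch of the idea, using the later DataFrame incarnation of the API in the spark-shell (where spark, a SparkSession, is predefined in Spark 2.x); people.json is a placeholder file with one JSON record per line.

```scala
// people.json: lines like {"name": "alice", "age": 30}
val people = spark.read.json("people.json")

// Register the data as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()
```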

Spark Streaming

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
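
A minimal sketch of the mini-batch model, assuming a SparkContext sc (as in the spark-shell) and a text source on a placeholder socket:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group incoming text into 1-second mini-batches.
val ssc   = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source

// Ordinary RDD-style transformations, applied to each mini-batch.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing
ssc.awaitTermination()  // run until stopped externally
```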

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework on top of Spark, made possible by Spark's distributed, memory-based architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

MLlib is a library of common machine learning algorithms, implemented as Spark operations on RDDs.
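
Since the benchmark above concerns ALS, here is a sketch of MLlib's ALS API, assuming a SparkContext sc and a placeholder ratings.csv of user,product,rating lines; the rank and iteration count are illustrative.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" lines into MLlib Rating objects.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Factorize the ratings matrix: rank 10, 10 iterations (illustrative values).
val model = ALS.train(ratings, 10, 10)

// Predict the rating user 1 would give product 42.
println(model.predict(1, 42))
```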

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API, along with an optimized runtime for this abstraction.
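
A small sketch of the GraphX API, assuming a SparkContext sc: build a three-vertex cycle and run the built-in PageRank implementation.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a name; edges carry an (unused here) integer attribute.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// Iterate PageRank until scores change by less than the tolerance.
graph.pageRank(0.001).vertices.collect().foreach(println)
```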

The Spark libraries themselves include many building blocks that apply to most big-data applications, including support for SQL-like queries over big data, machine learning and graph algorithms, and even real-time stream processing.
