Chapter 1: Introduction to Data Analysis with Spark

The components of Spark

  Spark Core (contains the basic functionality of Spark; Spark Core is also home to the API that defines RDDs)

  Spark SQL (structured data) is the package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, and it supports many sources of data, including Hive tables, Parquet, and JSON. It also allows developers to intermix SQL queries with the programmatic data manipulation supported by RDDs in Python, Java, and Scala (see the short sketch after this list).

  Spark Streaming (real-time) enables processing of live streams of data.

  MLlib (machine learning) is a library of common machine learning functionality.

  GraphX (graph processing) is a library for manipulating graphs.
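
To make the Spark SQL point concrete, here is a minimal sketch of intermixing SQL with RDD-style code. It assumes a Spark 1.3-or-later SQLContext, and people.json (with name and age fields) is a hypothetical input file:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "SQL example")
sqlContext = SQLContext(sc)

# Load a hypothetical JSON file of people records as a DataFrame
people = sqlContext.read.json("people.json")
people.registerTempTable("people")

# Intermix a SQL query with ordinary RDD-style manipulation
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
names = adults.rdd.map(lambda row: row.name)
print(names.collect())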

A Brief History of Spark

  Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers.

Chapter 2: Downloading Spark and Getting Started

  This chapter walks through the process of downloading and running Spark in local mode on a single computer.

  You don't need to master Scala, Java, or Python. Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 or newer. If you wish to use the Python API, you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.

When downloading Spark, select the package labeled "Pre-built for Hadoop 2.4 and later".

Tips:

Windows users may run into issues when installing. You can use a tool such as 7-Zip to extract the .tar file. Note: install Spark in a directory with no spaces in its path (e.g., C:\spark).

After you untar the download, you will get a new directory with the same name but without the final .tar suffix.

Note:

Most of this book includes code in all of Spark’s languages, but interactive shells are
available only in Python and Scala. Because a shell is very useful for learning the API, we recommend using one of these languages for these examples even if you are a Java
developer. The API is similar in every language.

Change into the Spark directory and type bin\pyspark (or bin/pyspark on Linux and Mac); you will see the Spark logo.

Introduction to Core Spark Concepts

Driver program

  |----your application

  |----distributed datasets that you defined

  Usually we apply many operations on these datasets.

  *** In the preceding example, the driver program was the Spark shell itself, and you could simply type in the operations that you wanted to run.

  *** The driver program accesses Spark through a SparkContext object, which represents a connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable sc; in pyspark you can print information about this object by typing "sc". There is a corresponding SparkContext type in each of the Java, Python, and Scala APIs.
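
For example, typing sc in the pyspark shell prints something like the following (the exact output and the memory address are illustrative only):

>>> sc
<pyspark.context.SparkContext object at 0x1025b8f90>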

Once you have a SparkContext, you can use it to build RDDs, which support operations such as count() and first(). To run these operations, the driver program typically manages a number of nodes called executors. For example, if you ran count() on a cluster, different machines might count lines in different ranges of the file. Because we ran the Spark shell locally, it executed all its work on a single machine.

Passing Functions to Spark

  Look at the following example in Python:

lines = sc.textFile("README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()

If you are unfamiliar with the lambda syntax, it is a shorthand way to define functions inline in Python and Scala. You can also define the function separately and then pass its name to Spark, like this:

def hasPython(line):
    # Return True if this line contains the word "Python"
    return "Python" in line

pythonLines = lines.filter(hasPython)

Of course, you can also write this in Java, but there functions are defined as classes implementing the interface Function:

JavaRDD<String> pythonLines = lines.filter(
    new Function<String, Boolean>() {
        public Boolean call(String line) {
            return line.contains("Python");
        }
    });

Nowadays, Java 8 supports lambda expressions, which allow the same filter to be written much more concisely, e.g. lines.filter(line -> line.contains("Python")).

Spark automatically takes your functions (e.g., line.contains("Python")) and ships them to executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes.

 Standalone Applications

Apart from running interactively, Spark can be linked into standalone applications in Java, Python, or Scala. The main difference from using it in the shell is that you need to initialize your own SparkContext; after that, the API is the same. Remember, when you use the shell, a SparkContext is created automatically for you as "sc", and you can use it directly.
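
For example, in Python the initialization takes just a few lines (a minimal sketch; "My App" and the local master setting are placeholder values):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)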

The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact (for example, groupId org.apache.spark and artifactId spark-core_2.10, with the version matching your Spark release). Maven is a popular package management tool for JVM-based languages that lets you link to libraries in public repositories. You can use Maven itself to build your project, or use other tools that can talk to the Maven repositories, including Scala's sbt or Gradle. Popular IDEs like Eclipse also allow you to directly add a Maven dependency to a project.

In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark. The spark-submit script includes the Spark dependencies for us in Python, and it sets up the environment for Spark's Python API to function. Simply run your script like this:

bin/spark-submit my_script.py

(Note that you will have to use backslashes instead of forward slashes on Windows.)
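
Putting the pieces together, a complete (hypothetical) my_script.py might look like the following; README.md is a placeholder input path:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My Script")
sc = SparkContext(conf=conf)

# Count the lines that mention Python in a placeholder input file
lines = sc.textFile("README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
print(pythonLines.count())

Run it with bin/spark-submit my_script.py from the Spark directory.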
