Reading Learning Spark: Lightning-Fast Big Data Analysis, Chapter 1 ~ Chapter 2
Chapter 1: Introduction to Data Analysis with Spark
The components of Spark
Spark Core (contains the basic functionality of Spark; Spark Core is also the home of the API that defines RDDs).
Spark SQL (structured data) is the package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, and it supports many sources of data, including Hive tables, Parquet, and JSON. It also allows developers to intermix SQL queries with the programmatic data manipulation supported by RDDs in Python, Java, and Scala (a minimal query sketch follows this list).
Spark Streaming (real-time) enables processing of live streams of data.
MLlib (machine learning) is a library of common machine learning functionality.
GraphX (graph processing) is a library for manipulating graphs.
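A minimal sketch of the Spark SQL workflow described above, in the pyspark shell (assuming the shell's sc, the Spark 1.x SQLContext API, and a hypothetical people.json file):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # entry point for Spark SQL in Spark 1.x
people = sqlContext.jsonFile("people.json")  # load JSON data as a table-like structure
people.registerTempTable("people")  # register it so it can be queried via SQL
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
print teenagers.collect()  # continue with programmatic manipulation on the result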
A Brief History of Spark
Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers.
Chapter 2: Downloading Spark and Getting Started
This chapter walks through the process of downloading and running Spark in local mode on a single computer.
You don't need to master Scala, Java, or Python. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 or newer. If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.
To download Spark, select the "Pre-built for Hadoop 2.4 and later" package.
Tips:
Windows users may run into issues installing Spark. You can use a zip tool to untar the .tar file. Note: install Spark in a directory with no spaces (e.g., C:\spark).
After you untar the archive you will get a new directory with the same name but without the final .tar suffix (a sample untar command follows).
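On Linux or Mac, the untar step is a single command. A sketch, assuming the downloaded archive is named spark-1.2.0-bin-hadoop2.4.tgz (the actual filename depends on the package and version you selected):

tar -xf spark-1.2.0-bin-hadoop2.4.tgz   # unpack the downloaded archive
cd spark-1.2.0-bin-hadoop2.4            # enter the resulting directory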
A note from the book:
Most of this book includes code in all of Spark’s languages, but interactive shells are
available only in Python and Scala. Because a shell is very useful for learning the API, we recommend using one of these languages for these examples even if you are a Java
developer. The API is similar in every language.
Change into the Spark directory and type bin\pyspark; you will see the Spark logo.
Introduction to Core Spark Concepts
Driver program
|----your application
|----distributed datasets that you defined
Usually we apply many operations on these datasets.
*** In the preceding example, the driver program was the Spark shell itself, and you could type in the operations you wanted.
*** The driver program accesses Spark through a SparkContext object, which represents the connection to a computing cluster. In the shell, a SparkContext is automatically created for you as the variable sc; in pyspark you can print information about this object by typing "sc". Note that the SparkContext API is available in Java, Python, and Scala.
Once you have a SparkContext, you use it to build RDDs, which support operations such as count() and first(). To run these operations, the driver program typically manages a number of nodes called executors. When you call an operation on a cluster, different machines might count lines in different ranges of the file. Because we ran the Spark shell locally, it executed all its work on a single machine.
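A short pyspark-shell sketch of the above (the README.md file ships in the Spark home directory; outputs abbreviated):

>>> sc
<pyspark.context.SparkContext object at 0x...>
>>> lines = sc.textFile("README.md")  # build an RDD of the file's lines
>>> lines.count()                     # total number of lines
>>> lines.first()                     # the first line of the file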
Passing Functions to Spark
Look at the following example in Python:
lines = sc.textFile("README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()
If you are unfamiliar with the lambda syntax: it is a shorthand way to define a function inline in Python or Scala. Alternatively, you can define the function separately and then pass its name to Spark, like this:
def hasPython(line):  # returns True when a line contains "Python"
    return "Python" in line

pythonLines = lines.filter(hasPython)
Of course you can also write this in Java, but there functions are defined as classes implementing the interface Function:
JavaRDD<String> pythonLines = lines.filter(new Function<String, Boolean>()
{
    public Boolean call(String line)
    {
        return line.contains("Python");
    }
});
Nowadays Java 8 supports lambda expressions, which make this much more concise (see the sketch below).
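A sketch of the same filter with a Java 8 lambda, assuming the lines JavaRDD from the example above:

JavaRDD<String> pythonLines = lines.filter(line -> line.contains("Python"));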
Spark automatically takes your function (e.g., line.contains("Python")) and ships it to executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes.
Standalone Applications
Apart from running interactively, Spark can be linked into standalone applications in Java, Python, or Scala. The main difference from using it in the shell is that you need to initialize your own SparkContext; after that, the API is the same. Remember, in the shell the SparkContext is created automatically for you as sc and you can use it directly.
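A minimal sketch of initializing a SparkContext in a standalone Python application (the master "local" and app name "My App" are illustrative; the filename my_script.py matches the spark-submit example below):

# my_script.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)  # created explicitly, unlike in the shell

lines = sc.textFile("README.md")
print lines.count()  # Python 2 print statement, per the version requirements above

sc.stop()  # shut down Spark when the application is done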
The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. Maven is a popular package management tool for JVM-based languages that lets you link to libraries in public repositories. You can use Maven itself to build your project, or use other tools that can talk to the Maven repositories, including Scala's sbt or Gradle. Popular IDEs like Eclipse also allow you to directly add a Maven dependency to a project.
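For example, at the time of the book the spark-core coordinates looked roughly like this (the Scala suffix and the version are assumptions that depend on the release you use):

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0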
In Python, you simply write applications as Python scripts, but you must run them using the bin\spark-submit script included in Spark. The spark-submit script includes the Spark dependency for us in Python, and it sets up the environment for Spark's Python API to function. Simply run your script like this:
bin\spark-submit my_script.py
(Note that the backslashes here are the Windows style; on Linux and Mac use forward slashes instead.)