Spark Notes - Official Docs Study: Configuration (Part 2)
### Spark SQL
Running the `SET -v` command will show the entire list of SQL configuration properties.
```scala
// spark is an existing SparkSession
spark.sql("SET -v").show(numRows = 200, truncate = false)
```

```java
// spark is an existing SparkSession
spark.sql("SET -v").show(200, false);
```

```python
# spark is an existing SparkSession
spark.sql("SET -v").show(n=200, truncate=False)
```

```r
sparkR.session()
properties <- sql("SET -v")
showDF(properties, numRows = 200, truncate = FALSE)
```
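When only a single value is needed, the runtime configuration API is a lighter-weight alternative to scanning the full `SET -v` output. A minimal Scala sketch (the key used below is just an example):

```scala
// spark is an existing SparkSession
// Read one SQL configuration value directly...
println(spark.conf.get("spark.sql.shuffle.partitions"))

// ...or override it for the current session only.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```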
### Spark Streaming
| Property Name | Default | Meaning |
|---|---|---|
| spark.streaming.backpressure.enabled | false | Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set (see below). |
| spark.streaming.backpressure.initialRate | not set | The initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled. |
| spark.streaming.blockInterval | 200ms | Interval at which data received by Spark Streaming receivers is chunked into blocks of data before being stored in Spark. Minimum recommended - 50 ms. See the performance tuning section in the Spark Streaming programming guide for more details. |
| spark.streaming.receiver.maxRate | not set | Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number puts no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details. |
| spark.streaming.receiver.writeAheadLog.enable | false | Enable write-ahead logs for receivers. All input data received through receivers will be saved to write-ahead logs so that it can be recovered after driver failures. See the deployment guide in the Spark Streaming programming guide for more details. |
| spark.streaming.unpersist | true | Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. Setting this to false allows the raw data and persisted RDDs to be accessible outside the streaming application, as they will not be cleared automatically, but it comes at the cost of higher memory usage in Spark. |
| spark.streaming.stopGracefullyOnShutdown | false | If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately. |
| spark.streaming.kafka.maxRatePerPartition | not set | Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API. See the Kafka Integration guide for more details. |
| spark.streaming.kafka.maxRetries | 1 | Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means the driver will make a maximum of 2 attempts). Only applies to the new Kafka direct stream API. |
| spark.streaming.ui.retainedBatches | 1000 | How many batches the Spark Streaming UI and status APIs remember before garbage collecting. |
| spark.streaming.driver.writeAheadLog.closeFileAfterWrite | false | Whether to close the file after writing a write-ahead log record on the driver. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the metadata WAL on the driver. |
| spark.streaming.receiver.writeAheadLog.closeFileAfterWrite | false | Whether to close the file after writing a write-ahead log record on the receivers. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the data WAL on the receivers. |
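As a hedged illustration of how some of these keys are typically set before the streaming context is created, here is a minimal Scala sketch; the rate limits and the 5-second batch interval are arbitrary example values, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Example values only; tune them for your own workload.
val conf = new SparkConf()
  .setAppName("streaming-config-sketch")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "1000")
  .set("spark.streaming.receiver.maxRate", "5000")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")

// 5-second batch interval, chosen arbitrarily for this sketch.
val ssc = new StreamingContext(conf, Seconds(5))
```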
### SparkR
| Property Name | Default | Meaning |
|---|---|---|
| spark.r.numRBackendThreads | 2 | Number of threads used by RBackend to handle RPC calls from the SparkR package. |
| spark.r.command | Rscript | Executable for executing R scripts in cluster modes, for both driver and workers. |
| spark.r.driver.command | spark.r.command | Executable for executing R scripts in client mode for the driver. Ignored in cluster modes. |
| spark.r.shell.command | R | Executable for executing the sparkR shell in client mode for the driver. Ignored in cluster modes. It is the same as the environment variable SPARKR_DRIVER_R, but takes precedence over it. spark.r.shell.command is used for the sparkR shell, while spark.r.driver.command is used for running R scripts. |
| spark.r.backendConnectionTimeout | 6000 | Connection timeout, in seconds, set by the R process on its connection to RBackend. |
| spark.r.heartBeatInterval | 100 | Interval for heartbeats sent from the SparkR backend to the R process to prevent connection timeout. |
### GraphX
| Property Name | Default | Meaning |
|---|---|---|
| spark.graphx.pregel.checkpointInterval | -1 | Checkpoint interval for the graph and messages in Pregel. It is used to avoid a StackOverflowError caused by long lineage chains after many iterations. Checkpointing is disabled by default. |
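A minimal Scala sketch of how this interval might be used, assuming a hypothetical checkpoint directory and edge-list file; Pregel-based algorithms such as `connectedComponents` pick the interval up from the job's Spark configuration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Illustrative value: checkpoint every 10 Pregel iterations.
val conf = new SparkConf()
  .setAppName("graphx-checkpoint-sketch")
  .set("spark.graphx.pregel.checkpointInterval", "10")

val sc = new SparkContext(conf)
// Checkpointing needs a (hypothetical) reliable checkpoint directory.
sc.setCheckpointDir("/tmp/graphx-checkpoints")

// followers.txt is a hypothetical edge-list file; connectedComponents
// runs Pregel internally and honors the interval set above.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val cc = graph.connectedComponents().vertices
```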
### Deploy
| Property Name | Default | Meaning |
|---|---|---|
| spark.deploy.recoveryMode | NONE | The recovery mode used to recover submitted Spark jobs (cluster mode) when the master fails and is relaunched. Only applicable when running in cluster mode with Standalone or Mesos. |
| spark.deploy.zookeeper.url | None | When `spark.deploy.recoveryMode` is set to ZOOKEEPER, this configuration sets the ZooKeeper URL to connect to. |
| spark.deploy.zookeeper.dir | None | When `spark.deploy.recoveryMode` is set to ZOOKEEPER, this configuration sets the ZooKeeper directory used to store recovery state. |
### Cluster Managers
Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode:
#### [YARN](running-on-yarn.html#configuration)
#### [Mesos](running-on-mesos.html#configuration)
#### [Standalone Mode](spark-standalone.html#cluster-launch-scripts)
### Environment Variables
Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes, this file can give machine-specific information such as hostnames. It is also sourced when running local Spark applications or submission scripts.

Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable.

The following variables can be set in `spark-env.sh`:
| Environment Variable | Meaning |
|---|---|
| JAVA_HOME | Location where Java is installed (if it's not on your default PATH). |
| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is python2.7 if available, otherwise python). The property spark.pyspark.python takes precedence if it is set. |
| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). The property spark.pyspark.driver.python takes precedence if it is set. |
| SPARKR_DRIVER_R | R binary executable to use for the SparkR shell (default is R). The property spark.r.shell.command takes precedence if it is set. |
| SPARK_LOCAL_IP | IP address of the machine to bind to. |
| SPARK_PUBLIC_DNS | Hostname your Spark program will advertise to other machines. |
In addition to the above, there are also options for setting up the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as the number of cores to use on each machine and the maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically; for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.

Note: when running Spark on YARN in `cluster` mode, environment variables need to be set using the `spark.yarn.appMasterEnv.[EnvironmentVariableName]` property in your `conf/spark-defaults.conf` file. Environment variables set in `spark-env.sh` will not be reflected in the YARN Application Master process in `cluster` mode. See the [YARN-related Spark properties](running-on-yarn.html#spark-properties) for more information.

### Configuring Logging
Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.

### Overriding configuration directory
To specify a configuration directory other than the default `SPARK_HOME/conf`, you can set `SPARK_CONF_DIR`. Spark will use the configuration files (`spark-defaults.conf`, `spark-env.sh`, `log4j.properties`, etc.) from this directory.

### Inheriting Hadoop Cluster Configuration
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath:
* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

The location of these configuration files varies across Hadoop versions, but a common location is inside `/etc/hadoop/conf`. Some tools create configurations on-the-fly, but offer a mechanism to download copies of them. To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh` to a location containing the configuration files.
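As a small illustration of the effect (the path below is hypothetical), once `core-site.xml` is on the classpath, a path with no explicit scheme resolves against the configured default filesystem:

```scala
// spark is an existing SparkSession, and HADOOP_CONF_DIR was exported
// in conf/spark-env.sh before this application was launched.
// "/data/events.txt" is a hypothetical path; without an explicit scheme
// it resolves against fs.defaultFS from core-site.xml, i.e. HDFS here.
val lines = spark.read.textFile("/data/events.txt")
println(s"line count: ${lines.count()}")
```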