Spark Scala Program Development (Cluster Run Mode): Counting Word Occurrences
Preparation:
Increase the memory of the node that runs Scala-Eclipse (CloudDeskTop) to 4 GB, because local-mode Spark programs run on that node and a local Spark program starts Worker processes that consume a lot of memory.
For the remaining preparation, see: Scala Program Development: Counting Word Occurrences (Local Run Mode).
1. Start the Spark cluster
[hadoop@master01 install]$ cat start-total.sh
#!/bin/bash
echo "Please make sure you have switched to the hadoop user first"
#start the ZooKeeper cluster
for node in hadoop@slave01 hadoop@slave02 hadoop@slave03;do ssh $node "source /etc/profile; cd /software/zookeeper-3.4.10/bin/; ./zkServer.sh start; jps";done
#start the HDFS cluster
cd /software/ && start-dfs.sh && jps
#start the Spark cluster
#start the Master process on master01 and the Worker processes on the slave nodes
cd /software/spark-2.1.1/sbin/ && ./start-master.sh && ./start-slaves.sh && jps
#start the Master process on master02
ssh hadoop@master02 "cd /software/spark-2.1.1/sbin/; ./start-master.sh; jps"
#Spark history server; usually left off because it is resource-hungry
#cd /software/spark-2.1.1/sbin/ && ./start-history-server.sh && cd - && jps
The script above brings up the whole Spark cluster.
Check the state of the two NameNodes (nn1 and nn2):
[hadoop@master01 software]$ hdfs haadmin -getServiceState nn1
[hadoop@master01 software]$ hdfs haadmin -getServiceState nn2
One of them must be active; if neither is, switch one to active with:
[hadoop@master01 software]$ hdfs haadmin -transitionToActive --forceactive nn1
2. Upload test files for the word count to the HDFS cluster
[hadoop@master02 install]$ hdfs dfs -ls /test/scala/input
Found 2 items
-rw-r--r-- 3 hadoop supergroup 156 2018-02-07 16:25 /test/scala/input/information
-rw-r--r-- 3 hadoop supergroup 156 2018-02-07 16:25 /test/scala/input/information02
[hadoop@master02 install]$ hdfs dfs -cat /test/scala/input/information
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎
[hadoop@master02 install]$ hdfs dfs -cat /test/scala/input/information02
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎
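Because information02 is an identical copy of information, every word's final count is simply double its count in a single file (shi, for instance, appears 4 times per file, so the job should report (shi,8)). The plain-Scala sketch below is not part of the original post; it just recomputes the expected totals from the six lines above so the cluster output can be checked against them (the ExpectedCounts object name is made up for illustration):
// Recompute the expected word counts locally, assuming both HDFS files
// contain exactly the six lines shown above (information02 is a copy).
object ExpectedCounts {
  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "zhnag san shi yi ge hao ren",
      "jin tian shi yi ge hao tian qi",
      "wo zai zhe li zuo le yi ge ce shi",
      "yi ge guan yu scala de ce shi",
      "welcome to mmzs",
      "欢迎 欢迎")
    // two identical input files, so every line is processed twice
    val counts = (lines ++ lines)
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
    counts.toSeq.sortBy(-_._2).foreach(println) // (shi,8), (yi,8), (ge,8), (hao,4), ...
  }
}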
3. Write the Spark job code
package com.mmzs.bigdata.spark.rdd.cluster

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object WordCount {
  /*
   * Scala has three access levels: private (visible inside the class),
   * protected (visible inside the class and its subclasses) and the default,
   * which is public within the project. There is no `public` keyword in Scala,
   * so classes, methods and fields are public unless a modifier is given.
   */
  var fs: FileSystem = null;

  // instance initializer block: obtain an HDFS FileSystem handle
  {
    val hconf: Configuration = new Configuration();
    fs = FileSystem.get(new URI("hdfs://ns1/"), hconf, "hadoop");
  }

  /**
   * Main entry point
   * @param args
   */
  def main(args: Array[String]): Unit = {
    // build the Spark configuration
    val conf: SparkConf = new SparkConf();
    // local test mode
    //conf.setMaster("local");
    // cluster test mode
    //conf.setMaster("spark://master01:7077");
    // set the application name
    conf.setAppName("Hdfs Scala Spark RDD");

    // create the Spark context from the configuration
    val sc: SparkContext = new SparkContext(conf);

    // read the input into an RDD; passing a directory makes every file in it an input file
    val lineRdd: RDD[String] = sc.textFile("/test/scala/input");
    // split each line on spaces and flatten the word arrays into one RDD of words
    val flatRdd: RDD[String] = lineRdd.flatMap { line => line.split(" ") }
    // turn every word into a (word, 1) tuple so the occurrences can be summed
    val mapRDD: RDD[Tuple2[String, Integer]] = flatRdd.map(word => (word, 1));
    // group by the first tuple element (the word) and sum the counts
    val reduceRDD: RDD[(String, Integer)] = mapRDD.reduceByKey((pre: Integer, next: Integer) => pre + next);
    // swap key and value so the count can be used as the sort key
    val reduceRDD02: RDD[(Integer, String)] = reduceRDD.map(tuple => (tuple._2, tuple._1));
    // sort by key in descending order; the second argument is the number of tasks,
    // setting it too high may cause an out-of-memory exception
    val sortRDD: RDD[(Integer, String)] = reduceRDD02.sortByKey(false, 1);
    // after sorting, swap the pairs back to (word, count) order
    val sortRDD02: RDD[(String, Integer)] = sortRDD.map(tuple => (tuple._2, tuple._1));

    // choose an output directory; if it already exists it must be deleted first
    val dst: Path = new Path("/test/scala/output/");
    if (fs.exists(dst) && fs.isDirectory(dst)) fs.delete(dst, true);
    // save the result to the output path
    reduceRDD.saveAsTextFile("/test/scala/output/");

    // stop the Spark context
    sc.stop();
  }
}
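The job builds sortRDD02 (the counts swapped back to (word, count) order after the descending sort) but then saves the unsorted reduceRDD, which is why the part files shown below are not ordered by count. If the saved output should be sorted instead, a minimal variant of the save step could look like the following; this is only a sketch of an alternative, not what the author actually ran:
// Save the sorted pairs instead of the unsorted reduceRDD; sortByKey(false, 1)
// above already forced a single partition, so this yields one ordered part file.
sortRDD02.saveAsTextFile("/test/scala/output/");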
4. Package and submit the job
4.1 Package the Spark code
#create the packaging directory jarTest under the Eclipse project directory SparkTest
[hadoop@CloudDeskTop software]$ mkdir -p /project/scala/SparkRDD/jarTest
#package the com folder under the Eclipse-compiled bin directory into the jarTest directory
[hadoop@CloudDeskTop software]$ cd /project/scala/SparkTest/bin
[hadoop@CloudDeskTop bin]$ pwd
/project/scala/SparkTest/bin
[hadoop@CloudDeskTop bin]$ jar -cvf /project/scala/SparkRDD/jarTest/wordcount.jar com/
4.2 Submit the Spark job
#change to the bin directory under the Spark installation directory
[hadoop@CloudDeskTop bin]$ pwd
/software/spark-2.1.1/bin
#submit the job to the Spark cluster (note: the --master value spark://master01:7077 must not end with a slash)
[hadoop@CloudDeskTop bin]$ ./spark-submit --master spark://master01:7077 --class com.mmzs.bigdata.spark.rdd.cluster.WordCount /project/scala/SparkRDD/jarTest/wordcount.jar 1
-bash: ./spark-submit: No such file or directory
[hadoop@CloudDeskTop bin]$ cd /software/spark-2.1.1/bin/
[hadoop@CloudDeskTop bin]$ ./spark-submit --master spark://master01:7077 --class com.mmzs.bigdata.spark.rdd.cluster.WordCount /project/scala/SparkRDD/jarTest/wordcount.jar 1
18/02/08 15:21:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/08 15:21:50 INFO spark.SparkContext: Running Spark version 2.1.1
18/02/08 15:21:50 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
18/02/08 15:21:50 INFO spark.SecurityManager: Changing view acls to: hadoop
18/02/08 15:21:50 INFO spark.SecurityManager: Changing modify acls to: hadoop
18/02/08 15:21:50 INFO spark.SecurityManager: Changing view acls groups to:
18/02/08 15:21:50 INFO spark.SecurityManager: Changing modify acls groups to:
18/02/08 15:21:50 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/02/08 15:21:51 INFO util.Utils: Successfully started service 'sparkDriver' on port 36034.
18/02/08 15:21:51 INFO spark.SparkEnv: Registering MapOutputTracker
18/02/08 15:21:51 INFO spark.SparkEnv: Registering BlockManagerMaster
18/02/08 15:21:51 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/02/08 15:21:51 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/02/08 15:21:51 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-00930396-c78b-4931-8433-409ea44280ca
18/02/08 15:21:51 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
18/02/08 15:21:51 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/02/08 15:21:51 INFO util.log: Logging initialized @5920ms
18/02/08 15:21:52 INFO server.Server: jetty-9.2.z-SNAPSHOT
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7ff38263{/jobs,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4bf57335{/jobs/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f5ec388{/jobs/job,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@467746a2{/jobs/job/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@40be59d2{/stages,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@10fb0b33{/stages/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@519c49fa{/stages/stage,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6bbce5f1{/stages/stage/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e9c6879{/stages/pool,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e8f000c{/stages/pool/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e4c1b4b{/storage,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66940115{/storage/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7ed33e4f{/storage/rdd,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5e9ff595{/storage/rdd/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@57b439bb{/environment,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@793a50f8{/environment/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@639a07f5{/executors,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@158098e9{/executors/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2db6f406{/executors/threadDump,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@464ecd5c{/executors/threadDump/json,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f8c7713{/static,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7eddb166{/,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@ca9e09c{/api,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64d92842{/jobs/job/kill,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6ce238c7{/stages/stage/kill,null,AVAILABLE,@Spark}
18/02/08 15:21:52 INFO server.ServerConnector: Started Spark@7f3e9c42{HTTP/1.1}{0.0.0.0:4040}
18/02/08 15:21:52 INFO server.Server: Started @6335ms
18/02/08 15:21:52 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
18/02/08 15:21:52 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.154.134:4040
18/02/08 15:21:52 INFO spark.SparkContext: Added JAR file:/project/scala/SparkRDD/jarTest/wordcount.jar at spark://192.168.154.134:36034/jars/wordcount.jar with timestamp 1518074512324
18/02/08 15:21:52 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://master01:7077...
18/02/08 15:21:52 INFO client.TransportClientFactory: Successfully created connection to master01/192.168.154.130:7077 after 74 ms (0 ms spent in bootstraps)
18/02/08 15:21:53 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180208152153-0011
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20180208152153-0011/0 on worker-20180208121809-192.168.154.133-49922 (192.168.154.133:49922) with 4 cores
18/02/08 15:21:53 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20180208152153-0011/0 on hostPort 192.168.154.133:49922 with 4 cores, 1024.0 MB RAM
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20180208152153-0011/1 on worker-20180208121818-192.168.154.132-43679 (192.168.154.132:43679) with 4 cores
18/02/08 15:21:53 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20180208152153-0011/1 on hostPort 192.168.154.132:43679 with 4 cores, 1024.0 MB RAM
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20180208152153-0011/2 on worker-20180208121826-192.168.154.131-56071 (192.168.154.131:56071) with 4 cores
18/02/08 15:21:53 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20180208152153-0011/2 on hostPort 192.168.154.131:56071 with 4 cores, 1024.0 MB RAM
18/02/08 15:21:53 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35028.
18/02/08 15:21:53 INFO netty.NettyBlockTransferService: Server created on 192.168.154.134:35028
18/02/08 15:21:53 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/02/08 15:21:53 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.154.134, 35028, None)
18/02/08 15:21:53 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.154.134:35028 with 366.3 MB RAM, BlockManagerId(driver, 192.168.154.134, 35028, None)
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208152153-0011/0 is now RUNNING
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208152153-0011/2 is now RUNNING
18/02/08 15:21:53 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208152153-0011/1 is now RUNNING
18/02/08 15:21:53 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.154.134, 35028, None)
18/02/08 15:21:53 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.154.134, 35028, None)
18/02/08 15:21:53 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3589f0{/metrics/json,null,AVAILABLE,@Spark}
18/02/08 15:21:55 INFO scheduler.EventLoggingListener: Logging events to hdfs://ns1/sparkLog/app-20180208152153-0011
18/02/08 15:21:55 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/02/08 15:21:57 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 202.4 KB, free 366.1 MB)
18/02/08 15:21:57 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.8 KB, free 366.1 MB)
18/02/08 15:21:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.154.134:35028 (size: 23.8 KB, free: 366.3 MB)
18/02/08 15:21:57 INFO spark.SparkContext: Created broadcast 0 from textFile at WordCount.scala:46
18/02/08 15:21:58 INFO mapred.FileInputFormat: Total input paths to process : 2
18/02/08 15:21:58 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/08 15:21:58 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/08 15:21:58 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/08 15:21:58 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/08 15:21:58 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/08 15:21:58 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/08 15:21:59 INFO spark.SparkContext: Starting job: saveAsTextFile at WordCount.scala:64
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Registering RDD 3 (map at WordCount.scala:50)
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:64) with 2 output partitions
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at WordCount.scala:64)
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:50), which has no missing parents
18/02/08 15:21:59 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 366.1 MB)
18/02/08 15:21:59 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 366.1 MB)
18/02/08 15:21:59 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.134:35028 (size: 2.5 KB, free: 366.3 MB)
18/02/08 15:21:59 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
18/02/08 15:21:59 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:50)
18/02/08 15:21:59 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/02/08 15:22:07 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.133:42992) with ID 0
18/02/08 15:22:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.154.133, executor 0, partition 0, ANY, 6045 bytes)
18/02/08 15:22:07 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.154.133, executor 0, partition 1, ANY, 6047 bytes)
18/02/08 15:22:08 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.131:43045) with ID 2
18/02/08 15:22:08 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.154.133:36839 with 413.9 MB RAM, BlockManagerId(0, 192.168.154.133, 36839, None)
18/02/08 15:22:08 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.154.131:39804 with 413.9 MB RAM, BlockManagerId(2, 192.168.154.131, 39804, None)
18/02/08 15:22:09 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.132:50642) with ID 1
18/02/08 15:22:09 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.154.132:53076 with 413.9 MB RAM, BlockManagerId(1, 192.168.154.132, 53076, None)
18/02/08 15:22:10 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.133:36839 (size: 2.5 KB, free: 413.9 MB)
18/02/08 15:22:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.154.133:36839 (size: 23.8 KB, free: 413.9 MB)
18/02/08 15:22:19 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 11109 ms on 192.168.154.133 (executor 0) (1/2)
18/02/08 15:22:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 11280 ms on 192.168.154.133 (executor 0) (2/2)
18/02/08 15:22:19 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:50) finished in 19.217 s
18/02/08 15:22:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/08 15:22:19 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/02/08 15:22:19 INFO scheduler.DAGScheduler: running: Set()
18/02/08 15:22:19 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
18/02/08 15:22:19 INFO scheduler.DAGScheduler: failed: Set()
18/02/08 15:22:19 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:64), which has no missing parents
18/02/08 15:22:19 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 74.2 KB, free 366.0 MB)
18/02/08 15:22:19 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 27.0 KB, free 366.0 MB)
18/02/08 15:22:19 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.154.134:35028 (size: 27.0 KB, free: 366.2 MB)
18/02/08 15:22:19 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
18/02/08 15:22:19 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at WordCount.scala:64)
18/02/08 15:22:19 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/02/08 15:22:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.154.133, executor 0, partition 0, NODE_LOCAL, 5819 bytes)
18/02/08 15:22:19 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.154.133, executor 0, partition 1, NODE_LOCAL, 5819 bytes)
18/02/08 15:22:19 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.154.133:36839 (size: 27.0 KB, free: 413.9 MB)
18/02/08 15:22:19 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.154.133:42992
18/02/08 15:22:19 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 155 bytes
18/02/08 15:22:21 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 1896 ms on 192.168.154.133 (executor 0) (1/2)
18/02/08 15:22:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1933 ms on 192.168.154.133 (executor 0) (2/2)
18/02/08 15:22:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/02/08 15:22:21 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTextFile at WordCount.scala:64) finished in 1.938 s
18/02/08 15:22:21 INFO scheduler.DAGScheduler: Job 0 finished: saveAsTextFile at WordCount.scala:64, took 22.116753 s
18/02/08 15:22:21 INFO server.ServerConnector: Stopped Spark@7f3e9c42{HTTP/1.1}{0.0.0.0:4040}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6ce238c7{/stages/stage/kill,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@64d92842{/jobs/job/kill,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@ca9e09c{/api,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7eddb166{/,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f8c7713{/static,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@464ecd5c{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2db6f406{/executors/threadDump,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@158098e9{/executors/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@639a07f5{/executors,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@793a50f8{/environment/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@57b439bb{/environment,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5e9ff595{/storage/rdd/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7ed33e4f{/storage/rdd,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@66940115{/storage/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4e4c1b4b{/storage,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@e8f000c{/stages/pool/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3e9c6879{/stages/pool,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6bbce5f1{/stages/stage/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@519c49fa{/stages/stage,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@10fb0b33{/stages/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@40be59d2{/stages,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@467746a2{/jobs/job/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f5ec388{/jobs/job,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4bf57335{/jobs/json,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7ff38263{/jobs,null,UNAVAILABLE,@Spark}
18/02/08 15:22:21 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.154.134:4040
18/02/08 15:22:21 INFO cluster.StandaloneSchedulerBackend: Shutting down all executors
18/02/08 15:22:21 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/02/08 15:22:21 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/08 15:22:21 INFO memory.MemoryStore: MemoryStore cleared
18/02/08 15:22:21 INFO storage.BlockManager: BlockManager stopped
18/02/08 15:22:21 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
18/02/08 15:22:21 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/08 15:22:21 INFO spark.SparkContext: Successfully stopped SparkContext
18/02/08 15:22:21 INFO util.ShutdownHookManager: Shutdown hook called
18/02/08 15:22:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b652f8de-76a7-467f-b64d-3e493dfb2195
The console output above is what the run looks like in Xshell.
4.3 View the results
[hadoop@master02 install]$ hdfs dfs -ls /test/scala/
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2018-02-07 16:25 /test/scala/input
drwxr-xr-x - hadoop supergroup 0 2018-02-08 15:22 /test/scala/output
[hadoop@master02 install]$ hdfs dfs -ls /test/scala/output
Found 3 items
-rw-r--r-- 3 hadoop supergroup 0 2018-02-08 15:22 /test/scala/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 134 2018-02-08 15:22 /test/scala/output/part-00000
-rw-r--r-- 3 hadoop supergroup 70 2018-02-08 15:22 /test/scala/output/part-00001
[hadoop@master02 install]$ hdfs dfs -cat /test/scala/output/part-00000
(scala,2)
(zuo,2)
(tian,4)
(shi,8)
(ce,4)
(zai,2)
(欢迎,4)
(wo,2)
(zhnag,2)
(san,2)
(welcome,2)
(yi,8)
(ge,8)
(hao,4)
(qi,2)
(yu,2)
[hadoop@master02 install]$ hdfs dfs -cat /test/scala/output/part-00001
(guan,2)
(jin,2)
(ren,2)
(de,2)
(le,2)
(to,2)
(zhe,2)
(li,2)
(mmzs,2)
[hadoop@master02 install]$
The listings above confirm that the output data has been generated on the HDFS cluster.
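Instead of hdfs dfs -cat, the saved counts can also be read back for a quick programmatic check. The snippet below is only a hedged sketch: it assumes the same output path, the (word,count) text format shown above, and an existing SparkContext named sc (for example inside spark-shell):
// Parse the "(word,count)" lines written by saveAsTextFile back into pairs.
val saved = sc.textFile("/test/scala/output")
  .map(_.stripPrefix("(").stripSuffix(")"))
  .map { line =>
    val Array(word, count) = line.split(",", 2)
    (word, count.toInt)
  }
saved.filter(_._2 == 8).collect().foreach(println) // expect (shi,8), (yi,8), (ge,8)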
5. Notes:
When submitting a job to the cluster, it is best not to test directly from the Eclipse project: that is too unpredictable and prone to exceptions. If you do want to test from Eclipse, set the master to submit to:
//build the Spark configuration
val conf:SparkConf = new SparkConf();
//cluster test mode
conf.setMaster("spark://master01:7077");
//set the application name
conf.setAppName("Hdfs Scala Spark RDD");
//create the Spark context from the configuration
val sc:SparkContext = new SparkContext(conf);
Because the job also performs HDFS file operations and therefore has to connect to HDFS, copy the Hadoop configuration files into the project's source root:
[hadoop@CloudDeskTop software]$ cd hadoop-2.7.3/etc/hadoop/
[hadoop@CloudDeskTop hadoop]$ cp -a core-site.xml hdfs-site.xml /project/scala/SparkTest/src/
After the steps above you can test directly in Eclipse, but in practice submitting a job to the cluster from the IDE throws many exceptions (for example, the scala.collection.immutable.List class-cast exception shown in the log below).
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/02/08 13:23:13 INFO SparkContext: Running Spark version 2.1.1
18/02/08 13:23:13 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
18/02/08 13:23:13 INFO SecurityManager: Changing view acls to: hadoop
18/02/08 13:23:13 INFO SecurityManager: Changing modify acls to: hadoop
18/02/08 13:23:13 INFO SecurityManager: Changing view acls groups to:
18/02/08 13:23:13 INFO SecurityManager: Changing modify acls groups to:
18/02/08 13:23:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/02/08 13:23:14 INFO Utils: Successfully started service 'sparkDriver' on port 33230.
18/02/08 13:23:14 INFO SparkEnv: Registering MapOutputTracker
18/02/08 13:23:14 INFO SparkEnv: Registering BlockManagerMaster
18/02/08 13:23:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/02/08 13:23:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/02/08 13:23:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7fe67308-7d14-407b-ab2e-a45de1072134
18/02/08 13:23:14 INFO MemoryStore: MemoryStore started with capacity 348.0 MB
18/02/08 13:23:14 INFO SparkEnv: Registering OutputCommitCoordinator
18/02/08 13:23:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/02/08 13:23:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.154.134:4040
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://master01:7077...
18/02/08 13:23:15 INFO TransportClientFactory: Successfully created connection to master01/192.168.154.130:7077 after 55 ms (0 ms spent in bootstraps)
18/02/08 13:23:15 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180208132316-0010
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180208132316-0010/0 on worker-20180208121809-192.168.154.133-49922 (192.168.154.133:49922) with 4 cores
18/02/08 13:23:15 INFO StandaloneSchedulerBackend: Granted executor ID app-20180208132316-0010/0 on hostPort 192.168.154.133:49922 with 4 cores, 1024.0 MB RAM
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180208132316-0010/1 on worker-20180208121818-192.168.154.132-43679 (192.168.154.132:43679) with 4 cores
18/02/08 13:23:15 INFO StandaloneSchedulerBackend: Granted executor ID app-20180208132316-0010/1 on hostPort 192.168.154.132:43679 with 4 cores, 1024.0 MB RAM
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180208132316-0010/2 on worker-20180208121826-192.168.154.131-56071 (192.168.154.131:56071) with 4 cores
18/02/08 13:23:15 INFO StandaloneSchedulerBackend: Granted executor ID app-20180208132316-0010/2 on hostPort 192.168.154.131:56071 with 4 cores, 1024.0 MB RAM
18/02/08 13:23:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49476.
18/02/08 13:23:15 INFO NettyBlockTransferService: Server created on 192.168.154.134:49476
18/02/08 13:23:15 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208132316-0010/1 is now RUNNING
18/02/08 13:23:15 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208132316-0010/0 is now RUNNING
18/02/08 13:23:15 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.154.134, 49476, None)
18/02/08 13:23:16 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180208132316-0010/2 is now RUNNING
18/02/08 13:23:16 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.134:49476 with 348.0 MB RAM, BlockManagerId(driver, 192.168.154.134, 49476, None)
18/02/08 13:23:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.154.134, 49476, None)
18/02/08 13:23:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.154.134, 49476, None)
18/02/08 13:23:16 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/02/08 13:23:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 199.5 KB, free 347.8 MB)
18/02/08 13:23:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.5 KB, free 347.8 MB)
18/02/08 13:23:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.154.134:49476 (size: 23.5 KB, free: 348.0 MB)
18/02/08 13:23:18 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:46
18/02/08 13:23:19 INFO FileInputFormat: Total input paths to process : 2
18/02/08 13:23:20 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/08 13:23:20 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/08 13:23:20 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/08 13:23:20 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/08 13:23:20 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/08 13:23:20 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/08 13:23:21 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:64
18/02/08 13:23:21 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:50)
18/02/08 13:23:21 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:64) with 2 output partitions
18/02/08 13:23:21 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at WordCount.scala:64)
18/02/08 13:23:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/02/08 13:23:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/02/08 13:23:21 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:50), which has no missing parents
18/02/08 13:23:21 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 347.8 MB)
18/02/08 13:23:21 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 347.8 MB)
18/02/08 13:23:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.134:49476 (size: 2.5 KB, free: 348.0 MB)
18/02/08 13:23:21 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
18/02/08 13:23:21 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:50)
18/02/08 13:23:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/02/08 13:23:33 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.131:58462) with ID 2
18/02/08 13:23:33 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.154.131, executor 2, partition 0, ANY, 5987 bytes)
18/02/08 13:23:33 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.154.131, executor 2, partition 1, ANY, 5989 bytes)
18/02/08 13:23:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.131:39801 with 413.9 MB RAM, BlockManagerId(2, 192.168.154.131, 39801, None)
18/02/08 13:23:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.133:50331) with ID 0
18/02/08 13:23:35 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.133:54974 with 413.9 MB RAM, BlockManagerId(0, 192.168.154.133, 54974, None)
18/02/08 13:23:36 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.131:39801 (size: 2.5 KB, free: 413.9 MB)
18/02/08 13:23:36 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.154.132:39248) with ID 1
18/02/08 13:23:37 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.132:51259 with 413.9 MB RAM, BlockManagerId(1, 192.168.154.132, 51259, None)
18/02/08 13:23:37 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 192.168.154.131, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
18/02/08 13:23:37 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, 192.168.154.133, executor 0, partition 1, ANY, 5989 bytes)
18/02/08 13:23:37 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on 192.168.154.131, executor 2: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 1]
18/02/08 13:23:37 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, 192.168.154.133, executor 0, partition 0, ANY, 5987 bytes)
18/02/08 13:23:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.133:54974 (size: 2.5 KB, free: 413.9 MB)
18/02/08 13:23:38 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on 192.168.154.133, executor 0: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 2]
18/02/08 13:23:38 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 4, 192.168.154.131, executor 2, partition 0, ANY, 5987 bytes)
18/02/08 13:23:38 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on 192.168.154.133, executor 0: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 3]
18/02/08 13:23:38 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 5, 192.168.154.131, executor 2, partition 1, ANY, 5989 bytes)
18/02/08 13:23:38 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 4) on 192.168.154.131, executor 2: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 4]
18/02/08 13:23:38 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, 192.168.154.131, executor 2, partition 0, ANY, 5987 bytes)
18/02/08 13:23:38 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 5) on 192.168.154.131, executor 2: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 5]
18/02/08 13:23:38 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 7, 192.168.154.131, executor 2, partition 1, ANY, 5989 bytes)
18/02/08 13:23:38 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on 192.168.154.131, executor 2: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 6]
18/02/08 13:23:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
18/02/08 13:23:39 INFO TaskSchedulerImpl: Cancelling stage 0
18/02/08 13:23:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/08 13:23:39 INFO TaskSchedulerImpl: Stage 0 was cancelled
18/02/08 13:23:39 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on 192.168.154.131, executor 2: java.lang.ClassCastException (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 7]
18/02/08 13:23:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/08 13:23:39 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:50) failed in 17.103 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.154.131, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
18/02/08 13:23:39 INFO DAGScheduler: Job 0 failed: saveAsTextFile at WordCount.scala:64, took 17.644346 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.154.131, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1226)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1168)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1037)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:963)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1489)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1468)
at com.mmzs.bigdata.spark.rdd.cluster.WordCount$.main(WordCount.scala:64)
at com.mmzs.bigdata.spark.rdd.cluster.WordCount.main(WordCount.scala)
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
18/02/08 13:23:39 INFO SparkContext: Invoking stop() from shutdown hook
18/02/08 13:23:39 INFO SparkUI: Stopped Spark web UI at http://192.168.154.134:4040
18/02/08 13:23:39 INFO StandaloneSchedulerBackend: Shutting down all executors
18/02/08 13:23:39 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/02/08 13:23:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/08 13:23:39 INFO MemoryStore: MemoryStore cleared
18/02/08 13:23:39 INFO BlockManager: BlockManager stopped
18/02/08 13:23:39 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/08 13:23:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/08 13:23:39 INFO SparkContext: Successfully stopped SparkContext
18/02/08 13:23:39 INFO ShutdownHookManager: Shutdown hook called
18/02/08 13:23:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-0c6925cb-bf3e-4616-94fb-a588da99dee4
The exception log above was thrown during the author's run from Eclipse.
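The ClassCastException above (scala.collection.immutable.List$SerializationProxy cannot be assigned to the RDD dependencies field) is the typical symptom of the executors not having the job's classes on their classpath when the driver is launched from the IDE instead of through spark-submit, which normally ships the jar for you. A workaround that is often suggested, given here only as a hedged sketch rather than the author's verified fix, is to build the jar first (as in section 4.1) and register it on the SparkConf so the executors can fetch it:
import org.apache.spark.{SparkConf, SparkContext}

// Submit to the standalone cluster from the IDE while shipping the pre-built
// job jar to the executors; paths and the master URL follow the earlier sections.
val conf: SparkConf = new SparkConf()
conf.setMaster("spark://master01:7077")
conf.setAppName("Hdfs Scala Spark RDD")
// without this the executors cannot load the job classes and fail with the
// List$SerializationProxy ClassCastException shown above
conf.setJars(Seq("/project/scala/SparkRDD/jarTest/wordcount.jar"))
val sc: SparkContext = new SparkContext(conf)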