Eclipse提交代码到Spark集群上运行
Spark集群master节点: 192.168.168.200
Eclipse运行windows主机: 192.168.168.100
场景:
为了测试在Eclipse上开发的代码在Spark集群上运行的情况,比如:内存、cores、stdout以及相应的变量传递是否正常!
生产环境是把在Eclipse上开发的代码打包放到Spark集群上,然后使用spark-submit提交运行。当然我们也可以启动远程调试,
但是这样就会造成每次测试代码,我们都需要把jar包复制到Spark集群机器上,十分的不方便。因此,我们希望能够在Eclipse直接
模拟spark-submit提交程序运行,便于调试!
一、准备words.txt文件
words.txt :
- HelloHadoop
- HelloBigData
- HelloSpark
- HelloFlume
- HelloKafka
上传到HDFS文件系统中,如图:
二、创建Spark测试类
- package com.spark.test;
- import java.util.Arrays;
- import java.util.Iterator;
- import org.apache.spark.SparkConf;
- import org.apache.spark.api.java.JavaPairRDD;
- import org.apache.spark.api.java.JavaRDD;
- import org.apache.spark.api.java.JavaSparkContext;
- import org.apache.spark.api.java.function.FlatMapFunction;
- import org.apache.spark.api.java.function.Function2;
- import org.apache.spark.api.java.function.PairFunction;
- import org.apache.spark.api.java.function.VoidFunction;
- import scala.Tuple2;
- publicclassJavaWordCount{
- publicstaticvoid main(String[] args){
- SparkConf sparkConf =newSparkConf().setAppName("JavaWordCount").setMaster("local[2]");
- JavaSparkContext jsc =newJavaSparkContext(sparkConf);
- JavaRDD<String> lines = jsc.textFile("hdfs://192.168.168.200:9000/test/words.txt");
- JavaRDD<String> words = lines.flatMap(newFlatMapFunction<String,String>(){
- publicIterator<String> call(String line){
- returnArrays.asList(line.split(" ")).iterator();
- }
- });
- JavaPairRDD<String,Integer> pairs = words.mapToPair(newPairFunction<String,String,Integer>(){
- publicTuple2<String,Integer> call(String word)throwsException{
- returnnewTuple2<String,Integer>(word,1);
- }
- });
- JavaPairRDD<String,Integer> wordCount = pairs.reduceByKey(newFunction2<Integer,Integer,Integer>(){
- publicInteger call(Integer v1,Integer v2)throwsException{
- return v1 + v2;
- }
- });
- wordCount.foreach(newVoidFunction<Tuple2<String,Integer>>(){
- publicvoid call(Tuple2<String,Integer> pairs)throwsException{
- System.out.println(pairs._1()+":"+ pairs._2());
- }
- });
- jsc.close();
- }
- }
日志输出:
访问spark的web ui : http://192.168.168.200:8080
从中看出spark的master地址为: spark://master:7077
将
- SparkConf sparkConf =newSparkConf().setAppName("JavaWordCount").setMaster("local[2]");
修改为:
- SparkConf sparkConf =newSparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
运行,发现会有报org.apache.spark.SparkException的错:
- Exceptionin thread "main" org.apache.spark.SparkException:Job aborted due to stage failure:Task1in stage 0.0 failed 4 times, most recent failure:Lost task 1.3in stage 0.0(TID 6,192.168.168.200): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seqin instance of org.apache.spark.rdd.MapPartitionsRDD
- at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
- at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
- at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
- at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
- at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
- at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
- at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
- at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
- at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
- at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
- at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
- at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
- at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
- at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
- at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
- at org.apache.spark.scheduler.Task.run(Task.scala:86)
- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
- at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
- at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
- at java.lang.Thread.run(Thread.java:745)
- Driver stacktrace:
- at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
- at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
- at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
- at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
- at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
- at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
- at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
- at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
- at scala.Option.foreach(Option.scala:257)
- at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
- at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
- at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
- at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
- at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
- at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
- at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
- at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
- at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
- at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
- at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:894)
- at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:892)
- at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
- at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
- at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
- at org.apache.spark.rdd.RDD.foreach(RDD.scala:892)
- at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:350)
- at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
- at com.spark.test.JavaWordCount.main(JavaWordCount.java:39)
- Causedby: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seqin instance of org.apache.spark.rdd.MapPartitionsRDD
- at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
- at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
- at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
- at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
- at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
- at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
- at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
- at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
- at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
- at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
- at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
- at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
- at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
- at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
- at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
- at org.apache.spark.scheduler.Task.run(Task.scala:86)
- at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
- at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
- at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
- at java.lang.Thread.run(Thread.java:745)
在网上找到的解决办法是配置jar包的路径即可,先用maven install把程序打包成jar包,然后setJars方法。
- SparkConf sparkConf =newSparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
- String[] jars ={"I:\\TestSpark\\target\\TestSpark-0.0.1-jar-with-dependencies.jar"};
- sparkConf.setJars(jars);
最终源码如下:
- package com.spark.test;
- import java.util.Arrays;
- import java.util.Iterator;
- import org.apache.spark.SparkConf;
- import org.apache.spark.api.java.JavaPairRDD;
- import org.apache.spark.api.java.JavaRDD;
- import org.apache.spark.api.java.JavaSparkContext;
- import org.apache.spark.api.java.function.FlatMapFunction;
- import org.apache.spark.api.java.function.Function2;
- import org.apache.spark.api.java.function.PairFunction;
- import org.apache.spark.api.java.function.VoidFunction;
- import scala.Tuple2;
- publicclassJavaWordCount{
- publicstaticvoid main(String[] args){
- SparkConf sparkConf =newSparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
- String[] jars ={"I:\\TestSpark\\target\\TestSpark-0.0.1-jar-with-dependencies.jar"};
- sparkConf.setJars(jars);
- JavaSparkContext jsc =newJavaSparkContext(sparkConf);
- JavaRDD<String> lines = jsc.textFile("hdfs://192.168.168.200:9000/test/words.txt");
- JavaRDD<String> words = lines.flatMap(newFlatMapFunction<String,String>(){
- publicIterator<String> call(String line){
- returnArrays.asList(line.split(" ")).iterator();
- }
- });
- JavaPairRDD<String,Integer> pairs = words.mapToPair(newPairFunction<String,String,Integer>(){
- publicTuple2<String,Integer> call(String word)throwsException{
- returnnewTuple2<String,Integer>(word,1);
- }
- });
- JavaPairRDD<String,Integer> wordCount = pairs.reduceByKey(newFunction2<Integer,Integer,Integer>(){
- publicInteger call(Integer v1,Integer v2)throwsException{
- return v1 + v2;
- }
- });
- wordCount.foreach(newVoidFunction<Tuple2<String,Integer>>(){
- publicvoid call(Tuple2<String,Integer> pairs)throwsException{
- System.out.println(pairs._1()+":"+ pairs._2());
- }
- });
- jsc.close();
- }
- }
运行正常,没有出现报错!
查看stdout是否统计正确:
至此,你可以很方便的在Eclipse上开发调试你的代码啦!
Eclipse提交代码到Spark集群上运行的更多相关文章
- 将java开发的wordcount程序提交到spark集群上运行
今天来分享下将java开发的wordcount程序提交到spark集群上运行的步骤. 第一个步骤之前,先上传文本文件,spark.txt,然用命令hadoop fs -put spark.txt /s ...
- [Spark Core] 在 Spark 集群上运行程序
0. 说明 将 IDEA 下的项目导出为 Jar 包,部署到 Spark 集群上运行. 1. 打包程序 1.0 前提 搭建好 Spark 集群,完成代码的编写. 1.1 修改代码 [添加内容,判断参数 ...
- IntelliJ IDEA编写的spark程序在远程spark集群上运行
准备工作 需要有三台主机,其中一台主机充当master,另外两台主机分别为slave01,slave02,并且要求三台主机处于同一个局域网下 通过命令:ifconfig 可以查看主机的IP地址,如下图 ...
- 有关python numpy pandas scipy 等 能在YARN集群上 运行PySpark
有关这个问题,似乎这个在某些时候,用python写好,且spark没有响应的算法支持, 能否能在YARN集群上 运行PySpark方式, 将python分析程序提交上去? Spark Applicat ...
- Spark学习之在集群上运行Spark
一.简介 Spark 的一大好处就是可以通过增加机器数量并使用集群模式运行,来扩展程序的计算能力.好在编写用于在集群上并行执行的 Spark 应用所使用的 API 跟本地单机模式下的完全一样.也就是说 ...
- 在集群上运行Spark
Spark 可以在各种各样的集群管理器(Hadoop YARN.Apache Mesos,还有Spark 自带的独立集群管理器)上运行,所以Spark 应用既能够适应专用集群,又能用于共享的云计算环境 ...
- 06、部署Spark程序到集群上运行
06.部署Spark程序到集群上运行 6.1 修改程序代码 修改文件加载路径 在spark集群上执行程序时,如果加载文件需要确保路径是所有节点能否访问到的路径,因此通常是hdfs路径地址.所以需要修改 ...
- MapReduce编程入门实例之WordCount:分别在Eclipse和Hadoop集群上运行
上一篇博文如何在Eclipse下搭建Hadoop开发环境,今天给大家介绍一下如何分别分别在Eclipse和Hadoop集群上运行我们的MapReduce程序! 1. 在Eclipse环境下运行MapR ...
- Spark学习之在集群上运行Spark(6)
Spark学习之在集群上运行Spark(6) 1. Spark的一个优点在于可以通过增加机器数量并使用集群模式运行,来扩展程序的计算能力. 2. Spark既能适用于专用集群,也可以适用于共享的云计算 ...
随机推荐
- 2D转换下的zoom和transform:scale的区别
一.什么是zoom 在我们做项目和查看别人的网页的时候总会在一些元素的样式里,看到有一个家伙孤零零的待在那里,它到底是谁呢? 它的名字叫zoom,zoom的意思是“变焦”,虽然在摄影的领域经常被提到, ...
- pyx文件 生成pyd 文件用于 cython调用
转于:https://www.2cto.com/kf/201405/304168.html 1. 初衷 最近学用python,python不愧是为程序员考虑的编程语言,写起来很快很方便,大大节省开发效 ...
- HTTP基本原理(转)
1. HTTP简介 HTTP协议(HyperText Transfer Protocol,超文本传输协议)是用于从WWW服务器传输超文本到本地浏览器的传送协议.它可以使浏览器更加高效,使网络传输减少. ...
- Git版本退回和修改
首先我们来看看我们的第一个版本: 我的git文件如下: 那我们来修改一下这个文件 然后提交 那我们来查看一下提交的记录:使用git log 当我们使用 git log --pretty=oneline ...
- 服务器上安装caffe的过程记录
1. 前言 因为新的实验室东西都是新的,所以在服务器上要自己重新配置CAFFE 这里假设所有依赖包学长们都安装好了,我是没有sudo权限的 服务器的配置: CUDA 8.0 Ubuntu 16.04 ...
- run
和配置块不同,运行块在注入器创建之后被执行,它是所有AngularJS应用中第一个被执行的方法运行块通常用来注册全局的事件监听器.例如,我们会在.run()块中设置路由事件的监听器以及过滤未经授权的请 ...
- vmware NAT 网络出现问题的解决方法
nat 网络的配制: 用nat网络相对来说,以后网段出现改变,都不会有大的影响,方便以后做实验. 用dhclient 来获取IP dhclent -r 是用来关闭获取IP的. 自动获取IP,也可以改成 ...
- net框架平台下RPC框架选型
net RPC框架选型 近期开始研究分布式架构,会涉及到一个最核心的组件:RPC(Remote Procedure Call Protocol).这个东西的稳定性与性能,直接决定了分布式架构系统的好坏 ...
- django 一个关于分组查询的问题分析
ret=Emp.objects.values('dep').annotate(avg_salary=Avg('salary')) print(ret) # ---*******单表分组查询ORM总结: ...
- hdu 1556 A - Color the ball 数状数组做法
#include<bits/stdc++.h> using namespace std; ; int n; int c[maxn]; int lowbit(int x) { return ...