We use Redis on Spark to cache our key-value pairs. This is the code:

import com.redis.RedisClient
val r = new RedisClient("192.168.1.101", 6379)
val perhit = perhitFile.map(x => {
  val arr = x.split(" ")
  val readId = arr(0).toInt
  val refId = arr(1).toInt
  val start = arr(2).toInt
  val end = arr(3).toInt
  val refStr = r.hmget("refStr", refId).get(refId).split(",")(1)
  val readStr = r.hmget("readStr", readId).get(readId)
  val realend = if (end > refStr.length - 1) refStr.length - 1 else end
  val refOneStr = refStr.substring(start, realend)
  (readStr, refOneStr, refId, start, realend, readId)
})

But when I run it, Spark gives me this error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at com.ynu.App$.main(App.scala:511)
at com.ynu.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: com.redis.RedisClient
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 12 more

Could somebody tell me how to serialize the data I get from Redis? Thanks a lot.

asked Jan 18 '15 at 2:18
fanhk


2 Answers

In Spark, the functions on RDDs (like map here) are serialized and sent to the executors for processing. This implies that everything referenced inside those functions must be serializable.

The Redis connection here is not serializable as it opens TCP connections to the target DB that are bound to the machine where it's created.

The solution is to create those connections on the executors, in the local execution context. There are a few ways to do that. Two that come to mind are:

  • rdd.mapPartitions: lets you process a whole partition at once, and therefore amortize the cost of creating connections
  • Singleton connection managers: Create the connection once per executor

mapPartitions is easier as all it requires is a small change to the program structure:

val perhit = perhitFile.mapPartitions { partition =>
  val r = new RedisClient("192.168.1.101", 6379) // create the connection in the context of the mapPartitions operation
  val res = partition.map { x =>
    ...
    val refStr = r.hmget(...) // use r to process the local data
  }.toList // materialize before closing: the iterator returned by map is lazy
  r.close // take care of resources
  res.iterator
}

A singleton connection manager can be modeled with an object that holds a lazy reference to a connection (note: a mutable ref will also work).

object RedisConnection extends Serializable {
  lazy val conn: RedisClient = new RedisClient("192.168.1.101", 6379)
}

This object can then be used to instantiate one connection per worker JVM, and it can be referenced as a Serializable object from an operation's closure.

val perhit = perhitFile.map { x =>
  val param = f(x)
  val refStr = RedisConnection.conn.hmget(...) // use RedisConnection to get a connection to the local data
}

The advantage of using the singleton object is less overhead, as connections are created only once per JVM (as opposed to once per RDD partition).

There are also some disadvantages:

  • cleanup of connections is tricky (shutdown hook/timers); a sketch follows below
  • one must ensure thread-safety of shared resources
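
For the cleanup part, one option (illustrative only, and assuming the scala-redis client mentioned in the question, where the method that releases the connection is disconnect) is to register a JVM shutdown hook when the lazy connection is first created:

object RedisConnection extends Serializable {
  // Lazily create the connection and register a shutdown hook so the
  // executor JVM releases it on exit (scala-redis calls this disconnect).
  lazy val conn: RedisClient = {
    val c = new RedisClient("192.168.1.101", 6379)
    sys.addShutdownHook { c.disconnect }
    c
  }
}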

(*) code provided for illustration purposes. Not compiled or tested.

answered Jan 19 '15 at 12:00
maasg

    
Thank you for answering! I use this library github.com/debasishg/scala-redis. It doesn't have a method named "close"; instead, it is "disconnect". I've no idea if it works. Could you tell me which library you are using now to deal with Redis data? – fanhk Jan 20 '15 at 4:33
    
Plus 1 for the Singleton solution. Can you give an example of how to manage the closing of the connection? – Sohaib Dec 4 '15 at 11:11
    
@Sohaib given this is a VM-bound object, you'll need to register a shutdown hook to cleanly close connections. – maasg Dec 11 '15 at 9:06
 

You're trying to serialize the client. You have one RedisClient, r, that you're trying to use inside the map that will be run across different cluster nodes. Either get the data you want out of Redis separately before doing a cluster task, or create the client individually for each cluster task inside your map block (perhaps by using mapPartitions rather than map, as creating a new Redis client for each individual row is probably a bad idea).
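
For illustration only (not compiled or tested), the mapPartitions variant applied to the code in the question would look roughly like this, assuming the scala-redis client, where the method that releases the connection is called disconnect: the client is created once per partition on the executor, reused for every line in that partition, and disconnected once the partition's rows have been materialized.

val perhit = perhitFile.mapPartitions { partition =>
  val r = new RedisClient("192.168.1.101", 6379) // one client per partition, created on the executor
  val res = partition.map { x =>
    val arr = x.split(" ")
    val readId = arr(0).toInt
    val refId = arr(1).toInt
    val start = arr(2).toInt
    val end = arr(3).toInt
    val refStr = r.hmget("refStr", refId).get(refId).split(",")(1)
    val readStr = r.hmget("readStr", readId).get(readId)
    val realend = if (end > refStr.length - 1) refStr.length - 1 else end
    (readStr, refStr.substring(start, realend), refId, start, realend, readId)
  }.toList // materialize before disconnecting, since the iterator returned by map is lazy
  r.disconnect
  res.iterator
}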

answered Jan 18 '15 at 8:42
lmm

    
Thank you for answering, but could you tell me how to use mapPartitions in this situation? – fanhk Jan 18 '15 at 11:49
    
Call mapPartitions passing a block that accepts an iterable (as you can see from the signature of mapPartitions), creates the RedisClient inside the block, and then uses it to map the Iterable as you were doing. The point is that the RedisClient gets created inside the processing for a single partition. What did you try and where did you get stuck? – lmm Jan 19 '15 at 14:57
    
Problem solved, thank you! – fanhk Jan 20 '15 at 4:42
