A PySpark MLlib pitfall: model predict + RDD map + zip. Be especially careful with zip!

Update: broadcast the model and use mapPartitions + flatMap instead of zip. See:

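from pyspark import SparkContext
import numpy as np
from sklearn import ensemble

def batch(xs):
    # Collect one whole partition into a single list so predict() runs once per partition.
    yield list(xs)

N = 1000
train_x = np.random.randn(N, 10)
train_y = np.random.binomial(1, 0.5, N)
# Fit an ordinary scikit-learn model on the driver.
model = ensemble.RandomForestClassifier(10).fit(train_x, train_y)

test_x = np.random.randn(N * 100, 10)

sc = SparkContext()
n_partitions = 10
# Attach an index to every row so each prediction can be matched to its input.
rdd = sc.parallelize(test_x, n_partitions).zipWithIndex()

# Ship the fitted model to the workers once.
b_model = sc.broadcast(model)

# Per partition: split into (rows, indices), predict in one batch, emit (index, prediction).
result = rdd.mapPartitions(batch) \
    .map(lambda xs: ([x[0] for x in xs], [x[1] for x in xs])) \
    .flatMap(lambda x: zip(x[1], b_model.value.predict(x[0])))

print(result.take(100))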

see: https://gist.github.com/lucidfrontier45/591be3eb78557d1844ca

----------------------

It all started because in PySpark you cannot call model predict directly inside map, although in Scala you can! As follows:

When we use the Scala API, a recommended way of getting predictions for RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD:

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Unfortunately, a similar approach in PySpark does not work:

labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Instead, the official documentation recommends something like this:

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

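And this is the root of all evil: in some cases zip does not give you the result you want, that is, the order of the elements after zip is scrambled! I ran into exactly this in my project!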

This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?

See: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

Reason:

“

zip is generally speaking a tricky operation. It requires both RDDs not only to have the same number of partitions but also the same number of elements per partition.

Excluding some special cases this is guaranteed only if both RDDs have the same ancestor and there are not shuffles and operations potentially changing number of elements (filter, flatMap) between the common ancestor and the current state. Typically it means only map (1-to-1) transformations.

”

See: https://stackoverflow.com/questions/32084368/can-only-zip-with-rdd-which-has-the-same-number-of-partitions-error
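To make the quoted explanation concrete, here is a toy model in plain Python (no Spark involved). It assumes, as the answer above describes, that zip pairs partitions one-to-one and then pairs elements purely by position. The names zip_partitions, square and parent are made up for this illustration, and the "reordered" case is only a stand-in for a parent that comes back in a different order when it is evaluated a second time.

```python
from itertools import chain


def zip_partitions(left, right):
    """Toy model of a partition-wise zip: pair partitions, then pair elements by position."""
    if len(left) != len(right):
        raise ValueError("Can only zip RDDs with the same number of partitions")
    zipped = []
    for lp, rp in zip(left, right):
        if len(lp) != len(rp):
            raise ValueError(
                "Can only zip RDDs with the same number of elements in each partition")
        zipped.append(list(zip(lp, rp)))
    return zipped


def square(x):
    return x * x


# A hypothetical parent dataset split into two partitions.
parent = [[1, 2, 3], [4, 5]]

# Case 1: map-only lineage. Counts and order are preserved, so pairs line up.
mapped = [[square(x) for x in part] for part in parent]
aligned = list(chain.from_iterable(zip_partitions(parent, mapped)))

# Case 2: the second evaluation of the parent yields partition 0 in another order.
# The per-partition counts still match, so nothing fails, but the pairs are wrong.
reordered_parent = [[3, 1, 2], [4, 5]]
mapped_again = [[square(x) for x in part] for part in reordered_parent]
misaligned = list(chain.from_iterable(zip_partitions(parent, mapped_again)))

# Case 3: a filter changes the per-partition counts, so the zip is rejected.
filtered = [[x for x in part if x % 2 == 1] for part in parent]
try:
    zip_partitions(parent, filtered)
    error = None
except ValueError as exc:
    error = str(exc)

print(aligned)     # each x sits next to its own square
print(misaligned)  # x is paired with the square of a different element
print(error)
```

Case 2 is the nasty one: there is no exception, only silently mismatched pairs, which matches the symptom I saw in my project.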

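The root cause was that my ancestor RDD had gone through shuffle and filter operations! Using zip on its child RDDs then goes wrong (the data ends up out of order)! It was really frustrating; I wrestled with this problem for a whole day. Thank God it is finally solved!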

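In the end, my solutions were:

1. Apply a union to the RDD directly, rdd = rdd.union(sc.parallelize([])), and then map and zip produce the correct result!

2. Or collect the RDD to be predicted onto the driver machine and call model predict there, which is a rather ugly approach!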
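A third option, in the spirit of the update at the top of this post, is to avoid zip altogether by keeping each label (or index) next to its features and predicting once per partition. Below is a minimal sketch of the per-partition function in plain Python. predict_partition and ThresholdModel are hypothetical names for illustration; ThresholdModel only stands in for a locally fitted model with a batch predict(), such as a scikit-learn estimator.

```python
def predict_partition(model, rows):
    """Yield (label, prediction) pairs for one partition of (label, features) rows."""
    rows = list(rows)
    if not rows:
        # Empty partition: skip the predict() call entirely.
        return
    labels = [label for label, _ in rows]
    features = [feats for _, feats in rows]
    # Labels and predictions come from the same in-memory list, so they cannot drift apart.
    for label, pred in zip(labels, model.predict(features)):
        yield label, pred


class ThresholdModel:
    """Hypothetical stand-in for a fitted model exposing a batch predict()."""

    def predict(self, features):
        return [1 if sum(f) > 0 else 0 for f in features]


rows = [(1, [0.5, 0.2]), (0, [-1.0, 0.3]), (1, [2.0, -0.5])]
pairs = list(predict_partition(ThresholdModel(), iter(rows)))
empty = list(predict_partition(ThresholdModel(), iter([])))
print(pairs)
```

On a real RDD of (label, features) pairs this would presumably be used as rdd.mapPartitions(lambda it: predict_partition(b_model.value, it)), with b_model = sc.broadcast(model) as in the update above. Note that this only applies to models that can be shipped to the workers; an MLlib tree model raises the SparkContext exception shown earlier when predict is called inside a transformation.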
