spark-submit python egg 解决三方件依赖问题

bonelee 2024-10-21 20:23:35 原文

假设spark里用到了purl这个三方件，https://github.com/ultrabluewolf/p.url，他还额外依赖futures这个三方件（six的话，anaconda2自带）。

pyspark 代码如下：

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My test App")

sc = SparkContext(conf=conf)

#from purl import Purl

def get_purl(x):

    from purl import Purl

    url = Purl('https://github.com/search?q={}'.format(x))

    return str(url.add_query('name', 'dog'))

int_rdd = sc.parallelize([1, 2, 3, 4])

r =int_rdd.map(lambda x: get_purl(x))

print(r.collect())

下面说明如何编译打包egg。

通过https://pypi.org/project/p.url/#files 下载源码。然后解压：

python setup.py bdist_egg

在dist目录下可以看到有egg文件生成。

同理，下载https://pypi.org/project/future/#files futures的源码，然后解压生成egg文件。

最终运行：

spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py

结果输出：

['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']

补充官方文档，比较蛋疼，没有说具体操作：

Complex Dependencies

Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:

def import_pandas(x):

 import pandas

 return x

int_rdd = sc.parallelize([1, 2, 3, 4])

int_rdd.map(lambda x: import_pandas(x))

int_rdd.collect()

pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.

Limitations of Distributing Egg Files

In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which it will run. When doing distributed computing with industry-standard hardware, you must assume is that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.

spark-submit python egg 解决三方件依赖问题的更多相关文章

[Dynamic Language] pyspark Python3.7环境设置及py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe解决!
pyspark Python3.7环境设置及py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spa ...
spark submit参数及调优(转载)
spark submit参数介绍你可以通过spark-submit --help或者spark-shell --help来查看这些参数. 使用格式: ./bin/spark-submit \ -- ...
windows命令行模式下无法打开python程序解决方法
今天刚开始学Python,首先编写一个简单地hello world程序,想在命令行模式运行,结果出现下面: 经过一番思考,发现用cd命令可以解决这件事,看下图: 这样就解决了.
【原创】大数据基础之Spark（1）Spark Submit即Spark任务提交过程
Spark2.1.1 一 Spark Submit本地解析 1.1 现象提交命令: spark-submit --master local[10] --driver-memory 30g --cla ...
spark编程python实例
spark编程python实例 ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PyS ...
HarmonyOS三方件开发指南（12）——cropper图片裁剪
鸿蒙入门指南,小白速来!0基础学习路线分享,高效学习方法,重点答疑解惑--->[课程入口] 目录:1. cropper组件功能介绍2. cropper使用方法3. cropper组件开发实现4. ...
HarmonyOS三方件开发指南(13)-SwipeLayout侧滑删除
鸿蒙入门指南,小白速来!0基础学习路线分享,高效学习方法,重点答疑解惑--->[课程入口] 目录:1. SwipeLayout组件功能介绍2. SwipeLayout使用方法3. SwipeLa ...
HarmonyOS三方件开发指南(14)-Glide组件功能介绍
<HarmonyOS三方件开发指南>系列文章合集引言在实际应用开发中,会用到大量图片处理,如:网络图片.本地图片.应用资源.二进制流.Uri对象等,虽然官方提供了PixelMap进行图 ...
HarmonyOS三方件开发指南(15)-LoadingView功能介绍
目录: 1. LoadingView组件功能介绍2. Lottie使用方法3. Lottie开发实现4.<HarmonyOS三方件开发指南>系列文章合集 1. LoadingView组件功 ...

随机推荐

css3逐帧动画
写css3动画的时候,我们经常用到animation来实现,默认情况下,animation是属于连贯性的ease动画.我们熟悉的animation动画有ease.ease-in.ease-out.li ...
用ArcMap在PostgreSQL中创建要素类需要执行”create enterprise geodatabase”吗
问:用Acmap在PostgreSQL中创建要素类需要执行"create enterprise geodatabase"吗? 关于这个问题,是在为新员工做postgresql培训后 ...
ubuntu12下安装unixODBC（mysql）
转自:https://blog.51cto.com/dreamylights/1321678 1. 需要的包 unixODBC源码包unixODBC-2.2.14.tar.gz mysql 驱动 my ...
【转帖】处理器史话 | 当Power架构的发展之路遭遇“滑铁卢”
处理器史话 | 当Power架构的发展之路遭遇“滑铁卢” https://www.eefocus.com/mcu-dsp/366740 (8)Power8:决定了 Power 平台的未来发展 2014 ...
Nvidia Jetson TX2开发板学习历程（ 2 ）－更换pip源，提高下载速度
通过将pip的源更换为国内源,来提高下载速度,这也将成为今后学习过程下载Python包的基础,建议前期一定要完成! 知名的国内源清华:https://pypi.tuna.tsinghua.edu.c ...
18年10月 python 中出现 ValueError: need more than 1 value to unpack 解决办法（笨办法）
eg:a,b = argv :错误,我的理解也许不正确,但是能解决办法 a,b= argv,argv 正确 :经测试不会出现错误. ------------------------------ ...
Vue.js 2.x render 渲染函数 & JSX
Vue.js 2.x render 渲染函数 & JSX Vue绝大多数情况下使用template创建 HTML.但是比如一些重复性比较高的场景,需要运用 JavaScript 的完全编程能力 ...
[洛谷P5323][BJOI2019]光线
题目大意:有$n$层玻璃,每层玻璃会让$a\%$的光通过,并把$b\%$的光反射.有一束光从左向右射过,问多少的光可以透过这$n$层玻璃题解:事实上会发现,可以把连续的几层玻璃合成一层玻璃,但是要注 ...
正则表达式"(^|&)" ,什么意思?
^匹配字符串开头,&就是&字符 (^|&)匹配字符串开头或者&字符,如果其后还有正则,那么必须出现在字符串开始或&字符之后用法一: 限定开头文档上给出了 ...
WPF 的 Application.Current.Dispatcher 中，为什么 Current 可能为 null
原文:WPF 的 Application.Current.Dispatcher 中,为什么 Current 可能为 null 在 WPF 程序中,可能会存在 Application.Current.D ...