Resolving third-party dependencies in spark-submit with Python eggs

bonelee 2024-10-21 20:23:35 (original post)

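Suppose a Spark job uses the third-party package purl (https://github.com/ultrabluewolf/p.url), which in turn depends on the third-party package future. It also needs six, but that ships with Anaconda2.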

The PySpark code (main_dep.py) is as follows:

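from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My test App")
sc = SparkContext(conf=conf)

# The module-level import is deliberately left commented out;
# purl is imported inside the function that runs on the executors.
#from purl import Purl

def get_purl(x):
    # Deferred import: resolved when the task runs, from the eggs
    # shipped with --py-files.
    from purl import Purl
    url = Purl('https://github.com/search?q={}'.format(x))
    return str(url.add_query('name', 'dog'))

int_rdd = sc.parallelize([1, 2, 3, 4])
r = int_rdd.map(lambda x: get_purl(x))
print(r.collect())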

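The following describes how to build and package the eggs.

Download the source from https://pypi.org/project/p.url/#files, extract it, and run the following in the extracted directory: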

python setup.py bdist_egg

An egg file is generated in the dist directory.

Likewise, download the source of future from https://pypi.org/project/future/#files, extract it, and build its egg the same way.
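Before submitting, it can be worth checking what actually ended up inside each egg. An egg is a zip archive, so the standard zipfile module can list it. The helper below is only a sketch: the name top_level_names is hypothetical and not part of setuptools or Spark, and it simply reports the top-level importable names it finds while skipping the EGG-INFO metadata directory.

```python
import os
import zipfile


def top_level_names(egg_path):
    """Return the sorted top-level package/module names found in an egg.

    Hypothetical helper for illustration; it only inspects the archive
    listing and does not import anything.
    """
    names = set()
    with zipfile.ZipFile(egg_path) as egg:
        for entry in egg.namelist():
            head, sep, _rest = entry.partition("/")
            if head == "EGG-INFO":
                continue  # packaging metadata, not importable code
            if sep:
                # Entry lives inside a directory: treat it as a package.
                names.add(head)
            else:
                # Top-level file: count it only if it is a Python module.
                stem, ext = os.path.splitext(head)
                if ext in (".py", ".pyc"):
                    names.add(stem)
    return sorted(names)
```

For example, calling top_level_names("dist/p.url-0.1.0a4-py2.7.egg") should show whether the purl package is present before the egg is handed to spark-submit.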

Finally, run:

spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py
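As an alternative to the --py-files flag, dependencies can also be registered from inside the driver script with SparkContext.addPyFile. The sketch below assumes the egg files sit in the working directory; attach_eggs is a hypothetical wrapper, and whether addPyFile handles eggs the same way as --py-files on a given Spark version should be verified before relying on it.

```python
def attach_eggs(sc, egg_paths):
    """Register each egg with the SparkContext so tasks can import from it.

    `sc` is expected to be a pyspark SparkContext; `egg_paths` is a list of
    paths to egg files. Hypothetical helper, shown as a sketch only.
    """
    for path in egg_paths:
        # addPyFile ships the file to executors for tasks run afterwards.
        sc.addPyFile(path)
```

In the script above this would be called right after creating sc, for example attach_eggs(sc, ["p.url-0.1.0a4-py2.7.egg", "future-0.17.1-py2.7.egg"]). Because the eggs are attached only after the driver has started, keeping the purl import inside get_purl rather than at module level matters in this variant.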

Output:

['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']

For reference, here is the relevant official documentation. Frustratingly, it does not describe the concrete steps:

Complex Dependencies

Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:

def import_pandas(x):
    import pandas
    return x

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x))
int_rdd.collect()

pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.

Limitations of Distributing Egg Files

In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
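One rough way to tell whether an egg falls under this limitation is to look for compiled extension files inside it. The sketch below is a heuristic under stated assumptions: native_files and the suffix list are hypothetical choices made for illustration, and an empty result does not prove that a package is pure Python.

```python
import zipfile

# File suffixes that usually indicate compiled, platform-specific code.
# This list is an assumption for illustration, not an exhaustive rule.
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def native_files(egg_path):
    """Return the sorted archive entries in an egg that look like native binaries."""
    with zipfile.ZipFile(egg_path) as egg:
        return sorted(
            name for name in egg.namelist()
            if name.lower().endswith(NATIVE_SUFFIXES)
        )
```

If native_files returns a non-empty list for an egg, that egg is tied to the platform it was built on, and installing the package on every cluster host is likely the safer route.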
