When I write PySpark code, I use Jupyter notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.

A. Items needed

  • Spark distribution from spark.apache.org

  • The findspark Python module, which can be installed by running python -m pip install findspark either in Windows command prompt or Git bash if Python is installed in item 2. You can find command prompt by searching cmd in the search box.

  • If you don’t have Java or your Java version is 7.x or less, download and install Java from Oracle. I recommend getting the latest JDK (current version 9.0.1).

  • If you don’t know how to unpack a .tgz file on Windows, you can download and install 7-zip on Windows to unpack the .tgz file from Spark distribution in item 1 by right-clicking on the file icon and select 7-zip > Extract Here.

B. Installing PySpark

After getting all the items in section A, let’s set up PySpark.

  • Unpack the .tgz file. For example, I unpacked with 7zip from step A6 and put mine under D:\spark\spark-2.2.1-bin-hadoop2.7

  • Move the winutils.exe downloaded from step A3 to the \bin folder of Spark distribution. For example, D:\spark\spark-2.2.1-bin-hadoop2.7\bin\winutils.exe

  • Add environment variables: the environment variables let Windows find where the files are when we start the PySpark kernel. You can find the environment variable settings by putting “environ…” in the search box.

The variables to add are, in my example,

Name Value
SPARK_HOME D:\spark\spark-2.2.1-bin-hadoop2.7
HADOOP_HOME D:\spark\spark-2.2.1-bin-hadoop2.7
PYSPARK_DRIVER_PYTHON jupyter
PYSPARK_DRIVER_PYTHON_OPTS notebook

  • In the same environment variable settings window, look for the Path or PATH variable, click edit and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7 you need to separate the values in Path with a semicolon ; between the values.

  • (Optional, if see Java related error in step C) Find the installed Java JDK folder from step A5, for example, D:\Program Files\Java\jdk1.8.0_121, and add the following environment variable

Name Value
JAVA_HOME D:\Progra~1\Java\jdk1.8.0_121

If JDK is installed under \Program Files (x86), then replace the Progra~1 part by Progra~2 instead. In my experience, this error only occurs in Windows 7, and I think it’s because Spark couldn’t parse the space in the folder name.

Edit (1/23/19): You might also find Gerard’s comment helpful: http://disq.us/p/1z5qou4

C. Running PySpark in Jupyter Notebook

To run Jupyter notebook, open Windows command prompt or Git Bash and run jupyter notebook. If you use Anaconda Navigator to open Jupyter Notebook instead, you might see a Java gateway process exited before sending the driver its port number error from PySpark in step C. Fall back to Windows cmd if it happens.

Once inside Jupyter notebook, open a Python 3 notebook

In the notebook, run the following code

import findspark
findspark.init() import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() df = spark.sql('''select 'spark' as hello ''')
df.show()

When you press run, it might trigger a Windows firewall pop-up. I pressed cancel on the pop-up as blocking the connection doesn’t affect PySpark.

If you see the following output, then you have installed PySpark on your Windows system!

如何在Windows上的Jupyter Notebook中安装和运行PySpark的更多相关文章

  1. 解决在jupyter notebook中遇到的ImportError: matplotlib is required for plotting问题

    昨天学习pandas和matplotlib的过程中, 在jupyter notebook遇到ImportError: matplotlib is required for plotting错误, 以下 ...

  2. 在jupyter notebook 中同时使用安装不同版本的python内核-从而可以进行切换

    在安装anaconda的时候,默认安装的是python3.6 但是cs231n课程作业是在py2.7环境下运行的.所以需要在jupyter notebook中安装并启用python2.7版本 方法: ...

  3. 在jupyter notebook中同时安装python2和python3

    之前讨论过在anaconda下安装多个python版本,本期来讨论下,jupyter notebook中怎样同时安装python2.7 和python3.x. 由于我之前使用的jupyter note ...

  4. Windows下的Jupyter Notebook 安装与自定义启动(图文详解)

    不多说,直接上干货! 前期博客 Windows下的Python 3.6.1的下载与安装(适合32bits和64bits)(图文详解) 这是我自定义的Python 的安装目录 (D:\SoftWare\ ...

  5. 非线性函数的最小二乘拟合及在Jupyter notebook中输入公式 [原创]

    突然有个想法,能否通过学习一阶RC电路的阶跃响应得到RC电路的结构特征——时间常数τ(即R*C).回答无疑是肯定的,但问题是怎样通过最小二乘法.正规方程,以更多的采样点数来降低信号采集噪声对τ估计值的 ...

  6. Windows下的Jupyter Notebook 安装与自定义启动

    1.Jupyter Notebook 和 pip 为了更加方便地写 Python 代码,还需要安装 Jupyter notebook. 利用 pip 安装 Jupyter notebook. 为什么要 ...

  7. windows系统下jupyter notebook使用虚拟环境

    目录 [亲测好使]windows系统下jupyter notebook使用虚拟环境 在虚拟环境中安装jupyter,并添加到jupyter kernel 参考 [未测试,但觉得比上面那方法好,因为上面 ...

  8. (转)如何在Windows上安装多个MySQL

    原文:http://www.blogjava.net/hongjunli/archive/2009/03/01/257216.html 如何在Windows上安装多个MySQL 本文以免安装版的mys ...

  9. Redis简介以及如何在Windows上安装Redis

    Redis简介 Redis是一个速度非常快的非关系型内存数据库. Redis提供了Java,C/C++,C#,PHP,JavaScript,Perl,Object-C,Python,Ruby,Erla ...

随机推荐

  1. cygwin的sh: line 15: $'\r': command not found错误

    安装dos2unix工具,转换一下sh文件即可 apt-cyg install dos2unix dos2unix xxx.sh

  2. msmq访问格式

    //集群测试,以下格式不行(应是Host映射之类没配置OK) //_MSMQPath = @"FormatName:DIRECT=TCP:msmq496-ha\private$\496-10 ...

  3. 3.3.1 MyBatis框架介绍

    MyBatis框架介绍 1. 什么是框架 (1) 框架是偷懒的程序员将代码进行封装, 之后进行重复使用的过程. (2) 框架其实是一个半成品, 以连接数据库为例, 连接数据库使用的驱动, url, 用 ...

  4. SWIT2019无线通信和信息技术国际研讨会(上海)

    无线通信和信息技术国际研讨会(SWIT 2019)将于2019年6月29日至30日在中国上海皇冠晶品酒店举行.本次会议将讨论无线通信和信息技术问题.它致力于创造一个交流最新研究成果和分享先进研究方法的 ...

  5. Backup &recovery备份和还原

    实践版本:MySQL5.7 备份类型(backup type)物理和逻辑备份(Physical Versus Logical Backup)        物理备份是指直接复制存储数据库内容的目录和文 ...

  6. static 关键字的作用

    在C语言中,关键字static有三个明显的作用: 1)在函数体内,一个被声明为静态的变量在这一函数被调用过程中维持其值不变(该变量存放在静态变量区). 2) 在模块内(但在函数体外),一个被声明为静态 ...

  7. 获取上一行记录lag

    SELECT EMPLID ,EFFDT ,END_DT ,COMPANY ,DEPTID ,POSITION_NBR ,' ' ,' ' FROM ( SELECT J1.EMPLID ,J1.EF ...

  8. 微信网页浏览器打开链接后跳转到其他浏览器下载APK文件包

    做微信营销活动或者APK下载推广时候,是无法直接下载,做到微信中正常使用呢?这就要借助一些工具来实现有效的操作. 安卓手机的话是通过点击链接,直接跳转出微信.自动打开手机默认的浏览器.但是这个方法IO ...

  9. SSH整合时多表关联查询出现Javassist增强失败

    Customer类对应的表为另一个表LinkMan的外键,在进行LinkMan表操作时,出现如下错误. 遇到Javassist增强失败网上说法不一,有的说Customer没有无参构造方法,javass ...

  10. 【错误总结1:unity StartCoroutine 报 NullReferenceException 错误】

    今天在一个项目中,写了一个单例的全局类,该类的作用是使用协程加载场景.但在StartCoroutine 这一步报了NullReferenceException 的错.仔细分析和搜索之后,得到错误原因. ...