Start PySpark:

[root@node1 ~]# pyspark
Python 2.7.5 (default, Nov  6 2016, 00:28:07)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Nov  6 2016 00:28:07)
SparkContext available as sc, HiveContext available as sqlContext.

The shell context already provides sc and sqlContext:

SparkContext available as sc, HiveContext available as sqlContext.
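
A quick sanity check that both objects are ready (a minimal example; exact output depends on the environment):

>>> sc.version
u'1.6.0'
>>> type(sqlContext)
<class 'pyspark.sql.context.HiveContext'>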

Run the following in the shell:

>>> from __future__ import print_function
>>> import os
>>> import sys
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
# RDD is created from a list of rows
>>> some_rdd = sc.parallelize([Row(name="John", age=19), Row(name="Smith", age=23), Row(name="Sarah", age=18)])
# Infer schema from the first row, create a DataFrame and print the schema
>>> some_df = sqlContext.createDataFrame(some_rdd)
>>> some_df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
# Another RDD is created from a list of tuples
>>> another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two fields - person_name and person_age
>>> schema = StructType([StructField("person_name", StringType(), False), StructField("person_age", IntegerType(), False)])
# Create a DataFrame by applying the schema to the RDD and print the schema
>>> another_df = sqlContext.createDataFrame(another_rdd, schema)
>>> another_df.printSchema()
root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)
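
Note that schema inference mapped the Python int to long and left both fields nullable, whereas the explicit schema declares IntegerType with nullable = false. The rows themselves can be inspected with collect() (output shown from a Python 2 shell; the exact repr may vary):

>>> another_df.collect()
[Row(person_name=u'John', person_age=19), Row(person_name=u'Smith', person_age=23), Row(person_name=u'Sarah', person_age=18)]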

Download the people.json file from GitHub:

and upload it to HDFS:
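
For example (a sketch: the raw URL points at the people.json in Spark's branch-1.6 source tree, and the HDFS path matches the one used below):

[root@node1 ~]# wget https://raw.githubusercontent.com/apache/spark/branch-1.6/examples/src/main/resources/people.json
[root@node1 ~]# hdfs dfs -put people.json /user/cf/people.json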

Continue in the PySpark shell:

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
>>> if len(sys.argv) < 2:
...     path = "/user/cf/people.json"
... else:
...     path = sys.argv[1]
...
# Create a DataFrame from the file(s) pointed to by path
>>> people = sqlContext.jsonFile(path)
[Stage 5:>                                                          (0 + 1) / 2]19/07/04 10:34:33 WARN spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
# The inferred schema can be visualized using the printSchema() method.
>>> people.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
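
Note that jsonFile() has been deprecated since Spark 1.4; the equivalent DataFrameReader call would be:

>>> people = sqlContext.read.json(path)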

# Register this DataFrame as a table.
>>> people.registerAsTable("people")
/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/sql/dataframe.py:142: UserWarning: Use registerTempTable instead of registerAsTable.
  warnings.warn("Use registerTempTable instead of registerAsTable.")
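
As the warning suggests, the non-deprecated call is:

>>> people.registerTempTable("people")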
# SQL statements can be run by using the sql methods provided by sqlContext
>>> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
>>> for each in teenagers.collect():
...     print(each[0])
...
Justin   
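
Justin is the only person in people.json whose age falls in the teenage range. The same query can also be expressed with the DataFrame API instead of SQL (an equivalent sketch):

>>> people.filter((people.age >= 13) & (people.age <= 19)).select("name").collect()
[Row(name=u'Justin')]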

End the session:

>>> sc.stop()
>>> 

Reference program:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType

if __name__ == "__main__":
    sc = SparkContext(appName="PythonSQL")
    sqlContext = SQLContext(sc)

    # RDD is created from a list of rows
    some_rdd = sc.parallelize([Row(name="John", age=19),
                              Row(name="Smith", age=23),
                              Row(name="Sarah", age=18)])
    # Infer schema from the first row, create a DataFrame and print the schema
    some_df = sqlContext.createDataFrame(some_rdd)
    some_df.printSchema()

    # Another RDD is created from a list of tuples
    another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
    # Schema with two fields - person_name and person_age
    schema = StructType([StructField("person_name", StringType(), False),
                        StructField("person_age", IntegerType(), False)])
    # Create a DataFrame by applying the schema to the RDD and print the schema
    another_df = sqlContext.createDataFrame(another_rdd, schema)
    another_df.printSchema()
    # root
    #  |-- person_name: string (nullable = false)
    #  |-- person_age: integer (nullable = false)

    # A JSON dataset is pointed to by path.
    # The path can be either a single text file or a directory storing text files.
    if len(sys.argv) < 2:
        path = "file://" + \
            os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
    else:
        path = sys.argv[1]
    # Create a DataFrame from the file(s) pointed to by path.
    # (jsonFile() is deprecated since Spark 1.4; read.json() is the current API.)
    people = sqlContext.read.json(path)

    # The inferred schema can be visualized using the printSchema() method.
    people.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)

    # Register this DataFrame as a temporary table.
    # (registerAsTable() is deprecated; registerTempTable() is the current call.)
    people.registerTempTable("people")

    # SQL statements can be run by using the sql methods provided by sqlContext
    teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    for each in teenagers.collect():
        print(each[0])

    sc.stop()
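
Assuming the program above is saved as sql.py (the filename is just a placeholder), it can be submitted with the HDFS path as its argument:

spark-submit sql.py /user/cf/people.json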
