[Spark][Python]DataFrame的左右连接例子

$ hdfs dfs -cat people.json

{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}

$ hdfs dfs -cat pcodes.json

{"pcode":"10036","city":"New York","state":"NY"}
{"pcode":"87501","city":"Santa Fe","state":"NM"}
{"pcode":"94304","city":"Palo Alto","state":"CA"}
{"pcode":"94104","city":"San Francisco","state":"CA"}

$pyspark

sqlContext = HiveContext(sc)
peopleDF = sqlContext.read.json("people.json")
peopleDF.limit(5).show()

+----+-------+-----+-----+
| age| name|pcode| pcoe|
+----+-------+-----+-----+
|null| Alice|94304| null|
| 30|Brayden|94304| null|
| 19| Carla| null|10036|
| 46| Diana| null| null|
|null|Etienne|94104| null|
+----+-------+-----+-----+

sqlContext = HiveContext(sc)
pcodesDF = sqlContext.read.json("pcodes.json")
pcodesDF.limit(5).show()

+-------------+-----+-----+
| city|pcode|state|
+-------------+-----+-----+
| New York|10036| NY|
| Santa Fe|87501| NM|
| Palo Alto|94304| CA|
|San Francisco|94104| CA|
+-------------+-----+-----+

mydf000 = peopleDF.join(pcodesDF,"pcode")
mydf000.limit(5).show()

+-----+----+-------+----+-------------+-----+
|pcode| age| name|pcoe| city|state|
+-----+----+-------+----+-------------+-----+
|94304|null| Alice|null| Palo Alto| CA|
|94304| 30|Brayden|null| Palo Alto| CA|
|94104|null|Etienne|null|San Francisco| CA|
+-----+----+-------+----+-------------+-----+

mydf001=peopleDF.join(pcodesDF,"pcode","leftsemi")
mydf001.limit(5).show()

+-----+----+-------+----+
|pcode| age| name|pcoe|
+-----+----+-------+----+
|94304|null| Alice|null|
|94304| 30|Brayden|null|
|94104|null|Etienne|null|
+-----+----+-------+----+

mydf002=peopleDF.join(pcodesDF,"pcode","left_outer")
mydf002.limit(5).show()

+-----+----+-------+-----+-------------+-----+
|pcode| age| name| pcoe| city|state|
+-----+----+-------+-----+-------------+-----+
|94304|null| Alice| null| Palo Alto| CA|
|94304| 30|Brayden| null| Palo Alto| CA|
| null| 19| Carla|10036| null| null|
| null| 46| Diana| null| null| null|
|94104|null|Etienne| null|San Francisco| CA|
+-----+----+-------+-----+-------------+-----+

mydf003=peopleDF.join(pcodesDF,"pcode","right_outer")
mydf003.limit(5).show()

+-----+----+-------+----+-------------+-----+
|pcode| age| name|pcoe| city|state|
+-----+----+-------+----+-------------+-----+
|10036|null| null|null| New York| NY|
|87501|null| null|null| Santa Fe| NM|
|94304|null| Alice|null| Palo Alto| CA|
|94304| 30|Brayden|null| Palo Alto| CA|
|94104|null|Etienne|null|San Francisco| CA|
+-----+----+-------+----+-------------+-----+

[Spark][Python]DataFrame的左右连接例子的更多相关文章

  1. [Spark][Python][DataFrame][RDD]DataFrame中抽取RDD例子

    [Spark][Python][DataFrame][RDD]DataFrame中抽取RDD例子 sqlContext = HiveContext(sc) peopleDF = sqlContext. ...

  2. [Spark][Python][DataFrame][RDD]从DataFrame得到RDD的例子

    [Spark][Python][DataFrame][RDD]从DataFrame得到RDD的例子 $ hdfs dfs -cat people.json {"name":&quo ...

  3. [Spark][Python][DataFrame][Write]DataFrame写入的例子

    [Spark][Python][DataFrame][Write]DataFrame写入的例子 $ hdfs dfs -cat people.json {"name":" ...

  4. [Spark][Python][DataFrame][SQL]Spark对DataFrame直接执行SQL处理的例子

    [Spark][Python][DataFrame][SQL]Spark对DataFrame直接执行SQL处理的例子 $cat people.json {"name":" ...

  5. [Spark][Python]DataFrame where 操作例子

    [Spark][Python]DataFrame中取出有限个记录的例子 的 继续 [15]: myDF=peopleDF.where("age>21") In [16]: m ...

  6. [Spark][Python]DataFrame select 操作例子

    [Spark][Python]DataFrame中取出有限个记录的例子 的 继续 In [4]: peopleDF.select("age")Out[4]: DataFrame[a ...

  7. [Spark][Python]DataFrame中取出有限个记录的例子

    [Spark][Python]DataFrame中取出有限个记录的例子: sqlContext = HiveContext(sc) peopleDF = sqlContext.read.json(&q ...

  8. [Spark][Python]DataFrame select 操作例子II

    [Spark][Python]DataFrame中取出有限个记录的   继续 In [4]: peopleDF.select("age","name") In ...

  9. [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子

    [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子 from pyspark.sql.types import * schema = Struct ...

随机推荐

  1. Five Android layouts

    线性布局: 1 <?xml version="1.0" encoding="utf-8"?> 2 <LinearLayout xmlns:an ...

  2. JS笔记(一):基础知识

    (一) 标识符 标识符就是一个名字,在JS中,标识符用来对变量和函数命名,或者用做JS代码中某些循环语句中的跳转位置的标记.JS的标识符必须以字母._或$符号开始,后续字符可以是字母.数字._或$符号 ...

  3. 前端测试框架jest 简介

    转自: https://www.cnblogs.com/Wolfmanlq/p/8012847.html 作者:Ken Wang 出处:http://www.cnblogs.com/Wolfmanlq ...

  4. NB-IOT模块 M5310-A接入百度开放云IOT Hub MQTT

    目录 1.登陆百度开放云,在产品服务中选择IOT HUB 2 2.选择 创建计费套餐,目前1百万条/每月是免费的 2 3.点击管理控制台进入项目列表 4 4. 点击创建项目,项目类型选择数据型 4 5 ...

  5. Sublime Text 3 注册码激活码被移除的解决办法

    新版的sublime text3中加入了验证功能,之前成功注册的也被移除了,在网上搜索的验证码要么已经失效要么已经被封,少数几个正常的注册输入进去注册成功后几分钟之内这个注册码就会被莫名其妙的被移除. ...

  6. 【PAT】B1061 判断题(15 分)

    简单逻辑题, #include<stdio.h> #include<algorithm> using namespace std; int main(){ int N,M;// ...

  7. python3编写网络爬虫13-Ajax数据爬取

    一.Ajax数据爬取 1. 简介:Ajax 全称Asynchronous JavaScript and XML 异步的Javascript和XML. 它不是一门编程语言,而是利用JavaScript在 ...

  8. IE浏览器打不开网页的解决方法

    前阵子一下子安装了很多软件,后来使用IE游览器的时候,莫名其妙的打不开网页,虽然用其他浏览器(比如谷歌.火狐)可以正常浏览网页,但是由于很多软件内嵌页面都会调用Windows的IE浏览器来加载,所以I ...

  9. [bug] 验证selenium的显式和隐式等待而发现的一个低级错误

    隐式等待:如果在规定时间内网页加载完成,则执行下一步,否则一直等到时间截止,然后执行下一步.按照这说法举了个例子为啥不会按照预期执行了,难不成是这个定义有问题(~~~~~直接否定不是定义的问题,相信它 ...

  10. python五十八课——正则表达式(替换)

    替换:sub(regex,repl,string,count,[flags=0]): 替换数据,返回字符串(已经被替换完成后的内容)subn(regex,repl,string,count,[flag ...