[Spark][python]以DataFrame方式打开Json文件的例子
[Spark][python]以DataFrame方式打开Json文件的例子:
[training@localhost ~]$ cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
[training@localhost ~]$
[training@localhost ~]$ hdfs dfs -put people.json
[training@localhost ~]$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
In [1]: sqlContext = HiveContext(sc)
In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame
In [4]:
[Spark][python]以DataFrame方式打开Json文件的例子的更多相关文章
- pycharm 打开json 文件 \2 自动成了转义字符
打开json 文件 \2 自动成了转义字符 暂时只发现在( \2 ) \ 后面为数字的情况下会出现转义json 文件为是指:在pycharm 中新建 file 后缀为json的文件 如: 1234.j ...
- [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子
[Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子 from pyspark.sql.types import * schema = Struct ...
- gdal以GA_Update方式打开jpg文件的做法
作者:朱金灿 来源:http://blog.csdn.net/clever101 gdal库是不支持以GA_Update方式打开jpg文件的,原因在于gdal_1_10_1\frmts\jpeg文件夹 ...
- C++->以读或写方式打开一个文件
以读或写方式打开一个文件 #include<iostream.h> //.h以C|非C标准引用库文件 #include<fstream.h> #include<std ...
- python webdriver 测试框架-数据驱动json文件驱动的方式
数据驱动json文件的方式 test_data_list.json: [ "邓肯||蒂姆", "乔丹||迈克尔", "库里||斯蒂芬", & ...
- Python【8】-分析json文件
一.本节用到的基础知识 1.逐行读取文件 for line in open('E:\Demo\python\json.txt'): print line 2.解析json字符串 Python中有一些内 ...
- Python3编写网络爬虫09-数据存储方式二-JSON文件存储
2.JSON文件存储 全称为JavaScript Object Notation 通过对象和数组的组合来表示数据,构造简洁且结构化程度非常高.是一种轻量级的数据交换格式 2.1 对象和数组 在Java ...
- VisualStudio如何以源码文本方式打开rc文件
视图 >> 解决方案资源管理器 >> 右击XXX.rc >> 打开方式 >> 源代码(文本)编辑器
- python 实现excel转化成json文件
1.准备工作 python 2.7 安装 安装xlrd -- pip install xlrd 2. 直接上代码 import xlrd from collections import Ordered ...
随机推荐
- Android开发专业名词及工具概述
前言: 系统的学习下Android开发中涉及到的一些专业名词 和Android开发工具 名词: 一.SDK(Software Development Kit) 软件开发工具包:一般都是一些软件工程师为 ...
- SG Input 软件安全分析之逆向分析
前言 通过本文介绍怎么对一个 windows 程序进行安全分析.分析的软件版本为 2018-10-9 , 所有相关文件的链接 链接:https://pan.baidu.com/s/1l6BuuL-HP ...
- Fiddler抓包使用教程-乱码处理 Decode
转载请标明出处:http://blog.csdn.net/zhaoyanjun6/article/details/73350344 本文出自[赵彦军的博客] 在 Fiddler 的工具栏中有一个 De ...
- html 知识整理
一. 前言 本文全面介绍了html的定义.使用和具体常用标签. 参考资料:菜鸟教程 二.定义 html是HyperText Markup Language的简称,也就是超文本标记语言的缩写.通过htm ...
- Javascript 高级程序设计--总结【二】
********************** Chapter 6 ********************** 属性: 数据属性: Configurable: 能否通过delete 删除属性,默认 ...
- 高通移植mipi LCD的过程LK代码
lk部分:(实现LCD兼容) 1. 函数定位 aboot_init()来到target_display_init(): 这就是高通原生lk LCD 兼容的关键所在.至于你需要兼容多少LCD 就在whi ...
- Deuteronomy
You should choose the right path when you can choose, and you should choose the right path even if y ...
- leetcode 5. Longest Palindromic Substring [java]
public String longestPalindrome(String s) { String rs = ""; int res = 0; for(int i = 0; i& ...
- 今天遇到一件开心事,在eclipse编写的代码在命令窗口中编译后无法运行,提示 “错误: 找不到或无法加载主类”
java中带package和不带package的编译运行方式是不同的. 首先来了解一下package的概念:简单定义为,package是一个为了方便管理组织java文件的目录结构,并防止不同java文 ...
- 深入探究JFreeChart
1 简介 JFreeChart 是 SourceForge.net 上的一个开源项目,它的源码和 API 都可以免费获得. JFreeChart 的功能非常强大,可以实现饼图 ( 二维和三维 ) , ...