[Spark][python]以DataFrame方式打开Json文件的例子
[Spark][python]以DataFrame方式打开Json文件的例子:
[training@localhost ~]$ cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
[training@localhost ~]$
[training@localhost ~]$ hdfs dfs -put people.json
[training@localhost ~]$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
In [1]: sqlContext = HiveContext(sc)
In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame
In [4]:
[Spark][python]以DataFrame方式打开Json文件的例子的更多相关文章
- pycharm 打开json 文件 \2 自动成了转义字符
打开json 文件 \2 自动成了转义字符 暂时只发现在( \2 ) \ 后面为数字的情况下会出现转义json 文件为是指:在pycharm 中新建 file 后缀为json的文件 如: 1234.j ...
- [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子
[Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子 from pyspark.sql.types import * schema = Struct ...
- gdal以GA_Update方式打开jpg文件的做法
作者:朱金灿 来源:http://blog.csdn.net/clever101 gdal库是不支持以GA_Update方式打开jpg文件的,原因在于gdal_1_10_1\frmts\jpeg文件夹 ...
- C++->以读或写方式打开一个文件
以读或写方式打开一个文件 #include<iostream.h> //.h以C|非C标准引用库文件 #include<fstream.h> #include<std ...
- python webdriver 测试框架-数据驱动json文件驱动的方式
数据驱动json文件的方式 test_data_list.json: [ "邓肯||蒂姆", "乔丹||迈克尔", "库里||斯蒂芬", & ...
- Python【8】-分析json文件
一.本节用到的基础知识 1.逐行读取文件 for line in open('E:\Demo\python\json.txt'): print line 2.解析json字符串 Python中有一些内 ...
- Python3编写网络爬虫09-数据存储方式二-JSON文件存储
2.JSON文件存储 全称为JavaScript Object Notation 通过对象和数组的组合来表示数据,构造简洁且结构化程度非常高.是一种轻量级的数据交换格式 2.1 对象和数组 在Java ...
- VisualStudio如何以源码文本方式打开rc文件
视图 >> 解决方案资源管理器 >> 右击XXX.rc >> 打开方式 >> 源代码(文本)编辑器
- python 实现excel转化成json文件
1.准备工作 python 2.7 安装 安装xlrd -- pip install xlrd 2. 直接上代码 import xlrd from collections import Ordered ...
随机推荐
- mybatis学习系列--逆向工程简单使用及mybatis原理
2逆向工程简单测试(68-70) SqlSessionFactory sqlSessionFactory=getSqlSessionFactory(); SqlSession session = sq ...
- Django 知识总结(一)
Django已经学过的知识点: 1. Urls.py 路由系统: 正则 分组匹配 --> 位置参数 分组命名匹配 --> 关键字参数 分级路由 include 给路由起别名 name=&q ...
- 【HANA系列】SAP HANA XS的JavaScript安全事项
公众号:SAP Technical 本文作者:matinal 原文出处:http://www.cnblogs.com/SAPmatinal/ 原文链接:[HANA系列]SAP HANA XS使用Jav ...
- 字符串相似度算法-LEVENSHTEIN DISTANCE算法
Levenshtein Distance 算法,又叫 Edit Distance 算法,是指两个字符串之间,由一个转成另一个所需的最少编辑操作次数.许可的编辑操作包括将一个字符替换成另一个字符,插入一 ...
- January 12th, 2018 Week 02nd Friday
Nothing behind me, everything ahead of me, as is ever so on the road. 我的身后空空荡荡,整个世界都在前方,这就是在路上. That ...
- 静态库lib调试
1.首先生成lib文件的解决方案编译通过. 2.将最新的lib和头文件在需要调用的exe中,替换掉. 3.复制需要调试的cpp文件到exe解决方案下,并添加现有项. 4.运行无错误后,添加断点即可调试 ...
- Android Activity学习笔记——Activity的启动和创建
http://www.cnblogs.com/bastard/archive/2012/04/07/2436262.html 最近学习Android相关知识,感觉仅仅了解Activity几个生命周期函 ...
- Server版Linux命令提示符揭秘
一直都在Ubuntu12.04和12.10 Desktop下玩.如今要在Centos6.3 Server版下做开发了,感觉还是非常不一样的. 克服一个有一个不顺利后,有那种站在山顶的 ...
- laravel 使用构造器进行增删改查
使用原生语句进行增删改查 //$list = DB::select('select * from wt_category where id = :id', ['id' => 34]); //$i ...
- pump模块的学习-metamask
pump = require('pump') pump简介 https://github.com/terinjokes/gulp-uglify/blob/master/docs/why-use-pum ...