[Spark][python]以DataFrame方式打开Json文件的例子
[Spark][python]以DataFrame方式打开Json文件的例子:
[training@localhost ~]$ cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
[training@localhost ~]$
[training@localhost ~]$ hdfs dfs -put people.json
[training@localhost ~]$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
In [1]: sqlContext = HiveContext(sc)
In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame
In [4]:
[Spark][python]以DataFrame方式打开Json文件的例子的更多相关文章
- pycharm 打开json 文件 \2 自动成了转义字符
打开json 文件 \2 自动成了转义字符 暂时只发现在( \2 ) \ 后面为数字的情况下会出现转义json 文件为是指:在pycharm 中新建 file 后缀为json的文件 如: 1234.j ...
- [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子
[Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子 from pyspark.sql.types import * schema = Struct ...
- gdal以GA_Update方式打开jpg文件的做法
作者:朱金灿 来源:http://blog.csdn.net/clever101 gdal库是不支持以GA_Update方式打开jpg文件的,原因在于gdal_1_10_1\frmts\jpeg文件夹 ...
- C++->以读或写方式打开一个文件
以读或写方式打开一个文件 #include<iostream.h> //.h以C|非C标准引用库文件 #include<fstream.h> #include<std ...
- python webdriver 测试框架-数据驱动json文件驱动的方式
数据驱动json文件的方式 test_data_list.json: [ "邓肯||蒂姆", "乔丹||迈克尔", "库里||斯蒂芬", & ...
- Python【8】-分析json文件
一.本节用到的基础知识 1.逐行读取文件 for line in open('E:\Demo\python\json.txt'): print line 2.解析json字符串 Python中有一些内 ...
- Python3编写网络爬虫09-数据存储方式二-JSON文件存储
2.JSON文件存储 全称为JavaScript Object Notation 通过对象和数组的组合来表示数据,构造简洁且结构化程度非常高.是一种轻量级的数据交换格式 2.1 对象和数组 在Java ...
- VisualStudio如何以源码文本方式打开rc文件
视图 >> 解决方案资源管理器 >> 右击XXX.rc >> 打开方式 >> 源代码(文本)编辑器
- python 实现excel转化成json文件
1.准备工作 python 2.7 安装 安装xlrd -- pip install xlrd 2. 直接上代码 import xlrd from collections import Ordered ...
随机推荐
- Scrum敏捷开发沉思录
计算机科学的诞生,是世人为了用数字手段解决实际生活中的问题.随着时代的发展,技术的进步,人们对于现实世界中的问题理解越来越深刻,描述也越来越抽象,于是对计算机软件的需求也越来越高,越来越复杂,变化也越 ...
- JS笔记(一):基础知识
(一) 标识符 标识符就是一个名字,在JS中,标识符用来对变量和函数命名,或者用做JS代码中某些循环语句中的跳转位置的标记.JS的标识符必须以字母._或$符号开始,后续字符可以是字母.数字._或$符号 ...
- scrapy系列(四)——CrawlSpider解析
CrawlSpider也继承自Spider,所以具备它的所有特性,这些特性上章已经讲过了,就再在赘述了,这章就讲点它本身所独有的. 参与过网站后台开发的应该会知道,网站的url都是有一定规则的.像dj ...
- DB2表被锁,如何解锁
原因与解决方案 1.原因:修改表结构表结构发生变化后再对表进行任何操作都不被允许,SQLState为57016(因为表不活动,所以不能对其进行访问),由于修改了表字段权限,导致表处于不可用状态 2.解 ...
- 简单Nginx下防跨站、跨目录安全设置,支持PHP 5.3.3以上版本
Nginx下存在跨站和跨目录的问题,跨站和跨目录影响同服务器/VPS上的其他网站. PHP在5.3.3以上已经增加了HOST配置,可以起到防跨站.跨目录的问题. 如果你是PHP 5.3.3以上的版本, ...
- Java客户端连接kafka集群报错
往kafka集群发送消息时,报错如下: page_visits-1: 30005 ms has passed since batch creation plus linger time 加入log4j ...
- jQuery的收尾
一 后台管理布局增删改 二 常用事件 三 jQuery扩展 一 后台管理布局增删改(多种方法) <!DOCTYPE html> <!-- saved from url=(00 ...
- python五十六课——正则表达式(常用函数之match)
函数:match(regex,string,[flags=0])参数:regex:就是正则表达式(定义了一套验证规则)string:需要被验证的字符串数据flags:模式/标志位,默认情况下(不定义) ...
- centos7下安装docker(3.2创建镜像build)
通过Dockerfile创建镜像 注:这个Dockerfile一开始真的不知道是在哪来的,还以为是在官网下载下来得(当然网上也有很多dockerfile的模板,参考:https://hub.docke ...
- 2017-2018-2 20155314《网络对抗技术》Exp2 后门原理与实践
2017-2018-2 20155314<网络对抗技术>Exp2 后门原理与实践 目录 实验要求 实验内容 实验环境 预备知识 1.后门概念 2.常用后门工具 实验步骤 1 用nc或net ...