[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子
[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子
从如下地址获取文件:
https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro
导入到 hdfs 系统:
hdfs dfs -put episodes.avro
读入:
mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
交互式运行结果:
In [7]: mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver
In [8]: type(mydata001)
Out[8]: pyspark.sql.dataframe.DataFrame
In [9]: mydata001.count()
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD 16 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
Out[9]: 8
In [10]: mydata001.take(1)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April 2010', doctor=11)]
In [11]:
[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子的更多相关文章
- [Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子:
[Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子: mydf001=sqlContext.read.format("jdbc").o ...
- Spark中如何生成Avro文件
研究spark的目的之一就是要取代MR,目前我司MR的一个典型应用场景即为生成Avro文件,然后加载到HIVE表里,所以如何在Spark中生成Avro文件,就是必然之路了. 我本人由于对java不熟, ...
- [Spark][Python]Spark Python 索引页
Spark Python 索引页 为了查找方便,建立此页 === RDD 基本操作: [Spark][Python]groupByKey例子
- [spark][python]Spark map 处理
map 就是对一个RDD的各个元素都施加处理,得到一个新的RDD 的过程 [training@localhost ~]$ cat names.txtYear,First Name,County,Sex ...
- python批量读取txt文件为DataFrame
我们有时候会批量处理同一个文件夹下的文件,并且希望读取到一个文件里面便于我们计算操作.比方我有下图一系列的txt文件,我该如何把它们写入一个txt文件中并且读取为DataFrame格式呢? 首先我们要 ...
- Python抓取远程文件获取真实文件名
用urllib下载远程文件并转存到hdfs服务器,在下载时,下载地址中不一定包含文件名,需要从连接信息中获取. 1 file_url = request.form.get('file_url') 2 ...
- Python——urllib函数网络文件获取
*/ * Copyright (c) 2016,烟台大学计算机与控制工程学院 * All rights reserved. * 文件名:text.cpp * 作者:常轩 * 微信公众号:Worldhe ...
- [Spark][Python]Spark Join 小例子
[training@localhost ~]$ hdfs dfs -cat people.json {"name":"Alice","pcode&qu ...
- [Spark][python]以DataFrame方式打开Json文件的例子
[Spark][python]以DataFrame方式打开Json文件的例子: [training@localhost ~]$ cat people.json{"name":&qu ...
随机推荐
- Linux 磁盘分区方案简析
Linux 磁盘分区方案简析 by:授客 QQ:1033553122 磁盘分区 任何硬盘在使用前都要进行分区.硬盘的分区有两种类型:主分区和扩展分区.一个硬盘上最多只能有4个主分区,其中一个主分区 ...
- Web API与JWT认证
JWT(Json Web Token)定义了一种使用Json形式在网络间安全地传递信息的简洁开放的标准(RFC 7519).JWT使用数字签名确保信息是可信的. 一.Session认证和Token认 ...
- < meta http-equiv = "X-UA-Compatible" content = "IE=edge,chrome=1" />的意义
X-UA-Compatible是神马? X-UA-Compatible是IE8的一个专有<meta>属性,它告诉IE8采用何种IE版本去渲染网页,在html的<head>标签中 ...
- Dbvisualizer软件设置SQL语句的自动提示功能
之前从来没有使用过Dbvisualizer软件,用起来之后发现比mysqlfront不是好一点.之前一直不知道sql语句的自动提示功能,只能一个个单词输入,而且不是默认设置.之后在网上找到了怎么设置, ...
- 【HANA系列】SAP HANA XS使用JavaScript(JS)调用存储过程(Procedures)
公众号:SAP Technical 本文作者:matinal 原文出处:http://www.cnblogs.com/SAPmatinal/ 原文链接:[HANA系列]SAP HANA XS使用Jav ...
- Asp.Net配置不允许通过url方式访问目录下的资源
Asp.Net网站发布后,有部分文件为了安全性,是不能直接通过url访问获取 通常有2种做法: 1.将文件目录建立在 App_code 或者App_Data 等默认的隐藏目录下 2.将文件的目录添加到 ...
- myeclipse编写servlet
1.File--New--Other.搜索web--Dynamic Web Project--Next,Project name--Next,Next--web应用的根目录和web资源存放的目录--- ...
- Java中常用的字节流和字符流
IO流(输入流.输出流) 字节流.字符流 1.字节流: InputStream.OutputStream InputStream抽象了应用程序读取数据的方式: OutputStream抽象了应用程序写 ...
- vSphere ESXi 重新安装后的虚拟机恢复(转载)
安装的 ESXi 的物理主机密码忘记,登录 不上了,需要重新安装 ESXi,安装后恢复原先物理主机上的 虚拟机的方法如下(VMFS分区完好): 关于 VMFS 分区: ESXi 的安装时会划分一个分区 ...
- ELK-elasticsearch-6.3.2部署
参考博客:linux下ElasticSearch.6.2.2集群安装与head.Kibana.X-Pack..插件的配置安装 参考博客:ELK5.5.1 插件安装实践纪要(head/bigdesk/k ...