map applies a function to every element of an RDD and returns the results as a new RDD.

[training@localhost ~]$ cat names.txt
Year,First Name,County,Sex,Count
2012,DOMINIC,CAYUGA,M,6
2012,ADDISON,ONONDAGA,F,14
2012,ADDISON,ONONDAGA,F,14
2012,JULIA,ONONDAGA,F,15
[training@localhost ~]$ hdfs dfs -put names.txt
[training@localhost ~]$ hdfs dfs -cat names.txt
Year,First Name,County,Sex,Count
2012,DOMINIC,CAYUGA,M,6
2012,ADDISON,ONONDAGA,F,14
2012,ADDISON,ONONDAGA,F,14
2012,JULIA,ONONDAGA,F,15
[training@localhost ~]$

In [98]: t_names = sc.textFile("names.txt")
17/09/24 06:24:22 INFO storage.MemoryStore: Block broadcast_27 stored as values in memory (estimated size 230.5 KB, free 2.3 MB)
17/09/24 06:24:23 INFO storage.MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 21.5 KB, free 2.3 MB)
17/09/24 06:24:23 INFO storage.BlockManagerInfo: Added broadcast_27_piece0 in memory on localhost:33950 (size: 21.5 KB, free: 208.6 MB)
17/09/24 06:24:23 INFO spark.SparkContext: Created broadcast 27 from textFile at NativeMethodAccessorImpl.java:-2

In [99]: rows=t_names.map(lambda line: line.split(","))

In [100]: rows.take(1)

17/09/24 06:25:23 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/24 06:25:23 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
17/09/24 06:25:23 INFO scheduler.DAGScheduler: Got job 15 (runJob at PythonRDD.scala:393) with 1 output partitions
17/09/24 06:25:23 INFO scheduler.DAGScheduler: Final stage: ResultStage 15 (runJob at PythonRDD.scala:393)
17/09/24 06:25:23 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/09/24 06:25:23 INFO scheduler.DAGScheduler: Missing parents: List()
17/09/24 06:25:23 INFO scheduler.DAGScheduler: Submitting ResultStage 15 (PythonRDD[46] at RDD at PythonRDD.scala:43), which has no missing parents
17/09/24 06:25:23 INFO storage.MemoryStore: Block broadcast_28 stored as values in memory (estimated size 5.2 KB, free 2.3 MB)
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_26_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 8
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_18_piece0 on localhost:33950 in memory (size: 3.7 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO storage.MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 3.3 KB, free 2.3 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 9
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_19_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 10
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Added broadcast_28_piece0 in memory on localhost:33950 (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:1006
17/09/24 06:25:24 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 15 (PythonRDD[46] at RDD at PythonRDD.scala:43)
17/09/24 06:25:24 INFO scheduler.TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_20_piece0 on localhost:33950 in memory (size: 3.7 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 11
17/09/24 06:25:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 15.0 (TID 15, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_21_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 12
17/09/24 06:25:24 INFO executor.Executor: Running task 0.0 in stage 15.0 (TID 15)
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_22_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 13
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_23_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 14
17/09/24 06:25:24 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/names.txt:0+136
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_24_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 15
17/09/24 06:25:24 INFO storage.BlockManagerInfo: Removed broadcast_25_piece0 on localhost:33950 in memory (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:24 INFO spark.ContextCleaner: Cleaned accumulator 16
17/09/24 06:25:24 INFO python.PythonRunner: Times: total = 78, boot = 49, init = 25, finish = 4
17/09/24 06:25:24 INFO executor.Executor: Finished task 0.0 in stage 15.0 (TID 15). 2203 bytes result sent to driver
17/09/24 06:25:24 INFO scheduler.DAGScheduler: ResultStage 15 (runJob at PythonRDD.scala:393) finished in 0.438 s
17/09/24 06:25:24 INFO scheduler.DAGScheduler: Job 15 finished: runJob at PythonRDD.scala:393, took 1.160085 s
17/09/24 06:25:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 15.0 (TID 15) in 429 ms on localhost (1/1)
17/09/24 06:25:24 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
Out[100]: [[u'Year', u'First Name', u'County', u'Sex', u'Count']]

In [101]: rows.take(2)
17/09/24 06:25:29 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Got job 16 (runJob at PythonRDD.scala:393) with 1 output partitions
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 16 (runJob at PythonRDD.scala:393)
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Missing parents: List()
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Submitting ResultStage 16 (PythonRDD[47] at RDD at PythonRDD.scala:43), which has no missing parents
17/09/24 06:25:29 INFO storage.MemoryStore: Block broadcast_29 stored as values in memory (estimated size 5.2 KB, free 2.2 MB)
17/09/24 06:25:29 INFO storage.MemoryStore: Block broadcast_29_piece0 stored as bytes in memory (estimated size 3.3 KB, free 2.2 MB)
17/09/24 06:25:29 INFO storage.BlockManagerInfo: Added broadcast_29_piece0 in memory on localhost:33950 (size: 3.3 KB, free: 208.6 MB)
17/09/24 06:25:29 INFO spark.SparkContext: Created broadcast 29 from broadcast at DAGScheduler.scala:1006
17/09/24 06:25:29 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 16 (PythonRDD[47] at RDD at PythonRDD.scala:43)
17/09/24 06:25:29 INFO scheduler.TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
17/09/24 06:25:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 16.0 (TID 16, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/09/24 06:25:29 INFO executor.Executor: Running task 0.0 in stage 16.0 (TID 16)
17/09/24 06:25:29 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/names.txt:0+136
17/09/24 06:25:29 INFO python.PythonRunner: Times: total = 71, boot = 25, init = 45, finish = 1
17/09/24 06:25:29 INFO executor.Executor: Finished task 0.0 in stage 16.0 (TID 16). 2267 bytes result sent to driver
17/09/24 06:25:30 INFO scheduler.DAGScheduler: ResultStage 16 (runJob at PythonRDD.scala:393) finished in 0.196 s
17/09/24 06:25:30 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 16.0 (TID 16) in 202 ms on localhost (1/1)
17/09/24 06:25:30 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
17/09/24 06:25:30 INFO scheduler.DAGScheduler: Job 16 finished: runJob at PythonRDD.scala:393, took 0.408908 s
Out[101]:
[[u'Year', u'First Name', u'County', u'Sex', u'Count'],
[u'2012', u'DOMINIC', u'CAYUGA', u'M', u'6']]

In [102]:
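For readers without a Spark installation at hand, the per-element behaviour of map shown in the session above can be sketched in plain Python. This is only an analogy, not Spark code: the list `lines` is a hypothetical stand-in for the `t_names` RDD, Python's built-in `map` stands in for `RDD.map`, and a list slice stands in for `take(2)`.

```python
# Plain-Python stand-in for the RDD: each list element is one line of names.txt.
# (Assumption: the file contents are copied by hand from the listing above.)
lines = [
    "Year,First Name,County,Sex,Count",
    "2012,DOMINIC,CAYUGA,M,6",
    "2012,ADDISON,ONONDAGA,F,14",
    "2012,ADDISON,ONONDAGA,F,14",
    "2012,JULIA,ONONDAGA,F,15",
]

# Same lambda as in the session; the built-in map applies it to every element.
# list() forces evaluation, since Python 3's map returns a lazy iterator.
rows = list(map(lambda line: line.split(","), lines))

# Rough equivalent of rows.take(2): the first two transformed elements.
first_two = rows[:2]
print(first_two)
```

As in the Spark session, the input is left untouched and one output element is produced per input element. The laziness is also comparable: in the session, no job is started at In [99] when map is called; the scheduler log lines only appear once take() is invoked.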

Source:

https://www.supergloo.com/fieldnotes/apache-spark-transformations-python-examples/
