Counting word frequencies in Spark
I wrote a simple statement, not yet tidied up:
scala> sc.
| textFile("/etc/profile").
| flatMap((s:String)=>s.split("\\s")).
| map(_.toUpperCase).
| map((s:String)=>(s, 1)).
| filter((pair)=>pair._1.forall((ch)=>ch>'A'&&ch<'Z')).
| reduceByKey(_+_).
| sortByKey().
| foreach(println)
Note that this code can be shortened further with Scala's placeholder syntax (the behavior is unchanged):
scala> sc.
| textFile("/etc/profile").
| flatMap(_.split("\\s")).
| map(_.toUpperCase).
| map((_, 1)).
| filter(_._1.forall((ch)=>ch>'A'&&ch<'Z')).
| reduceByKey(_+_).
| sortByKey().
| foreach(println)
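The output is as follows. The Spark INFO log lines are interleaved with the printed pairs. Each partition is sorted, but foreach(println) prints each partition from its own task, so the two sorted blocks appear in the order the tasks ran rather than in one global order: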
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(75904) called with curMem=259812, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 74.1 KB, free 264.7 MB)
15/03/06 08:50:44 INFO FileInputFormat: Total input paths to process : 1
15/03/06 08:50:44 INFO SparkContext: Starting job: sortByKey at <console>:20
15/03/06 08:50:44 INFO DAGScheduler: Registering RDD 25 (filter at <console>:18)
15/03/06 08:50:44 INFO DAGScheduler: Got job 4 (sortByKey at <console>:20) with 2 output partitions (allowLocal=false)
15/03/06 08:50:44 INFO DAGScheduler: Final stage: Stage 10(sortByKey at <console>:20)
15/03/06 08:50:44 INFO DAGScheduler: Parents of final stage: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Missing parents: List(Stage 11)
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 11 (FilteredRDD[25] at filter at <console>:18), which has no missing parents
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(3736) called with curMem=335716, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 3.6 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 11 (FilteredRDD[25] at filter at <console>:18)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 11.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 16, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 17, localhost, PROCESS_LOCAL, 1162 bytes)
15/03/06 08:50:44 INFO Executor: Running task 1.0 in stage 11.0 (TID 17)
15/03/06 08:50:44 INFO Executor: Running task 0.0 in stage 11.0 (TID 16)
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:1189+1189
15/03/06 08:50:44 INFO HadoopRDD: Input split: file:/etc/profile:0+1189
15/03/06 08:50:44 INFO Executor: Finished task 1.0 in stage 11.0 (TID 17). 1863 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 1.0 in stage 11.0 (TID 17) in 43 ms on localhost (1/2)
15/03/06 08:50:44 INFO Executor: Finished task 0.0 in stage 11.0 (TID 16). 1863 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 0.0 in stage 11.0 (TID 16) in 51 ms on localhost (2/2)
15/03/06 08:50:44 INFO DAGScheduler: Stage 11 (filter at <console>:18) finished in 0.054 s
15/03/06 08:50:44 INFO DAGScheduler: looking for newly runnable stages
15/03/06 08:50:44 INFO DAGScheduler: running: Set()
15/03/06 08:50:44 INFO DAGScheduler: waiting: Set(Stage 10)
15/03/06 08:50:44 INFO DAGScheduler: failed: Set()
15/03/06 08:50:44 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool
15/03/06 08:50:44 INFO DAGScheduler: Missing parents for Stage 10: List()
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 10 (MapPartitionsRDD[28] at sortByKey at <console>:20), which is now runnable
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(2856) called with curMem=339452, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 2.8 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 10 (MapPartitionsRDD[28] at sortByKey at <console>:20)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 10.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 18, localhost, PROCESS_LOCAL, 948 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 10.0 (TID 19, localhost, PROCESS_LOCAL, 948 bytes)
15/03/06 08:50:44 INFO Executor: Running task 0.0 in stage 10.0 (TID 18)
15/03/06 08:50:44 INFO Executor: Running task 1.0 in stage 10.0 (TID 19)
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/06 08:50:44 INFO Executor: Finished task 0.0 in stage 10.0 (TID 18). 1165 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 18) in 18 ms on localhost (1/2)
15/03/06 08:50:44 INFO Executor: Finished task 1.0 in stage 10.0 (TID 19). 1293 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 1.0 in stage 10.0 (TID 19) in 28 ms on localhost (2/2)
15/03/06 08:50:44 INFO DAGScheduler: Stage 10 (sortByKey at <console>:20) finished in 0.031 s
15/03/06 08:50:44 INFO TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool
15/03/06 08:50:44 INFO SparkContext: Job finished: sortByKey at <console>:20, took 0.107864348 s
15/03/06 08:50:44 INFO SparkContext: Starting job: foreach at <console>:21
15/03/06 08:50:44 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 4 is 144 bytes
15/03/06 08:50:44 INFO DAGScheduler: Registering RDD 26 (reduceByKey at <console>:19)
15/03/06 08:50:44 INFO DAGScheduler: Got job 5 (foreach at <console>:21) with 2 output partitions (allowLocal=false)
15/03/06 08:50:44 INFO DAGScheduler: Final stage: Stage 12(foreach at <console>:21)
15/03/06 08:50:44 INFO DAGScheduler: Parents of final stage: List(Stage 14)
15/03/06 08:50:44 INFO DAGScheduler: Missing parents: List(Stage 14)
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 14 (ShuffledRDD[26] at reduceByKey at <console>:19), which has no missing parents
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(2472) called with curMem=342308, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 2.4 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 14 (ShuffledRDD[26] at reduceByKey at <console>:19)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 14.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 20, localhost, PROCESS_LOCAL, 937 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 14.0 (TID 21, localhost, PROCESS_LOCAL, 937 bytes)
15/03/06 08:50:44 INFO Executor: Running task 1.0 in stage 14.0 (TID 21)
15/03/06 08:50:44 INFO Executor: Running task 0.0 in stage 14.0 (TID 20)
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:44 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/03/06 08:50:44 INFO Executor: Finished task 1.0 in stage 14.0 (TID 21). 996 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 1.0 in stage 14.0 (TID 21) in 14 ms on localhost (1/2)
15/03/06 08:50:44 INFO Executor: Finished task 0.0 in stage 14.0 (TID 20). 996 bytes result sent to driver
15/03/06 08:50:44 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 20) in 21 ms on localhost (2/2)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
15/03/06 08:50:44 INFO DAGScheduler: Stage 14 (reduceByKey at <console>:19) finished in 0.022 s
15/03/06 08:50:44 INFO DAGScheduler: looking for newly runnable stages
15/03/06 08:50:44 INFO DAGScheduler: running: Set()
15/03/06 08:50:44 INFO DAGScheduler: waiting: Set(Stage 12)
15/03/06 08:50:44 INFO DAGScheduler: failed: Set()
15/03/06 08:50:44 INFO DAGScheduler: Missing parents for Stage 12: List()
15/03/06 08:50:44 INFO DAGScheduler: Submitting Stage 12 (ShuffledRDD[29] at sortByKey at <console>:20), which is now runnable
15/03/06 08:50:44 INFO MemoryStore: ensureFreeSpace(2304) called with curMem=344780, maxMem=277842493
15/03/06 08:50:44 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 2.3 KB, free 264.6 MB)
15/03/06 08:50:44 INFO DAGScheduler: Submitting 2 missing tasks from Stage 12 (ShuffledRDD[29] at sortByKey at <console>:20)
15/03/06 08:50:44 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
15/03/06 08:50:44 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 22, localhost, PROCESS_LOCAL, 948 bytes)
15/03/06 08:50:44 INFO TaskSetManager: Starting task 1.0 in stage 12.0 (TID 23, localhost, PROCESS_LOCAL, 948 bytes)
15/03/06 08:50:45 INFO Executor: Running task 1.0 in stage 12.0 (TID 23)
15/03/06 08:50:45 INFO Executor: Running task 0.0 in stage 12.0 (TID 22)
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/03/06 08:50:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
(LOGIN,2)
(MERGING,1)
(MUCH,1)
(NEED,1)
(NOT,1)
(PREVENT,1)
(RESERVED,1)
(SCRIPT,1)
(SETS,1)
(SETUP,1)
(SHELL,2)
(SYSTEM,2)
(THE,1)
(THEN,8)
(THIS,3)
(THRESHOLD,1)
(TO,5)
(UIDGID,1)
(UNLESS,1)
(UNSET,2)
(USER,1)
(WE,1)
(WIDE,1)
(WILL,1)
(YOU,3)
(YOUR,1)
15/03/06 08:50:45 INFO Executor: Finished task 1.0 in stage 12.0 (TID 23). 826 bytes result sent to driver
15/03/06 08:50:45 INFO TaskSetManager: Finished task 1.0 in stage 12.0 (TID 23) in 13 ms on localhost (1/2)
(,260)
(BETTER,1)
(BY,1)
(CHECK,1)
(COULD,1)
(CURRENT,1)
(CUSTOM,1)
(DO,1)
(DONE,1)
(ELSE,5)
(ENVIRONMENT,1)
(EXPORT,15)
(FI,8)
(FILE,2)
(FOR,5)
(FUNCTIONS,1)
(FUTURE,1)
(GET,1)
(GO,1)
(GOOD,1)
(HISTCONTROL,1)
(I,2)
(IF,8)
(IN,6)
(IS,1)
(IT,1)
(KNOW,1)
(KSH,1)
15/03/06 08:50:45 INFO Executor: Finished task 0.0 in stage 12.0 (TID 22). 826 bytes result sent to driver
15/03/06 08:50:45 INFO TaskSetManager: Finished task 0.0 in stage 12.0 (TID 22) in 27 ms on localhost (2/2)
15/03/06 08:50:45 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
15/03/06 08:50:45 INFO DAGScheduler: Stage 12 (foreach at <console>:21) finished in 0.025 s
15/03/06 08:50:45 INFO SparkContext: Job finished: foreach at <console>:21, took 0.07397057 s
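Two details of the printed result are worth noting: no word containing A or Z appears, and there is an entry for the empty string, (,260). Both follow from the filter. The comparisons ch>'A' and ch<'Z' are strict, so A and Z themselves are rejected, and forall is vacuously true for an empty string, so empty tokens produced by split("\\s") pass through. Below is a minimal sketch of the same steps on plain Scala collections, so it runs without a cluster. The name countWords and the sample lines are hypothetical, groupBy stands in for reduceByKey, and sortBy stands in for sortByKey.

```scala
// Plain-Scala sketch of the word-count steps (no Spark required).
// `countWords` is a hypothetical helper, not part of the original post.
def countWords(lines: Seq[String]): Seq[(String, Int)] =
  lines
    .flatMap(_.split("\\s+"))          // split on runs of whitespace
    .map(_.toUpperCase)
    // keep non-empty tokens made only of A..Z, bounds inclusive
    .filter(w => w.nonEmpty && w.forall(ch => ch >= 'A' && ch <= 'Z'))
    .groupBy(identity)                 // stands in for reduceByKey(_ + _)
    .map { case (word, occurrences) => (word, occurrences.size) }
    .toSeq
    .sortBy(_._1)                      // stands in for sortByKey()

// Made-up sample input, for illustration only.
val sample = Seq("export PATH and a zip", "if then  fi and")
countWords(sample).foreach(println)
```

In the Spark version the same two changes would apply to the filter step: use >= and <= for the bounds and add a non-empty check on the word.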
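The following code prints the names of the nodes that take part in the computation. Note that you need to run start-all first and pass the --master parameter to the shell: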
spark-shell --master spark://bluejoe0:7077
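The code is as follows (rdd is any existing RDD; the import provides the !! operator that runs an external command):
import scala.sys.process._
rdd.mapPartitions(_=>Array[String]("hostname".!!.trim).iterator, false).collect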
res28: Array[String] = Array(bluejoe4, bluejoe5)
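Running the hostname command starts an external process for every partition and depends on that command being available on each worker. A possible alternative, offered only as a sketch, is to ask the JVM for the host name instead. The helper localHostName is hypothetical, and the commented usage line assumes a spark-shell session where sc is predefined.

```scala
import java.net.InetAddress
import scala.util.Try

// Hypothetical helper: returns the host name of the JVM it runs in,
// falling back to "unknown" if the lookup fails.
def localHostName(): String =
  Try(InetAddress.getLocalHost.getHostName).getOrElse("unknown")

// On the driver this prints the local machine's name.
println(localHostName())

// Assumed usage inside spark-shell, one host name per partition:
// sc.parallelize(1 to 100, 4).mapPartitions(_ => Iterator(localHostName())).collect.distinct
```

The name returned by the JVM may differ from what the hostname command prints (for example a fully qualified name), so the two approaches are not guaranteed to give identical strings.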