大数据python词频统计之hdfs分发-cacheArchive
-cacheArchive也是从hdfs上进分发,但是分发文件是一个压缩包,压缩包内可能会包含多层目录多个文件
1.The_Man_of_Property.txt文件如下(将其上传至hdfs上)
hadoop fs -put The_Man_of_Property.txt /mapreduce
Preface
“The Forsyte Saga” was the title originally destined for that part of it which is called “The Man of Property”; and to adopt it for the collected chronicles of the Forsyte family has indulged the Forsytean tenacity that is in all of us. The word Saga might be objected to on the ground that it connotes the heroic and that there is little heroism in these pages. But it is used with a suitable irony; and, after all, this long tale, though it may deal with folk in frock coats, furbelows, and a gilt-edged period, is not devoid of the essential heat of conflict. Discounting for the gigantic stature and blood-thirstiness of old days, as they have come down to us in fairy-tale and legend, the folk of the old Sagas were Forsytes, assuredly, in their possessive instincts, and as little proof against the inroads of beauty and passion as Swithin, Soames, or even Young Jolyon. And if heroic figures, in days that never were, seem to startle out from their surroundings in fashion unbecoming to a Forsyte of the Victorian era, we may be sure that tribal instinct was even then the prime force, and that “family” and the sense of home and property counted as they do to this day, for all the recent efforts to “talk them out.”
So many people have written and claimed that their families were the originals of the Forsytes that one has been almost encouraged to believe in the typicality of an imagined species. Manners change and modes evolve, and “Timothy’s on the Bayswater Road” becomes a nest of the unbelievable in all except essentials; we shall not look upon its like again, nor perhaps on such a one as James or Old Jolyon. And yet the figures of Insurance Societies and the utterances of Judges reassure us daily that our earthly paradise is still a rich preserve, where the wild raiders, Beauty and Passion, come stealing in, filching security from beneath our noses. As surely as a dog will bark at a brass band, so will the essential Soames in human nature ever rise up uneasily against the dissolution which hovers round the folds of ownership.
“Let the dead Past bury its dead” would be a better saying if the Past ever died. The persistence of the Past is one of those tragi-comic blessings which each new age denies, coming cocksure on to the stage to mouth its claim to a perfect novelty.
But no Age is so new as that! Human Nature, under its changing pretensions and clothes, is and ever will be very much of a Forsyte, and might, after all, be a much worse animal.
Looking back on the Victorian era, whose ripeness, decline, and ‘fall-of’ is in some sort pictured in “The Forsyte Saga,” we see now that we have but jumped out of a frying-pan into a fire. It would be difficult to substantiate a claim that the case of England was better in than it was in , when the Forsytes assembled at Old Jolyon’s to celebrate the engagement of June to Philip Bosinney. And in , when again the clan gathered to bless the marriage of Fleur with Michael Mont, the state of England is as surely too molten and bankrupt as in the eighties it was too congealed and low-percented. If these chronicles had been a really scientific study of transition one would have dwelt probably on such factors as the invention of bicycle, motor-car, and flying-machine; the arrival of a cheap Press; the decline of country life and increase of the towns; the birth of the Cinema. Men are, in fact, quite unable to control their own inventions; they at best develop adaptability to the new conditions those inventions create.
But this long tale is no scientific study of a period; it is rather an intimate incarnation of the disturbance that Beauty effects in the lives of men.
The figure of Irene, never, as the reader may possibly have observed, present, except through the senses of other characters, is a concretion of disturbing Beauty impinging on a possessive world.
One has noticed that readers, as they wade on through the salt waters of the Saga, are inclined more and more to pity Soames, and to think that in doing so they are in revolt against the mood of his creator. Far from it! He, too, pities Soames, the tragedy of whose life is the very simple, uncontrollable tragedy of being unlovable, without quite a thick enough skin to be thoroughly unconscious of the fact. Not even Fleur loves Soames as he feels he ought to be loved. But in pitying Soames, readers incline, perhaps, to animus against Irene: After all, they think, he wasn’t a bad fellow, it wasn’t his fault; she ought to have forgiven him, and so on!
2.white_list1与white_list2做为白名单,找出白名单文件中单词在The_Man_of_Property.tx中出现的次数(实现将2个文件打包为white.tar.gz,上传至hdfs上)
white_list1如下:
suitable
against
recent
across
white_list2如下:
Age
on
打包并上传至hdfs:
tar czvf white.tar.gz white_list1 white_list2
hadoop fs -put white.tar.gz /mapreduce
map函数代码如下:思路(1.遍历找到所有文件的路径,2.读取white_list文件内容;3.进行过滤)
#!usr/bin/python
import sys
import os
def read_dir_file(file_dir,dir):
fs = os.listdir(dir)
for f1 in fs:
tmp_path=os.path.join(dir,f1)
if not os.path.isdir(tmp_path):
file_dir.append(tmp_path)
else:
read_dir_file(file_dir,tmp_path)
return file_dir
def read_local_file(file_dir):
word_set = set()
for file in file_dir:
file_in = open (file,'r')
for line in file_in:
word = line.strip()
word_set.add(word)
return word_set
def mapper_func(dir):
file_dir=[]
file_dir=read_dir_file(file_dir,dir)
word_set=read_local_file(file_dir)
for line in sys.stdin:
ss=line.strip().split()
for word in ss:
word.strip()
if word != "" and (word in word_set):
print "%s\t%s"%(word,"")
if __name__ == "__main__":
func = getattr(sys.modules[__name__],sys.argv[1])
args = None
if len(sys.argv) > 1:
args = sys.argv[2:]
func(*args)
4.reduce端代码如下:
#!usr/bin/python
import sys
def reducer_func():
word="None"
sum=0
for line in sys.stdin:
ss=line.split()
cur_word=ss[0]
cnt=int(ss[1])
if cur_word!=word:
if word!="None":
print "%s\t%s"%(word,sum)
word=cur_word
sum=0
else:
sum+=cnt
print "%s\t%s"%(word,sum)
if __name__ == "__main__":
func = getattr(sys.modules[__name__],sys.argv[1])
args = None
if len(sys.argv) > 1:
args=sys.argv[2:]
func(*args)
5.运行脚本run.sh如下:
HADOOP="/usr/local/src/hadoop-1.2.1/bin/hadoop"
HADOOP_STREAMING="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH="/mapreduce/The_Man_of_Property.txt"
OUTPUT_PATH="/mapreduce/out"
$HADOOP fs -rmr $OUTPUT_PATH
$HADOOP jar $HADOOP_STREAMING \
-input "$INPUT_PATH" \
-output "$OUTPUT_PATH" \
-mapper "python map.py mapper_func ABC" \
-reducer "python red.py reducer_func" \
-file "./map.py"\
-file "./red.py"\
-cacheArchive "hdfs://master:9000/mapreduce/white.tar.gz#ABC"
大数据python词频统计之hdfs分发-cacheArchive的更多相关文章
- 大数据python词频统计之hdfs分发-cacheFile
-cacheFile 分发,文件事先上传至Hdfs上,分发的是一个文件 1.找一篇文章The_Man_of_Property.txt: He was proud of him! He could no ...
- 大数据python词频统计之本地分发-file
统计某几个词在文章出现的次数 -file参数分发,是从客户端分发到各个执行mapreduce端的机器上 1.找一篇文章The_Man_of_Property.txt如下: He was proud o ...
- Python 词频统计
利用Python做一个词频统计 GitHub地址:FightingBob [Give me a star , thanks.] 词频统计 对纯英语的文本文件[Eg: 瓦尔登湖(英文版).txt]的英文 ...
- 大数据Python学习大纲
最近公司在写一个课程<大数据运维实训课>,分为4个部分,linux实训课.Python开发.hadoop基础知识和项目实战.这门课程主要针对刚从学校毕业的学生去应聘时不会像一个小白菜一样被 ...
- python词频统计及其效能分析
1) 博客开头给出自己的基本信息,格式建议如下: 学号2017****7128 姓名:肖文秀 词频统计及其效能分析仓库:https://gitee.com/aichenxi/word_frequenc ...
- 大数据学习(一)-------- HDFS
需要精通java开发,有一定linux基础. 1.简介 大数据就是对海量数据进行数据挖掘. 已经有了很多框架方便使用,常用的有hadoop,storm,spark,flink等,辅助框架hive,ka ...
- 大数据学习之旅1——HDFS版本演化
最近开始学习大数据,发现大数据有很多很多组件,我现在负责的是HDFS(Hadoop分布式储存系统)的学习,整理了一下HDFS的版本情况.因为HDFS是Hadoop的重要组成部分,所以有关HDFS的版本 ...
- 大数据谢列3:Hdfs的HA实现
在之前的文章:大数据系列:一文初识Hdfs , 大数据系列2:Hdfs的读写操作 中Hdfs的组成.读写有简单的介绍. 在里面介绍Secondary NameNode和Hdfs读写的流程. 并且在文章 ...
- 大数据学习(02)——HDFS入门
Hadoop模块 提到大数据,Hadoop是一个绕不开的话题,我们来看看Hadoop本身包含哪些模块. Common是基础模块,这个是必须用的.剩下常用的就是HDFS和YARN. MapReduce现 ...
随机推荐
- redis的常用命令及实例讲解
使用命令行操作redis 数据类型 字符串String 列表list 使用双向循序链表实现(LinkedList) 散列 Hash 一般应用于将redis作为分布式缓存,存储数据库中的数据对象 集合s ...
- mybatis-configxml样板
<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE configuration PUBLIC & ...
- PHP中的会话控制—session和cookie(实现数据传值功能)
1.session 登录上一个页面以后,长时间没有操作,刷新页面以后需要重新登录. 特点:(1)session是存储在服务器: (2)session每个人(登陆者)存一份: (3)session ...
- steps/train_mono.sh
<<单音素HMM的训练流程图.vsdx>> 定义拓扑结构.参数初始化 $ gmm-init-mono --shared-phones=$lang/phones/sets.int ...
- 1Mybatis入门--1.1单独使用jdbc编程问题总结
1.1.1 jdbc程序 Public static void main(String[] args) { Connection connection = null; PreparedStatemen ...
- java--GC Root有哪些
GC管理的主要区域是Java堆,一般情况下只针对堆进行垃圾回收.方法区.JVM栈和Native栈不被GC所管理,因而选择这些非堆区的对象作为GC roots,被GC roots引用的对象不被GC回收. ...
- linux下中文乱码问题解决
1.首先输入locale,查看编码设置 2.是否安装中文支持,没有则安装中文语言支持 方法一:yum方式——完全的中文环境支持. #yum groupinstall chinese-support ...
- 使用WinIo32绕过密码控件实现自动登录
通过winIO32绕过密码控件,实现自动登录 环境: vmware上安装windows 32位系统:windows xp / windows 7 selenium版本: 3.11.0 IEDriver ...
- 使用SQL*Plus连接数据库
About SQL*Plus SQL*Plus is the primary command-line interface to your Oracle database. You use SQL*P ...
- k64 datasheet学习笔记11---Port Control and Interrupts (PORT)
1.前言 Port Control and Interrupt (PORT) 模块提供了port control,digital filtering,和外部中断功能 每个pin的大部分功能可被独立配 ...