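-cacheArchive also distributes from HDFS, but what it distributes is a compressed archive, which may contain several levels of directories and multiple files.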

1. The file The_Man_of_Property.txt is shown below (upload it to HDFS first):

hadoop fs -put The_Man_of_Property.txt  /mapreduce
Preface
“The Forsyte Saga” was the title originally destined for that part of it which is called “The Man of Property”; and to adopt it for the collected chronicles of the Forsyte family has indulged the Forsytean tenacity that is in all of us. The word Saga might be objected to on the ground that it connotes the heroic and that there is little heroism in these pages. But it is used with a suitable irony; and, after all, this long tale, though it may deal with folk in frock coats, furbelows, and a gilt-edged period, is not devoid of the essential heat of conflict. Discounting for the gigantic stature and blood-thirstiness of old days, as they have come down to us in fairy-tale and legend, the folk of the old Sagas were Forsytes, assuredly, in their possessive instincts, and as little proof against the inroads of beauty and passion as Swithin, Soames, or even Young Jolyon. And if heroic figures, in days that never were, seem to startle out from their surroundings in fashion unbecoming to a Forsyte of the Victorian era, we may be sure that tribal instinct was even then the prime force, and that “family” and the sense of home and property counted as they do to this day, for all the recent efforts to “talk them out.”
So many people have written and claimed that their families were the originals of the Forsytes that one has been almost encouraged to believe in the typicality of an imagined species. Manners change and modes evolve, and “Timothy’s on the Bayswater Road” becomes a nest of the unbelievable in all except essentials; we shall not look upon its like again, nor perhaps on such a one as James or Old Jolyon. And yet the figures of Insurance Societies and the utterances of Judges reassure us daily that our earthly paradise is still a rich preserve, where the wild raiders, Beauty and Passion, come stealing in, filching security from beneath our noses. As surely as a dog will bark at a brass band, so will the essential Soames in human nature ever rise up uneasily against the dissolution which hovers round the folds of ownership.
“Let the dead Past bury its dead” would be a better saying if the Past ever died. The persistence of the Past is one of those tragi-comic blessings which each new age denies, coming cocksure on to the stage to mouth its claim to a perfect novelty.
But no Age is so new as that! Human Nature, under its changing pretensions and clothes, is and ever will be very much of a Forsyte, and might, after all, be a much worse animal.
Looking back on the Victorian era, whose ripeness, decline, and ‘fall-of’ is in some sort pictured in “The Forsyte Saga,” we see now that we have but jumped out of a frying-pan into a fire. It would be difficult to substantiate a claim that the case of England was better in than it was in , when the Forsytes assembled at Old Jolyon’s to celebrate the engagement of June to Philip Bosinney. And in , when again the clan gathered to bless the marriage of Fleur with Michael Mont, the state of England is as surely too molten and bankrupt as in the eighties it was too congealed and low-percented. If these chronicles had been a really scientific study of transition one would have dwelt probably on such factors as the invention of bicycle, motor-car, and flying-machine; the arrival of a cheap Press; the decline of country life and increase of the towns; the birth of the Cinema. Men are, in fact, quite unable to control their own inventions; they at best develop adaptability to the new conditions those inventions create.
But this long tale is no scientific study of a period; it is rather an intimate incarnation of the disturbance that Beauty effects in the lives of men.
The figure of Irene, never, as the reader may possibly have observed, present, except through the senses of other characters, is a concretion of disturbing Beauty impinging on a possessive world.
One has noticed that readers, as they wade on through the salt waters of the Saga, are inclined more and more to pity Soames, and to think that in doing so they are in revolt against the mood of his creator. Far from it! He, too, pities Soames, the tragedy of whose life is the very simple, uncontrollable tragedy of being unlovable, without quite a thick enough skin to be thoroughly unconscious of the fact. Not even Fleur loves Soames as he feels he ought to be loved. But in pitying Soames, readers incline, perhaps, to animus against Irene: After all, they think, he wasn’t a bad fellow, it wasn’t his fault; she ought to have forgiven him, and so on!

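2. white_list1 and white_list2 serve as whitelists: count how many times each word in the whitelist files appears in The_Man_of_Property.txt (pack the two files into white.tar.gz beforehand and upload it to HDFS).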

white_list1 is as follows:

suitable
against
recent
across

white_list2 is as follows:

Age
on

Pack the files and upload the archive to HDFS:

tar czvf white.tar.gz white_list1 white_list2
hadoop fs -put white.tar.gz /mapreduce
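
The packing step, and the unpacking that later happens on each task node, can be checked locally. The following is a minimal sketch in Python 3 using the standard tarfile module. The function name build_and_unpack and the local ABC directory are illustrative assumptions that only mimic what the framework does with -cacheArchive; this is not Hadoop's own code.

```python
import os
import tarfile
import tempfile


def build_and_unpack(workdir):
    """Create white.tar.gz from two whitelist files, then unpack it into ABC/."""
    lists = {
        "white_list1": "suitable\nagainst\nrecent\nacross\n",
        "white_list2": "Age\non\n",
    }
    # Write the two whitelist files, as in the post.
    for name, body in lists.items():
        with open(os.path.join(workdir, name), "w") as fh:
            fh.write(body)

    # Equivalent of: tar czvf white.tar.gz white_list1 white_list2
    archive = os.path.join(workdir, "white.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for name in sorted(lists):
            tar.add(os.path.join(workdir, name), arcname=name)

    # Assumption: stand-in for the unpacked directory a task sees as "ABC".
    target = os.path.join(workdir, "ABC")
    os.makedirs(target)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return sorted(os.listdir(target))


if __name__ == "__main__":
    print(build_and_unpack(tempfile.mkdtemp()))
```

If the listing contains both whitelist files, the archive has the flat layout the mapper expects; nested directories inside the archive would also work, because the mapper walks the directory recursively.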

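3. The map function code is as follows. Approach: (1) walk the directory to collect the paths of all files; (2) read the contents of the white_list files; (3) filter the input words against them.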

#!/usr/bin/python
import os
import sys


def read_dir_file(file_dir, dir):
    # Recursively collect the path of every regular file under dir.
    for f1 in os.listdir(dir):
        tmp_path = os.path.join(dir, f1)
        if not os.path.isdir(tmp_path):
            file_dir.append(tmp_path)
        else:
            read_dir_file(file_dir, tmp_path)
    return file_dir


def read_local_file(file_dir):
    # Load every non-empty line of the whitelist files into a set.
    word_set = set()
    for file in file_dir:
        with open(file, 'r') as file_in:
            for line in file_in:
                word = line.strip()
                if word:
                    word_set.add(word)
    return word_set


def mapper_func(dir):
    # dir is the symlink name given after '#' in -cacheArchive.
    file_dir = read_dir_file([], dir)
    word_set = read_local_file(file_dir)
    for line in sys.stdin:
        for word in line.strip().split():
            if word in word_set:
                # Emit a count of 1 so the reducer has a value to sum.
                print("%s\t%s" % (word, 1))


if __name__ == "__main__":
    # Usage: python map.py mapper_func ABC
    func = getattr(sys.modules[__name__], sys.argv[1])
    args = sys.argv[2:]
    func(*args)
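
Before submitting the job, the whitelist loading and filtering logic can be exercised locally. The sketch below is an assumption-laden variant rather than the post's own script: it uses os.walk instead of explicit recursion, and the helper names load_whitelist, map_line and demo are hypothetical. It builds a temporary directory with one nested level to imitate an unpacked archive.

```python
import os
import tempfile


def load_whitelist(root):
    """Collect every non-empty, stripped line from all files under root."""
    words = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as fh:
                words.update(line.strip() for line in fh if line.strip())
    return words


def map_line(line, words):
    """Return (word, 1) pairs for the whitelisted tokens of one input line."""
    return [(tok, 1) for tok in line.split() if tok in words]


def demo():
    # Hypothetical layout: one list at the top level, one in a subdirectory.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "nested"))
    with open(os.path.join(root, "white_list1"), "w") as fh:
        fh.write("suitable\nagainst\nrecent\nacross\n")
    with open(os.path.join(root, "nested", "white_list2"), "w") as fh:
        fh.write("Age\non\n")
    words = load_whitelist(root)
    return words, map_line("But no Age is so new as that! on and on", words)


if __name__ == "__main__":
    print(demo())
```

Note that matching is exact and case-sensitive, and tokens are split on whitespace only, so "that!" does not match a whitelist entry "that".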

4. The reduce-side code is as follows:

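#!/usr/bin/python
import sys


def reducer_func():
    # Input arrives sorted by key, so equal words are adjacent.
    word = None
    total = 0
    for line in sys.stdin:
        ss = line.split()
        if len(ss) != 2:
            continue  # skip malformed records
        cur_word = ss[0]
        cnt = int(ss[1])
        if cur_word != word:
            if word is not None:
                print("%s\t%s" % (word, total))
            word = cur_word
            total = 0
        # Always add the current count, including the first record of a word.
        total += cnt
    # Flush the last word, if any input was seen.
    if word is not None:
        print("%s\t%s" % (word, total))


if __name__ == "__main__":
    # Usage: python red.py reducer_func
    func = getattr(sys.modules[__name__], sys.argv[1])
    args = sys.argv[2:]
    func(*args)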
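
The same aggregation over sorted records can also be written with itertools.groupby. This is a minimal alternative sketch, not the reducer used in the job above; the function name reduce_lines is hypothetical, and it assumes each record has the form word, tab, count and that records are already sorted by word.

```python
from itertools import groupby


def reduce_lines(lines):
    """Sum the counts per word from sorted 'word<TAB>count' lines."""
    # Split each non-blank record into [word, count].
    pairs = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby merges runs of adjacent records that share the same word.
    return [
        (word, sum(int(cnt) for _w, cnt in group))
        for word, group in groupby(pairs, key=lambda p: p[0])
    ]


if __name__ == "__main__":
    sample = ["Age\t1\n", "against\t1\n", "against\t1\n", "on\t1\n"]
    for word, total in reduce_lines(sample):
        print("%s\t%s" % (word, total))
```

groupby only merges adjacent equal keys, which is why the sorted order delivered to the reducer matters.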

5. The run script run.sh is as follows:

HADOOP="/usr/local/src/hadoop-1.2.1/bin/hadoop"
HADOOP_STREAMING="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH="/mapreduce/The_Man_of_Property.txt"
OUTPUT_PATH="/mapreduce/out"
# Remove the previous output directory; the job fails if it already exists.
$HADOOP fs -rmr $OUTPUT_PATH
# "#ABC" names the symlink to the unpacked archive in each task's working directory.
$HADOOP jar $HADOOP_STREAMING \
    -input "$INPUT_PATH" \
    -output "$OUTPUT_PATH" \
    -mapper "python map.py mapper_func ABC" \
    -reducer "python red.py reducer_func" \
    -file "./map.py" \
    -file "./red.py" \
    -cacheArchive "hdfs://master:9000/mapreduce/white.tar.gz#ABC"
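
The -cacheArchive value has two parts: the HDFS location of the archive and, after the '#', the name of the symlink through which each task sees the unpacked directory. That name (ABC here) is what the mapper receives as its argument. A small sketch in Python 3 that separates the two parts, using the standard urllib.parse module; the helper name split_cache_uri is hypothetical:

```python
from urllib.parse import urlsplit


def split_cache_uri(uri):
    """Return (archive path on HDFS, symlink name) for a -cacheArchive URI."""
    parts = urlsplit(uri)
    return parts.path, parts.fragment


if __name__ == "__main__":
    path, link = split_cache_uri("hdfs://master:9000/mapreduce/white.tar.gz#ABC")
    print(path)
    print(link)
```

The symlink name in the URI and the directory argument in the -mapper command must agree, otherwise the mapper will not find the whitelist files.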
