Web Intelligence and Big Data
by Dr. Gautam Shroff

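This course is about big-data processing. This week's homework is the first programming assignment: compute statistics over text data with Map-Reduce. The tool used here is the lightweight mincemeat.

Points to note: use a regular expression to match words. When the job finishes, sort the results by author name and then by frequency (a two-key sort) and write them to a file. The homework quiz has a two-minute time limit, so the answers have to be looked up in the output on the spot.

The assignment is stated as follows: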

Download data files bundled as a .zip file from hw3data.zip

Each file in this archive contains entries that look like:

journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.

that represent bibliographic information about publications, formatted as follows:

paper-id:::author1::author2::…. ::authorN:::title
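The record format above can be split with two plain string splits. The following is a minimal sketch, not part of the assignment; the helper name `parse_record` is made up for illustration, and it assumes that `:::` never appears inside a title.

```python
def parse_record(line):
    """Split one bibliographic line into (paper_id, authors, title).

    Returns None when the line does not have exactly three
    ':::'-separated fields (for example, a blank line).
    """
    parts = line.strip().split(':::')
    if len(parts) != 3:
        return None
    paper_id, author_field, title = parts
    # Authors are separated from each other by a double colon.
    authors = author_field.split('::')
    return paper_id, authors, title


sample = ("journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo"
          ":::Programmer-Defined Control Abstractions in Modula-2.")
paper_id, authors, title = parse_record(sample)
print(paper_id)  # journals/cl/SantoNR90
print(authors)   # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']
print(title)     # Programmer-Defined Control Abstractions in Modula-2.
```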

Your task is to compute how many times every term occurs across titles, for each author.

For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.

Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
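The normalization rules in the paragraph above can be sketched as a small function. This is one possible reading of the rules, not the official solution: hyphens are turned into spaces, every other non-alphanumeric character is deleted, and single-letter words and stop-words are dropped. `STOP_WORDS` is a tiny hypothetical stand-in for the list in stopwords.py, and `title_terms` is an illustrative name.

```python
import re

# Hypothetical stand-in for the list in stopwords.py (the real list is longer).
STOP_WORDS = {'the', 'of', 'and', 'in', 'on', 'to', 'for', 'an'}


def title_terms(title, stop_words=STOP_WORDS):
    """Return the list of counted terms in a title, in order of appearance."""
    # Treat hyphens as word separators, so "map-reduce" -> "map reduce".
    text = title.lower().replace('-', ' ')
    # Delete everything that is not an ASCII letter, a digit, or whitespace.
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Drop single-letter words and stop-words; no stemming is applied.
    return [w for w in text.split() if len(w) > 1 and w not in stop_words]


print(title_terms("On the Power of Map-Reduce Programs."))
# ['power', 'map', 'reduce', 'programs']
```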

The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.

These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.

I strongly recommend mincemeat.py, which is much faster than octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar.

Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.

Sample input:

conf/fc/KravitzG99:::David W. Kravitz::David M. Goldschlag:::Conditional Access Concepts and Principles.

conf/fc/Moskowitz01:::Scott Moskowitz:::A Solution to the Napster Phenomenon: Why Value Cannot Be Created Absent the Transfer of Subjective Data.

conf/fc/BellareNPS01:::Mihir Bellare::Chanathip Namprempre::David Pointcheval::Michael Semanko:::The Power of RSA Inversion Oracles and the Security of Chaum's RSA-Based Blind Signature Scheme.

conf/fc/Kocher98:::Paul C. Kocher:::On Certificate Revocation and Validation.

conf/ep/BertiDM98:::Laure Berti::Jean-Luc Damoiseaux::Elisabeth Murisasco:::Combining the Power of Query Languages and Search Engines for On-line Document and Information Retrieval : The QIRi@D Environment.

conf/ep/LouS98:::Qun Lou::Peter Stucki:::Funfamentals of 3D Halftoning.

conf/ep/Mather98:::Laura A. Mather:::A Linear Algebra Approach to Language Identification.

conf/ep/BallimCLV98:::Afzal Ballim::Giovanni Coray::A. Linden::Christine Vanoirbeek:::The Use of Automatic Alignment on Structured Multilingual Documents.

conf/ep/ErdenechimegMN98:::Myatav Erdenechimeg::Richard Moore::Yumbayar Namsrai:::On the Specification of the Display of Documents in Multi-lingual Computing.

conf/ep/VercoustreP98:::Anne-Marie Vercoustre::François Paradis:::Reuse of Linked Documents through Virtual Document Prescriptions.

conf/ep/CruzBMW98:::Isabel F. Cruz::Slava Borisov::Michael A. Marks::Timothy R. Webb:::Measuring Structural Similarity Among Web Documents: Preliminary Results.

conf/er/Hohenstein89:::Uwe Hohenstein:::Automatic Transformation of an Entity-Relationship Query Language into SQL.

conf/er/NakanishiHT01:::Yoshihiro Nakanishi::Tatsuo Hirose::Katsumi Tanaka:::Modeling and Structuring Multiple Perspective Video for Browsing.

conf/er/Sciore91:::Edward Sciore:::Abbreviation Techniques in Entity-Relationship Query Languages.

conf/er/Chen79:::Peter P. Chen:::Recent Literature on the Entity-Relationship Approach.

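To run the job, open two terminals. The first command starts a mincemeat worker that connects to localhost, and the second runs the server script: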

python mincemeat.py -p pwd localhost

python hw3.py

The code of hw3.py is:

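import glob
import operator

import mincemeat

all_filepaths = glob.glob('hw3data/*')

def file_contents(filename):
    # Read one data file fully and make sure it gets closed.
    f = open(filename)
    try:
        return f.read()
    finally:
        f.close()

# Map each file name to its contents; every file is one input to the mapper.
datasource = dict((filename, file_contents(filename)) for filename in all_filepaths)

def my_mapper(key, value):
    # Imports are done inside the function because the mapper runs in the worker process.
    from stopwords import allStopWords
    import re
    for line in value.splitlines():
        allThree = line.split(':::')
        if len(allThree) != 3:
            continue  # skip blank or malformed lines
        # Hyphens become spaces ("map-reduce" -> "map reduce"); other punctuation is dropped.
        title = re.sub(r'[^\s0-9a-zA-Z]+', '', allThree[2].replace('-', ' '))
        for author in allThree[1].split('::'):
            for word in title.split():
                tmpWord = word.lower()
                if len(tmpWord) <= 1 or tmpWord in allStopWords:
                    continue
                yield (author, tmpWord), 1

def my_reducer(key, value):
    # value is the list of 1s emitted for this (author, term) pair.
    return sum(value)

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = my_mapper
s.reducefn = my_reducer

results = s.run_server(password="pwd")
print(results)

# Flatten to (author, term, count) and sort by author, then by count.
resList = [(author, term, count) for (author, term), count in results.items()]
sorted_results = sorted(resList, key=operator.itemgetter(0, 2))

with open('output.txt', 'w') as f:
    for (a, b, c) in sorted_results:
        f.write(a + ' *** ' + b + ' *** ' + str(c) + '\n')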
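Before starting the server and a worker, it can help to check the mapper and reducer logic in a single process. The sketch below does not use mincemeat at all: `run_local` is a hypothetical helper that imitates the map, shuffle, and reduce phases with a dictionary, and `toy_mapper`, `toy_reducer`, and `toy_data` are simplified, made-up stand-ins (no stop-word filtering or punctuation handling).

```python
from collections import defaultdict


def run_local(datasource, mapfn, reducefn):
    """Run map, shuffle, and reduce in one process (no mincemeat server)."""
    grouped = defaultdict(list)
    # Map phase: call the mapper on every (key, value) input pair.
    for in_key, in_value in datasource.items():
        for out_key, out_value in mapfn(in_key, in_value):
            # Shuffle: collect all values emitted under the same key.
            grouped[out_key].append(out_value)
    # Reduce phase: collapse each key's list of values into one result.
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())


def toy_mapper(key, value):
    # Simplified mapper: emits ((author, word), 1) for every title word.
    for line in value.splitlines():
        parts = line.split(':::')
        if len(parts) != 3:
            continue
        for author in parts[1].split('::'):
            for word in parts[2].lower().split():
                yield (author, word), 1


def toy_reducer(key, values):
    return sum(values)


# Made-up records in the assignment's format, for illustration only.
toy_data = {
    'file1': 'id1:::Ann Lee::Bob Ray:::graph search\n'
             'id2:::Ann Lee:::graph theory',
}
counts = run_local(toy_data, toy_mapper, toy_reducer)
print(counts[('Ann Lee', 'graph')])  # 2
```

The same helper should also accept functions shaped like `my_mapper` and `my_reducer`, provided stopwords.py is importable, since they follow the same (key, value) calling convention.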

A sample of the output:

Stephen L. Bloom *** scalar *** 1

Stephen L. Bloom *** concatenation *** 1

Stephen L. Bloom *** point *** 1

Stephen L. Bloom *** varieties *** 1

Stephen L. Bloom *** observation *** 1

Stephen L. Bloom *** equivalence *** 1

Stephen L. Bloom *** axioms *** 1

Stephen L. Bloom *** languages *** 1

Stephen L. Bloom *** logical *** 1

Stephen L. Bloom *** algebras *** 1

Stephen L. Bloom *** equations *** 1

Stephen L. Bloom *** number *** 1

Stephen L. Bloom *** vector *** 1

Stephen L. Bloom *** polynomial *** 1

Stephen L. Bloom *** solving *** 1

Stephen L. Bloom *** equational *** 1

Stephen L. Bloom *** axiomatizing *** 1

Stephen L. Bloom *** characterization *** 1

Stephen L. Bloom *** regular *** 2

Stephen L. Bloom *** sets *** 2

Stephen L. Bloom *** iteration *** 3

Stephen L. Lieman *** unacceptable *** 1

Stephen L. Lieman *** correcting *** 1

Stephen L. Lieman *** never *** 1

Stephen L. Lieman *** powerful *** 1

Stephen L. Lieman *** accept *** 1
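Since the follow-up quiz asks for things like the top terms of a given author, a small lookup over the output lines is handy. The following is a sketch that assumes the `author *** term *** count` line format shown above; `top_terms` is an illustrative name, and the sample lines are copied from the output excerpt.

```python
def top_terms(lines, author, n=3):
    """Return the n most frequent (term, count) pairs for one author."""
    rows = []
    for line in lines:
        parts = line.rstrip('\n').split(' *** ')
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, term, count = parts
        if name == author:
            rows.append((term, int(count)))
    # Highest count first; ties are broken alphabetically by term.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows[:n]


sample_lines = [
    'Stephen L. Bloom *** solving *** 1',
    'Stephen L. Bloom *** regular *** 2',
    'Stephen L. Bloom *** sets *** 2',
    'Stephen L. Bloom *** iteration *** 3',
    'Stephen L. Lieman *** never *** 1',
]
print(top_terms(sample_lines, 'Stephen L. Bloom'))
# [('iteration', 3), ('regular', 2), ('sets', 2)]
```

With the real file, the open file object can be passed directly, for example `top_terms(open('output.txt'), 'Stephen L. Bloom')`.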
